Slack squeeze coded computing for adaptive straggler mitigation

Authors:
Krishna Giri Narra

University of Southern California

University of Southern California
View Profile

,
Zhifeng Lin

University of Southern California

University of Southern California
View Profile

,
Mehrdad Kiamari

University of Southern California

University of Southern California
View Profile

,
Salman Avestimehr

University of Southern California

University of Southern California
View Profile

,
Murali Annavaram

University of Southern California

University of Southern California
View Profile

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and AnalysisNovember 2019Article No.: 14Pages 1–16https://doi.org/10.1145/3295500.3356170

Published:17 November 2019Publication History

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 1–16

ABSTRACT

While performing distributed computations in today's cloud-based platforms, execution speed variations among compute nodes can significantly reduce the performance and create bottlenecks like stragglers. Coded computation techniques leverage coding theory to inject computational redundancy and mitigate stragglers in distributed computations. In this paper, we propose a dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation (S²C²). S²C² squeezes the compute slack (i.e., overhead) that is built into the coded computing frameworks by efficiently assigning work for all fast and slow nodes according to their speeds and without needing to re-distribute data. We implement an LSTM-based speed prediction algorithm to predict speeds of compute nodes. We evaluate S²C² on linear algebraic algorithms, gradient descent, graph ranking, and graph filtering algorithms. We demonstrate 19% to 39% reduction in total computation latency using S²C² compared to job replication and coded computation. We further show how S²C² can be applied beyond matrix-vector multiplication.

References

2000. CS Toronto Datasets. http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html, last accessed on 08/01/2017.Google Scholar
2014. Apache Hadoop. http://hadoop.apache.org/, last accessed on 08/01/2017.Google Scholar
2019. Digital Ocean. https://www.digitalocean.com/, last accessed on 08/01/2019.Google Scholar
Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, Lukasz Wesolowski, and Laxmikant Kale. 2014. Parallel Programming with Migratable Objects: Charm++ in Practice (SC).Google Scholar
Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2013. Effective Straggler Mitigation: Attack of the Clones. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (nsdi'13). USENIX Association, Berkeley, CA, USA, 185--198. http://dl.acm.org/citation.cfm?id=2482626.2482645Google ScholarDigital Library
Ganesh Ananthanarayanan, Srikanth Kandula, Albert G Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the Outliers in Map-Reduce Clusters using Mantri.. In OSDI, Vol. 10. 24.Google ScholarDigital Library
Manmohan Chaubey and Erik Saule. 2015. Replicated Data Placement for Uncertain Scheduling. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW '15). IEEE Computer Society, Washington, DC, USA, 464--472. Google ScholarDigital Library
Jeffrey Dean and Luiz AndrÃl' Barroso. 2013. The Tail at Scale. Commun. ACM 56 (2013), 74--80. http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltextGoogle ScholarDigital Library
Christina Delimitrou and Christos Kozyrakis. 2014. Quasar: Resource-efficient and QoS-aware Cluster Management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 127--144. Google ScholarDigital Library
Peter A. Dinda. 2001. Online Prediction of the Running Time of Tasks. In Proceedings of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '01). ACM, New York, NY, USA, 336--337. Google ScholarDigital Library
Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. 2016. Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products. In Advances In Neural Information Processing Systems. 2092--2100.Google Scholar
Kristen Gardner, Samuel Zbarsky, Sherwin Doroudi, Mor Harchol-Balter, and Esa Hyytia. 2015. Reducing Latency via Redundant Requests: Exact Analysis. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '15). ACM, New York, NY, USA, 347--360. Google ScholarDigital Library
Inigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen. 2015. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 383--397. Google ScholarDigital Library
Vipul Gupta, Shusen Wang, Thomas Courtade, and Kannan Ramchandran. 2018. OverSketch: Approximate Matrix Multiplication for the Cloud. arXiv preprint arXiv:1811.02653 (2018).Google Scholar
Raymond Hill. 1986. A First Course in Coding Theory. Clarendon Press. https://books.google.com/books?id=UTxjBX9lKoMCGoogle Scholar
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735--1780. Google ScholarDigital Library
Chang-Hong Hsu, Yunqi Zhang, Michael A. Laurenzano, David Meisner, Thomas Wenisch, Ronald G. Dreslinski, Jason Mars, and Lingjia Tang. 2017. Reining in Long Tails in Warehouse-Scale Computers with Quick Voltage Boosting Using Adrenaline. ACM Trans. Comput. Syst. 35, 1, Article 2 (March 2017), 33 pages. Google ScholarDigital Library
Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache Sharing with Strict Qos for Latency-critical Workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 729--742. Google ScholarDigital Library
Jack Kosaian, KV Rashmi, and Shivaram Venkataraman. 2019. Parity Models: A General Framework for Coding-Based Resilience in ML Inference. arXiv preprint arXiv:1905.00863 (2019).Google Scholar
J. Kosaian, K. V. Rashmi, and S. Venkataraman. 2018. Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation. ArXiv e-prints (June 2018). arXiv:cs.LG/1806.01259Google Scholar
Sanjeev Krishnan Laxmikant Kale. 1993. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In Proceedings of OOPSLA'93. ACM Press, 91--108.Google ScholarDigital Library
Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2016. Speeding up distributed machine learning using codes. In 2016 IEEE International Symposium on Information Theory (ISIT). 1143--1147. Google ScholarCross Ref
Kangwook Lee, Ramtin Pedarsani, and Kannan Ramchandran. 2017. On Scheduling Redundant Requests With Cancellation Overheads. IEEE/ACM Trans. Netw. 25, 2 (April 2017), 1279--1290. Google ScholarDigital Library
Jacob Leverich and Christos Kozyrakis. 2014. Reconciling High Server Utilization and Sub-millisecond Quality-of-service. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys '14). ACM, New York, NY, USA, Article 4, 14 pages. Google ScholarDigital Library
Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article 9, 14 pages. Google ScholarDigital Library
Songze Li, Mohammad Ali Maddah-Ali, and Amir Salman Avestimehr. 2015. Coded MapReduce. 53rd Allerton Conference (Sept. 2015).Google Scholar
Songze Li, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. 2016. A Unified Coding Framework for Distributed Computing with Straggling Servers. e-print arXiv:1609.01690 (Sept. 2016). A shorter version to appear in IEEE NetCod 2016.Google Scholar
Songze Li, Mohammad Ali Maddah-Ali, Qian Yu, and A Salman Avestimehr. 2016. A Fundamental Tradeoff between Computation and Communication in Distributed Computing. to appear in IEEE Transactions on Information Theory (2016).Google Scholar
Songze Li, Sucha Supittayapornpong, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. 2017. Coded Terasort. in proceedings of 2017 International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics (2017).Google ScholarCross Ref
M. Lichman. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/mlGoogle Scholar
David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. 2014. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 301--312. http://dl.acm.org/citation.cfm?id=2665671.2665718Google ScholarDigital Library
Qinyi Luo, Jinkun Lin, Youwei Zhuo, and Xuehai Qian. 2019. Hop: Heterogeneity-aware Decentralized Training. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 893--907.Google ScholarDigital Library
Krishna Giri Narra, Zhifeng Lin, Ganesh Ananthanarayanan, Salman Avestimehr, and Murali Annavaram. 2019. Collage Inference: Tolerating Stragglers in Distributed Neural Network Inference using Coding. arXiv preprint arXiv:1904.12222 (2019).Google Scholar
Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, and Amir Salman Avestimehr. 2017. Coded Computation over Heterogeneous Clusters. In 2017 IEEE International Symposium on Information Theory (ISIT).Google ScholarCross Ref
Paul Renteln. 2013. Manifolds, Tensors, and Forms: An Introduction for Mathematicians and Physicists. Cambridge University Press.Google Scholar
Nihar B. Shah, Kangwook Lee, and Kannan Ramchandran. 2013. When do redundant requests reduce latency?. In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). 731--738. Google ScholarCross Ref
Junhyun So, Basak Guker, Payman Mohassel, and Salman Avestimehr. 2019. CodedPrivateML: A Fast and Privacy-Preserving Framework for Distributed Machine Learning. arXiv preprint arXiv:1902.00641 (2019).Google Scholar
Rashish Tandon, Qi Lei, Alexandras G Dimakis, and Nikos Karampatziakis. 2016. Gradient Coding. arXiv preprint arXiv:1612.03301 (2016).Google Scholar
Da Wang, Gauri Joshi, and Gregory Wornell. 2014. Efficient Task Replication for Fast Response Times in Parallel Computation. SIGMETRICS Perform. Eval. Rev. 42, 1 (June 2014), 599--600. Google ScholarDigital Library
Rich Wolski, Neil Spring, and Chris Peterson. 1997. Implementing a Performance Forecasting System for Metacomputing: The Network Weather Service. In Proceedings of the 1997 ACM/IEEE Conference on Supercomputing (SC '97). ACM, New York, NY, USA, 1--19. Google ScholarDigital Library
Yaoqing Yang, Pulkit Grover, and Soummya Kar. 2017. Coding Method for Parallel Iterative Linear Solver. to appear Advances In Neural Information Processing Systems (NIPS) (2017).Google Scholar
Qian Yu, Mohammad Maddah-Ali, and A. Salman Avestimehr. 2017. Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication. In to appear Advances In Neural Information Processing Systems (NIPS).Google Scholar
Qian Yu, Netanel Raviv, Jinhyun So, and A Salman Avestimehr. 2019. Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy. in Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS) (2019).Google Scholar
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10--10. http://dl.acm.org/citation.cfm?id=1863103.1863113Google ScholarDigital Library
Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08). USENIX Association, Berkeley, CA, USA, 29--42. http://dl.acm.org/citation.cfm?id=1855741.1855744Google ScholarDigital Library
Yunqi Zhang, David Meisner, Jason Mars, and Lingjia Tang. 2016. Tread-mill: Attributing the Source of Tail Latency Through Precise Load Testing and Statistical Inference. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 456--468. Google ScholarDigital Library
Jingge Zhu, Ye Pu, Vipul Gupta, Claire Tomlin, and Kannan Ramchandran. 2017. A Sequential Approximation Framework for Coded Distributed Optimization. CoRR abs/1710.09001 (2017). arXiv:1710.09001 http://arxiv.org/abs/1710.09001Google Scholar
Timothy Zhu, Michael A. Kozuch, and Mor Harchol-Balter. 2017. WorkloadCompactor: Reducing Datacenter Cost While Providing Tail Latency SLO Guarantees. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC '17). ACM, New York, NY, USA, 598--610. Google ScholarDigital Library

Index Terms

Slack squeeze coded computing for adaptive straggler mitigation
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
  2. Dependable and fault-tolerant systems and networks
    1. Availability
    2. Redundancy

Recommendations

Timely-Throughput Optimal Coded Computing over Cloud Networks
Mobihoc '19: Proceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing

In modern distributed computing systems, unpredictable and unreliable infrastructures result in high variability of computing resources. Meanwhile, there is significantly increasing demand for timely and event-driven services with deadline constraints. ...
Read More
Hybrid gradient descent cuckoo search (HGDCS) algorithm for resource scheduling in IaaS cloud computing environment

Resource scheduling is a procedure for the distribution of resources over time to perform a required task and a decision making process in cloud computing. Optimal resource scheduling is a great challenge and considered to be an NP-hard problem due to ...
Read More
Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres
Abstract
Cloud computing systems are splitting compute- and data-intensive jobs into smaller tasks to execute them in a parallel manner using clusters to improve execution time. However, such systems at increasing scale are exposed to stragglers, whereby ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
General Chair:
Michela Taufer,
Program Chairs:
Pavan Balaji,
Antonio J. Peña
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 November 2019
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Badges
- Artifacts Available
Author Tags
cloud computing
coded computing
gradient descent
load balancing
machine learning
matrix multiplication
stragglers
tail latency
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,516of6,373submissions,24%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 17
  Total Citations
  View Citations
- 795
  Total Downloads
- Downloads (Last 12 months)104
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Slack squeeze coded computing for adaptive straggler mitigation

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

Timely-Throughput Optimal Coded Computing over Cloud Networks

Hybrid gradient descent cuckoo search (HGDCS) algorithm for resource scheduling in IaaS cloud computing environment

Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres