DOI: 10.1145/3295500.3356170

Slack squeeze coded computing for adaptive straggler mitigation

Published: 17 November 2019

ABSTRACT

While performing distributed computations in today's cloud-based platforms, execution speed variations among compute nodes can significantly reduce performance and create bottlenecks such as stragglers. Coded computation techniques leverage coding theory to inject computational redundancy and mitigate stragglers in distributed computations. In this paper, we propose a dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation (S2C2). S2C2 squeezes the compute slack (i.e., overhead) that is built into coded computing frameworks by efficiently assigning work to all fast and slow nodes according to their speeds, without needing to re-distribute data. We implement an LSTM-based speed prediction algorithm to predict the speeds of compute nodes. We evaluate S2C2 on linear algebraic algorithms, gradient descent, graph ranking, and graph filtering algorithms. We demonstrate a 19% to 39% reduction in total computation latency using S2C2 compared to job replication and coded computation. We further show how S2C2 can be applied beyond matrix-vector multiplication.
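To make the workload-assignment idea concrete, below is a minimal sketch, assuming an (n, k) MDS-style code over row blocks of the input matrix for matrix-vector multiplication. The function names, the random Gaussian generator matrix standing in for a real MDS code, and the wrap-around segment layout are illustrative assumptions rather than the authors' implementation; predicted worker speeds (obtained in the paper from an LSTM predictor) are simply taken as an input here.

import numpy as np

def encode(A, n, k):
    """Split A into k row blocks and combine them into n coded blocks.
    A random Gaussian generator stands in for a real MDS code; any k of
    the n coded blocks suffice to recover A, row index by row index."""
    assert A.shape[0] % k == 0, "sketch assumes rows divide evenly into k blocks"
    blocks = np.array_split(A, k, axis=0)          # k blocks, R rows each
    G = np.random.randn(n, k)                      # (n x k) generator matrix
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return coded, G

def slack_squeeze_assignment(R, n, k, speeds):
    """Give worker i a contiguous range of row indices (wrapping mod R)
    whose length is proportional to its predicted speed, so that every
    row index is computed by exactly k workers.  Total work is k * R row
    products, the minimum needed to decode, instead of the n * R a plain
    (n, k)-coded scheme performs; no data is re-distributed, only the
    row ranges each worker is asked to process."""
    speeds = np.asarray(speeds, dtype=float)
    lengths = np.floor(speeds / speeds.sum() * k * R).astype(int)
    lengths[: k * R - lengths.sum()] += 1          # fix rounding so sum == k * R
    assert lengths.max() <= R, "no single worker may cover a row index twice"
    assignments, start = [], 0
    for length in lengths:                         # lay chunks end to end over
        assignments.append((start % R, length))    # k copies of [0, R); mapping
        start += length                            # back mod R gives k-coverage
    return assignments

A worker i would then multiply only its assigned rows of coded[i] by the input vector x, and the master can decode each row of Ax from the k partial results covering that row index using the corresponding rows of G. Because every worker already stores its full coded block, only the assigned row ranges change as the predicted speeds change.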


Published in

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2019, 1921 pages
ISBN: 9781450362290
DOI: 10.1145/3295500
Copyright © 2019 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
