ABSTRACT
While performing distributed computations in today's cloud-based platforms, execution speed variations among compute nodes can significantly reduce the performance and create bottlenecks like stragglers. Coded computation techniques leverage coding theory to inject computational redundancy and mitigate stragglers in distributed computations. In this paper, we propose a dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation (S2C2). S2C2 squeezes the compute slack (i.e., overhead) that is built into the coded computing frameworks by efficiently assigning work for all fast and slow nodes according to their speeds and without needing to re-distribute data. We implement an LSTM-based speed prediction algorithm to predict speeds of compute nodes. We evaluate S2C2 on linear algebraic algorithms, gradient descent, graph ranking, and graph filtering algorithms. We demonstrate 19% to 39% reduction in total computation latency using S2C2 compared to job replication and coded computation. We further show how S2C2 can be applied beyond matrix-vector multiplication.
- 2000. CS Toronto Datasets. http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html, last accessed on 08/01/2017.Google Scholar
- 2014. Apache Hadoop. http://hadoop.apache.org/, last accessed on 08/01/2017.Google Scholar
- 2019. Digital Ocean. https://www.digitalocean.com/, last accessed on 08/01/2019.Google Scholar
- Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, Lukasz Wesolowski, and Laxmikant Kale. 2014. Parallel Programming with Migratable Objects: Charm++ in Practice (SC).Google Scholar
- Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2013. Effective Straggler Mitigation: Attack of the Clones. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (nsdi'13). USENIX Association, Berkeley, CA, USA, 185--198. http://dl.acm.org/citation.cfm?id=2482626.2482645Google ScholarDigital Library
- Ganesh Ananthanarayanan, Srikanth Kandula, Albert G Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the Outliers in Map-Reduce Clusters using Mantri.. In OSDI, Vol. 10. 24.Google ScholarDigital Library
- Manmohan Chaubey and Erik Saule. 2015. Replicated Data Placement for Uncertain Scheduling. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW '15). IEEE Computer Society, Washington, DC, USA, 464--472. Google ScholarDigital Library
- Jeffrey Dean and Luiz AndrÃl' Barroso. 2013. The Tail at Scale. Commun. ACM 56 (2013), 74--80. http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltextGoogle ScholarDigital Library
- Christina Delimitrou and Christos Kozyrakis. 2014. Quasar: Resource-efficient and QoS-aware Cluster Management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 127--144. Google ScholarDigital Library
- Peter A. Dinda. 2001. Online Prediction of the Running Time of Tasks. In Proceedings of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '01). ACM, New York, NY, USA, 336--337. Google ScholarDigital Library
- Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. 2016. Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products. In Advances In Neural Information Processing Systems. 2092--2100.Google Scholar
- Kristen Gardner, Samuel Zbarsky, Sherwin Doroudi, Mor Harchol-Balter, and Esa Hyytia. 2015. Reducing Latency via Redundant Requests: Exact Analysis. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '15). ACM, New York, NY, USA, 347--360. Google ScholarDigital Library
- Inigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen. 2015. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 383--397. Google ScholarDigital Library
- Vipul Gupta, Shusen Wang, Thomas Courtade, and Kannan Ramchandran. 2018. OverSketch: Approximate Matrix Multiplication for the Cloud. arXiv preprint arXiv:1811.02653 (2018).Google Scholar
- Raymond Hill. 1986. A First Course in Coding Theory. Clarendon Press. https://books.google.com/books?id=UTxjBX9lKoMCGoogle Scholar
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735--1780. Google ScholarDigital Library
- Chang-Hong Hsu, Yunqi Zhang, Michael A. Laurenzano, David Meisner, Thomas Wenisch, Ronald G. Dreslinski, Jason Mars, and Lingjia Tang. 2017. Reining in Long Tails in Warehouse-Scale Computers with Quick Voltage Boosting Using Adrenaline. ACM Trans. Comput. Syst. 35, 1, Article 2 (March 2017), 33 pages. Google ScholarDigital Library
- Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache Sharing with Strict Qos for Latency-critical Workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 729--742. Google ScholarDigital Library
- Jack Kosaian, KV Rashmi, and Shivaram Venkataraman. 2019. Parity Models: A General Framework for Coding-Based Resilience in ML Inference. arXiv preprint arXiv:1905.00863 (2019).Google Scholar
- J. Kosaian, K. V. Rashmi, and S. Venkataraman. 2018. Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation. ArXiv e-prints (June 2018). arXiv:cs.LG/1806.01259Google Scholar
- Sanjeev Krishnan Laxmikant Kale. 1993. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In Proceedings of OOPSLA'93. ACM Press, 91--108.Google ScholarDigital Library
- Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2016. Speeding up distributed machine learning using codes. In 2016 IEEE International Symposium on Information Theory (ISIT). 1143--1147. Google ScholarCross Ref
- Kangwook Lee, Ramtin Pedarsani, and Kannan Ramchandran. 2017. On Scheduling Redundant Requests With Cancellation Overheads. IEEE/ACM Trans. Netw. 25, 2 (April 2017), 1279--1290. Google ScholarDigital Library
- Jacob Leverich and Christos Kozyrakis. 2014. Reconciling High Server Utilization and Sub-millisecond Quality-of-service. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys '14). ACM, New York, NY, USA, Article 4, 14 pages. Google ScholarDigital Library
- Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article 9, 14 pages. Google ScholarDigital Library
- Songze Li, Mohammad Ali Maddah-Ali, and Amir Salman Avestimehr. 2015. Coded MapReduce. 53rd Allerton Conference (Sept. 2015).Google Scholar
- Songze Li, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. 2016. A Unified Coding Framework for Distributed Computing with Straggling Servers. e-print arXiv:1609.01690 (Sept. 2016). A shorter version to appear in IEEE NetCod 2016.Google Scholar
- Songze Li, Mohammad Ali Maddah-Ali, Qian Yu, and A Salman Avestimehr. 2016. A Fundamental Tradeoff between Computation and Communication in Distributed Computing. to appear in IEEE Transactions on Information Theory (2016).Google Scholar
- Songze Li, Sucha Supittayapornpong, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. 2017. Coded Terasort. in proceedings of 2017 International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics (2017).Google ScholarCross Ref
- M. Lichman. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/mlGoogle Scholar
- David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. 2014. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 301--312. http://dl.acm.org/citation.cfm?id=2665671.2665718Google ScholarDigital Library
- Qinyi Luo, Jinkun Lin, Youwei Zhuo, and Xuehai Qian. 2019. Hop: Heterogeneity-aware Decentralized Training. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 893--907.Google ScholarDigital Library
- Krishna Giri Narra, Zhifeng Lin, Ganesh Ananthanarayanan, Salman Avestimehr, and Murali Annavaram. 2019. Collage Inference: Tolerating Stragglers in Distributed Neural Network Inference using Coding. arXiv preprint arXiv:1904.12222 (2019).Google Scholar
- Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, and Amir Salman Avestimehr. 2017. Coded Computation over Heterogeneous Clusters. In 2017 IEEE International Symposium on Information Theory (ISIT).Google ScholarCross Ref
- Paul Renteln. 2013. Manifolds, Tensors, and Forms: An Introduction for Mathematicians and Physicists. Cambridge University Press.Google Scholar
- Nihar B. Shah, Kangwook Lee, and Kannan Ramchandran. 2013. When do redundant requests reduce latency?. In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). 731--738. Google ScholarCross Ref
- Junhyun So, Basak Guker, Payman Mohassel, and Salman Avestimehr. 2019. CodedPrivateML: A Fast and Privacy-Preserving Framework for Distributed Machine Learning. arXiv preprint arXiv:1902.00641 (2019).Google Scholar
- Rashish Tandon, Qi Lei, Alexandras G Dimakis, and Nikos Karampatziakis. 2016. Gradient Coding. arXiv preprint arXiv:1612.03301 (2016).Google Scholar
- Da Wang, Gauri Joshi, and Gregory Wornell. 2014. Efficient Task Replication for Fast Response Times in Parallel Computation. SIGMETRICS Perform. Eval. Rev. 42, 1 (June 2014), 599--600. Google ScholarDigital Library
- Rich Wolski, Neil Spring, and Chris Peterson. 1997. Implementing a Performance Forecasting System for Metacomputing: The Network Weather Service. In Proceedings of the 1997 ACM/IEEE Conference on Supercomputing (SC '97). ACM, New York, NY, USA, 1--19. Google ScholarDigital Library
- Yaoqing Yang, Pulkit Grover, and Soummya Kar. 2017. Coding Method for Parallel Iterative Linear Solver. to appear Advances In Neural Information Processing Systems (NIPS) (2017).Google Scholar
- Qian Yu, Mohammad Maddah-Ali, and A. Salman Avestimehr. 2017. Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication. In to appear Advances In Neural Information Processing Systems (NIPS).Google Scholar
- Qian Yu, Netanel Raviv, Jinhyun So, and A Salman Avestimehr. 2019. Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy. in Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS) (2019).Google Scholar
- Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10--10. http://dl.acm.org/citation.cfm?id=1863103.1863113Google ScholarDigital Library
- Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08). USENIX Association, Berkeley, CA, USA, 29--42. http://dl.acm.org/citation.cfm?id=1855741.1855744Google ScholarDigital Library
- Yunqi Zhang, David Meisner, Jason Mars, and Lingjia Tang. 2016. Tread-mill: Attributing the Source of Tail Latency Through Precise Load Testing and Statistical Inference. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 456--468. Google ScholarDigital Library
- Jingge Zhu, Ye Pu, Vipul Gupta, Claire Tomlin, and Kannan Ramchandran. 2017. A Sequential Approximation Framework for Coded Distributed Optimization. CoRR abs/1710.09001 (2017). arXiv:1710.09001 http://arxiv.org/abs/1710.09001Google Scholar
- Timothy Zhu, Michael A. Kozuch, and Mor Harchol-Balter. 2017. WorkloadCompactor: Reducing Datacenter Cost While Providing Tail Latency SLO Guarantees. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC '17). ACM, New York, NY, USA, 598--610. Google ScholarDigital Library
Index Terms
- Slack squeeze coded computing for adaptive straggler mitigation
Recommendations
Timely-Throughput Optimal Coded Computing over Cloud Networks
Mobihoc '19: Proceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and ComputingIn modern distributed computing systems, unpredictable and unreliable infrastructures result in high variability of computing resources. Meanwhile, there is significantly increasing demand for timely and event-driven services with deadline constraints. ...
Hybrid gradient descent cuckoo search (HGDCS) algorithm for resource scheduling in IaaS cloud computing environment
Resource scheduling is a procedure for the distribution of resources over time to perform a required task and a decision making process in cloud computing. Optimal resource scheduling is a great challenge and considered to be an NP-hard problem due to ...
Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres
AbstractCloud computing systems are splitting compute- and data-intensive jobs into smaller tasks to execute them in a parallel manner using clusters to improve execution time. However, such systems at increasing scale are exposed to stragglers, whereby ...
Comments