skip to main content
10.1145/2591971.2592042acmconferencesArticle/Chapter ViewAbstractPublication PagesmetricsConference Proceedingsconference-collections
poster

Efficient task replication for fast response times in parallel computation

Published:16 June 2014Publication History

ABSTRACT

Large-scale distributed computing systems divide a job into many independent tasks and run them in parallel on different machines. A challenge in such parallel computing is that the time taken by a machine to execute a task is inherently variable, and thus the slowest machine becomes the bottleneck in the completion of the job. One way to combat the variability in machine response is to replicate tasks on multiple machines and waiting for the machine that finishes first. While task replication reduces response time, it generally increases resource usage. In this work, we propose a theoretical framework to analyze the trade-off between response time and resource usage. Given an execution time distribution for machines, our analysis gives insights on when and why replication helps. We also propose efficient scheduling algorithms for large-scale distributed computing systems.

References

  1. Ananthanarayanan, G., Ghodsi, A., Shenker, S., and Stoica, I. Effective straggler mitigation: Attack of the clones. NSDI'13, USENIX, pp. 185--198. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B., and Harris, E. Reining in the outliers in map-reduce clusters using mantri. OSDI'10, USENIX, pp. 1--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Cirne, W., Brasileiro, F., Paranhos, D., Luís Fabrício W. Góes, and Voorsluys, W. On the efficacy, efficiency and emergent behavior of task replication in large distributed systems. Parallel Computing 33, 3 (2007), 213--234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Dean, J., and Barroso, L. A. The tail at scale. Commun. ACM 56, 2 (2013), 74--80. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Dean, J., and Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Ghare, G. D., and Leutenegger, S. T. Improving speedup and response times by replicating parallel programs on a SNOW. In Job Scheduling Strategies for Parallel Processing. Springer Berlin Heidelberg, Jan. 2005, pp. 264--287. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Joshi, G., Liu, Y., and Soljanin, E. Coding for fast content download. In Proc. Annu. Allerton Conf. on Commu. Control. & Comput. (Oct. 2012), pp. 326--333.Google ScholarGoogle ScholarCross RefCross Ref
  8. Joshi, G., Liu, Y., and Soljanin, E. On the Delay-Storage trade-off in content download from coded distributed storage systems. arXiv:1305.3945 {cs, math} (May 2013).Google ScholarGoogle Scholar
  9. Shah, N. B., Lee, K., and Ramachandran, K. When do redundant requests reduce latency? arXiv:1311.2851 {cs} (Nov. 2013).Google ScholarGoogle Scholar
  10. Vulimiri, A., Godfrey, P. B., Mittal, R., Sherry, J., Ratnasamy, S., and Shenker, S. Low latency via redundancy. arXiv:1306.3707 {cs} (June 2013).Google ScholarGoogle Scholar
  11. Wang, D., Joshi, G., and Wornell, G. Efficient job replication for fast response times in parallel computation. arXiv:1404.1328 {cs.DC} (Apr. 2014). Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., and Stoica, I. Improving MapReduce performance in heterogeneous environments. OSDI'08, USENIX, pp. 29--42. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient task replication for fast response times in parallel computation

                Recommendations

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in
                • Published in

                  cover image ACM Conferences
                  SIGMETRICS '14: The 2014 ACM international conference on Measurement and modeling of computer systems
                  June 2014
                  614 pages
                  ISBN:9781450327893
                  DOI:10.1145/2591971

                  Copyright © 2014 Owner/Author

                  Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 16 June 2014

                  Check for updates

                  Qualifiers

                  • poster

                  Acceptance Rates

                  SIGMETRICS '14 Paper Acceptance Rate40of237submissions,17%Overall Acceptance Rate459of2,691submissions,17%

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader