poster

Efficient task replication for fast response times in parallel computation

Authors:
Da Wang

MIT, Cambridge, MA, USA

MIT, Cambridge, MA, USA
View Profile

,
Gauri Joshi

MIT, Cambridge, MA, USA

MIT, Cambridge, MA, USA
View Profile

,
Gregory Wornell

MIT, Cambridge, MA, USA

MIT, Cambridge, MA, USA
View Profile

SIGMETRICS '14: The 2014 ACM international conference on Measurement and modeling of computer systemsJune 2014Pages 599–600https://doi.org/10.1145/2591971.2592042

Published:16 June 2014Publication History

SIGMETRICS '14: The 2014 ACM international conference on Measurement and modeling of computer systems

Pages 599–600

ABSTRACT

Large-scale distributed computing systems divide a job into many independent tasks and run them in parallel on different machines. A challenge in such parallel computing is that the time taken by a machine to execute a task is inherently variable, and thus the slowest machine becomes the bottleneck in the completion of the job. One way to combat the variability in machine response is to replicate tasks on multiple machines and waiting for the machine that finishes first. While task replication reduces response time, it generally increases resource usage. In this work, we propose a theoretical framework to analyze the trade-off between response time and resource usage. Given an execution time distribution for machines, our analysis gives insights on when and why replication helps. We also propose efficient scheduling algorithms for large-scale distributed computing systems.

References

Ananthanarayanan, G., Ghodsi, A., Shenker, S., and Stoica, I. Effective straggler mitigation: Attack of the clones. NSDI'13, USENIX, pp. 185--198. Google ScholarDigital Library
Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B., and Harris, E. Reining in the outliers in map-reduce clusters using mantri. OSDI'10, USENIX, pp. 1--16. Google ScholarDigital Library
Cirne, W., Brasileiro, F., Paranhos, D., Luís Fabrício W. Góes, and Voorsluys, W. On the efficacy, efficiency and emergent behavior of task replication in large distributed systems. Parallel Computing 33, 3 (2007), 213--234. Google ScholarDigital Library
Dean, J., and Barroso, L. A. The tail at scale. Commun. ACM 56, 2 (2013), 74--80. Google ScholarDigital Library
Dean, J., and Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarDigital Library
Ghare, G. D., and Leutenegger, S. T. Improving speedup and response times by replicating parallel programs on a SNOW. In Job Scheduling Strategies for Parallel Processing. Springer Berlin Heidelberg, Jan. 2005, pp. 264--287. Google ScholarDigital Library
Joshi, G., Liu, Y., and Soljanin, E. Coding for fast content download. In Proc. Annu. Allerton Conf. on Commu. Control. & Comput. (Oct. 2012), pp. 326--333.Google ScholarCross Ref
Joshi, G., Liu, Y., and Soljanin, E. On the Delay-Storage trade-off in content download from coded distributed storage systems. arXiv:1305.3945 {cs, math} (May 2013).Google Scholar
Shah, N. B., Lee, K., and Ramachandran, K. When do redundant requests reduce latency? arXiv:1311.2851 {cs} (Nov. 2013).Google Scholar
Vulimiri, A., Godfrey, P. B., Mittal, R., Sherry, J., Ratnasamy, S., and Shenker, S. Low latency via redundancy. arXiv:1306.3707 {cs} (June 2013).Google Scholar
Wang, D., Joshi, G., and Wornell, G. Efficient job replication for fast response times in parallel computation. arXiv:1404.1328 {cs.DC} (Apr. 2014). Google ScholarDigital Library
Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., and Stoica, I. Improving MapReduce performance in heterogeneous environments. OSDI'08, USENIX, pp. 29--42. Google ScholarDigital Library

Index Terms

Efficient task replication for fast response times in parallel computation

Recommendations

Using Straggler Replication to Reduce Latency in Large-scale Parallel Computing

In cloud computing jobs consisting of many tasks run in parallel, the tasks on the slowest machines (straggling tasks) become the bottleneck in the completion of the job. One way to combat the variability in machine response time is to add replicas of ...
Read More
Efficient task replication for fast response times in parallel computation
Performance evaluation review

Large-scale distributed computing systems divide a job into many independent tasks and run them in parallel on different machines. A challenge in such parallel computing is that the time taken by a machine to execute a task is inherently variable, and ...
Read More
On the efficacy, efficiency and emergent behavior of task replication in large distributed systems

Large distributed systems challenge traditional schedulers, as it is often hard to determine a priori how long each task will take to complete on each resource, information that is input for such schedulers. Task replication has been applied in a ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMETRICS '14: The 2014 ACM international conference on Measurement and modeling of computer systems
June 2014
614 pages
ISBN:9781450327893
DOI:10.1145/2591971
General Chairs:
Sujay Sanghavi
The University of Texas at Austin
,
Sanjay Shakkottai
The University of Texas at Austin
,
Program Chairs:
Marc Lelarge
INRIA, France
,
Bianca Schroeder
University of Toronto
ACM SIGMETRICS Performance Evaluation Review Volume 42, Issue 1
Performance evaluation review
June 2014
569 pages
ISSN:0163-5999
DOI:10.1145/2637364
Editors:
Derek Eager
University of Saskatchewan
,
Carey Williamson
University of Calgary
Issue’s Table of Contents
Copyright © 2014 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 June 2014
Check for updates
Author Tags
backup tasks
latency
parallel computation
redundancy
scheduling
task replication
variability
Qualifiers
- poster
Conference

Acceptance Rates
SIGMETRICS '14 Paper Acceptance Rate40of237submissions,17%Overall Acceptance Rate459of2,691submissions,17%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 85
  Total Citations
  View Citations
- 417
  Total Downloads
- Downloads (Last 12 months)38
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient task replication for fast response times in parallel computation

SIGMETRICS '14: The 2014 ACM international conference on Measurement and modeling of computer systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Using Straggler Replication to Reduce Latency in Large-scale Parallel Computing

Efficient task replication for fast response times in parallel computation

On the efficacy, efficiency and emergent behavior of task replication in large distributed systems