ABSTRACT
Large-scale distributed computing systems divide a job into many independent tasks and run them in parallel on different machines. A challenge in such parallel computing is that the time taken by a machine to execute a task is inherently variable, and thus the slowest machine becomes the bottleneck in the completion of the job. One way to combat the variability in machine response is to replicate tasks on multiple machines and waiting for the machine that finishes first. While task replication reduces response time, it generally increases resource usage. In this work, we propose a theoretical framework to analyze the trade-off between response time and resource usage. Given an execution time distribution for machines, our analysis gives insights on when and why replication helps. We also propose efficient scheduling algorithms for large-scale distributed computing systems.
- Ananthanarayanan, G., Ghodsi, A., Shenker, S., and Stoica, I. Effective straggler mitigation: Attack of the clones. NSDI'13, USENIX, pp. 185--198. Google ScholarDigital Library
- Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B., and Harris, E. Reining in the outliers in map-reduce clusters using mantri. OSDI'10, USENIX, pp. 1--16. Google ScholarDigital Library
- Cirne, W., Brasileiro, F., Paranhos, D., Luís Fabrício W. Góes, and Voorsluys, W. On the efficacy, efficiency and emergent behavior of task replication in large distributed systems. Parallel Computing 33, 3 (2007), 213--234. Google ScholarDigital Library
- Dean, J., and Barroso, L. A. The tail at scale. Commun. ACM 56, 2 (2013), 74--80. Google ScholarDigital Library
- Dean, J., and Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarDigital Library
- Ghare, G. D., and Leutenegger, S. T. Improving speedup and response times by replicating parallel programs on a SNOW. In Job Scheduling Strategies for Parallel Processing. Springer Berlin Heidelberg, Jan. 2005, pp. 264--287. Google ScholarDigital Library
- Joshi, G., Liu, Y., and Soljanin, E. Coding for fast content download. In Proc. Annu. Allerton Conf. on Commu. Control. & Comput. (Oct. 2012), pp. 326--333.Google ScholarCross Ref
- Joshi, G., Liu, Y., and Soljanin, E. On the Delay-Storage trade-off in content download from coded distributed storage systems. arXiv:1305.3945 {cs, math} (May 2013).Google Scholar
- Shah, N. B., Lee, K., and Ramachandran, K. When do redundant requests reduce latency? arXiv:1311.2851 {cs} (Nov. 2013).Google Scholar
- Vulimiri, A., Godfrey, P. B., Mittal, R., Sherry, J., Ratnasamy, S., and Shenker, S. Low latency via redundancy. arXiv:1306.3707 {cs} (June 2013).Google Scholar
- Wang, D., Joshi, G., and Wornell, G. Efficient job replication for fast response times in parallel computation. arXiv:1404.1328 {cs.DC} (Apr. 2014). Google ScholarDigital Library
- Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., and Stoica, I. Improving MapReduce performance in heterogeneous environments. OSDI'08, USENIX, pp. 29--42. Google ScholarDigital Library
Index Terms
- Efficient task replication for fast response times in parallel computation
Recommendations
Using Straggler Replication to Reduce Latency in Large-scale Parallel Computing
In cloud computing jobs consisting of many tasks run in parallel, the tasks on the slowest machines (straggling tasks) become the bottleneck in the completion of the job. One way to combat the variability in machine response time is to add replicas of ...
Efficient task replication for fast response times in parallel computation
Performance evaluation reviewLarge-scale distributed computing systems divide a job into many independent tasks and run them in parallel on different machines. A challenge in such parallel computing is that the time taken by a machine to execute a task is inherently variable, and ...
On the efficacy, efficiency and emergent behavior of task replication in large distributed systems
Large distributed systems challenge traditional schedulers, as it is often hard to determine a priori how long each task will take to complete on each resource, information that is input for such schedulers. Task replication has been applied in a ...
Comments