skip to main content
research-article
Open access

Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication

Published: 25 April 2022 Publication History

Abstract

Large-scale machine learning and data mining applications require computer systems to perform massive matrix-vector and matrix-matrix multiplication operations that need to be parallelized across multiple nodes. The presence of straggling nodes---computing nodes that unpredictably slow down or fail---is a major bottleneck in such distributed computations. Ideal load balancing strategies that dynamically allocate more tasks to faster nodes require knowledge or monitoring of node speeds as well as the ability to quickly move data. Recently proposed fixed-rate erasure coding strategies can handle unpredictable node slowdown, but they ignore partial work done by straggling nodes, thus resulting in a lot of redundant computation. We propose a rateless fountain coding strategy that achieves the best of both worlds---we prove that its latency is asymptotically equal to ideal load balancing, and it performs asymptotically zero redundant computations. Our idea is to create linear combinations of the m rows of the matrix and assign these encoded rows to different worker nodes. The original matrix-vector product can be decoded as soon as slightly more than m row-vector products are collectively finished by the nodes. Evaluation on parallel and distributed computing yields as much as three times speedup over uncoded schemes.

References

[1]
Amazon. Amazon web services EC2, 2006. https://aws.amazon.com/ec2/.
[2]
Coates, A., Ng, A., Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (2011), Proceedings of Machine Learning Research (PMLR), 215--223.
[3]
Dean, J., Barroso, L.A. The tail at scale. Commun. ACM 56, 2 (2013), 74--80.
[4]
Dean, J., Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113.
[5]
Dinan, J., Olivier, S., Sabin, G., Prins, J., Sadayappan, P., Tseng, C.-W. Dynamic load balancing of unbalanced computations using message passing. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, IEEE, NY, 2007, 1--8.
[6]
Dutta, S., Cadambe, V., Grover, P. "Short-dot" computing large linear transforms distributedly using coded short dot products. In Proceedings of the 30th International Conference on Neural Information Processing Systems (2016), Neural Information Processing Systems Foundation, Inc. (NIPS), 2100--2108.
[7]
P. S. Foundation. Multiprocessing. 2008. https://docs.python.org/3/library/multiprocessing.html.
[8]
Gardner, K., Harchol-Balter, M., Scheller-Wolf, A., Van Houdt, B. A better model for job redundancy: Decoupling server slowdown and job size. IEEE/ACM Trans. Network 25, 6 (2017), 3353--3367.
[9]
Joshi, G. Synergy via redundancy: Boosting service capacity with adaptive replication. ACM SIGMETRICS Perform. Eval. Rev 45, 3 (2018), 21--28.
[10]
Joshi, G., Liu, Y., Soljanin, E. On the delay-storage trade-off in content download from coded distributed storage systems. IEEE J. Sel. Areas Commun 32, 5 (2014), 989--997.
[11]
Joshi, G., Soljanin, E., Wornell, G. Efficient redundancy techniques for latency reduction in cloud systems. ACM Trans. Model. Perform. Eval. Comput. Syst 2, 2 (2017), 1--30.
[12]
Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., Ramchandran, K. Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory 64, 3 (2017), 1514--1529.
[13]
Li, S., Maddah-Ali, M.A., Avestimehr, A.S. A unified coding framework for distributed computing with straggling servers. In Proceedings of the IEEE Global Communications Conference Workshops, IEEE, NY, 2016, 1--6.
[14]
Luby, M. LT codes. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, IEEE, NY, 2002, 271--271.
[15]
Severinson, A., i Amat, A.G., Rosnes, E. Block-diagonal and LT codes for distributed computing with straggling servers. IEEE Trans. Commun 67, 3 (2018), 1739--1753.
[16]
Shokrollahi, A., Luby, M., et al. Raptor codes. Found. Trends® Commun. Inf. Theory 6, 3--4 (2011), 213--322.
[17]
Sun, Y., Zheng, Z., Koksal, C. E., Kim, K., Shroff, N. B. Provably delay efficient data retrieving in storage clouds. In Proceedings of the IEEE Conference on Computer Communications, IEEE, NY, 2015.
[18]
Wang, D., Joshi, G., Wornell, G.W. Efficient straggler replication in large-scale parallel computing. ACM Trans. Model. Perform. Eval. Comput. Syst 4, 2 (2019), 7:1--7:23.
[19]
Wang, S., Liu, J., Shroff, N. Coded sparse matrix multiplication. In Proceedings of the International Conference on Machine Learning (2018), Proceedings of Machine Learning Research (PMLR), 5152--5160.
[20]
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I. Spark: cluster computing with working sets. HotCloud 10, 10 (2010), 95.

Cited By

View all
  • (2024)Fault-Tolerant Parallel Integer MultiplicationProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3659961(207-218)Online publication date: 17-Jun-2024
  • (2024)Relay Selection and Load Allocation for LT Coded Distributed Computing in Two- Hop Heterogeneous Computation Network2024 IEEE Wireless Communications and Networking Conference (WCNC)10.1109/WCNC57260.2024.10571074(1-6)Online publication date: 21-Apr-2024
  • (2024)Wireless Distributed Matrix-Vector Multiplication Using Over-the-Air Computation and Analog CodingIEEE Transactions on Wireless Communications10.1109/TWC.2024.336654723:8_Part_2(9826-9838)Online publication date: 1-Aug-2024
  • Show More Cited By

Index Terms

  1. Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image Communications of the ACM
      Communications of the ACM  Volume 65, Issue 5
      May 2022
      108 pages
      ISSN:0001-0782
      EISSN:1557-7317
      DOI:10.1145/3533590
      Issue’s Table of Contents
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 April 2022
      Published in CACM Volume 65, Issue 5

      Check for updates

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)438
      • Downloads (Last 6 weeks)82
      Reflects downloads up to 15 Feb 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Fault-Tolerant Parallel Integer MultiplicationProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3659961(207-218)Online publication date: 17-Jun-2024
      • (2024)Relay Selection and Load Allocation for LT Coded Distributed Computing in Two- Hop Heterogeneous Computation Network2024 IEEE Wireless Communications and Networking Conference (WCNC)10.1109/WCNC57260.2024.10571074(1-6)Online publication date: 21-Apr-2024
      • (2024)Wireless Distributed Matrix-Vector Multiplication Using Over-the-Air Computation and Analog CodingIEEE Transactions on Wireless Communications10.1109/TWC.2024.336654723:8_Part_2(9826-9838)Online publication date: 1-Aug-2024
      • (2024)Sparsity and Privacy in Secret Sharing: A Fundamental Trade-OffIEEE Transactions on Information Forensics and Security10.1109/TIFS.2024.339425619(5136-5150)Online publication date: 2024
      • (2024)Coded Reactive Stragglers Mitigation in Distributed Computing SystemsIEEE Transactions on Communications10.1109/TCOMM.2023.334777272:8(4527-4537)Online publication date: Aug-2024
      • (2024)Wireless Coded Computation With Error DetectionIEEE Transactions on Communications10.1109/TCOMM.2023.333481072:3(1273-1289)Online publication date: Mar-2024
      • (2023)Network Coding Approaches for Distributed Computation over Lossy Wireless NetworksEntropy10.3390/e2503042825:3(428)Online publication date: 27-Feb-2023
      • (2023)Transition Waste Optimization for Coded Elastic ComputingIEEE Transactions on Information Theory10.1109/TIT.2023.324786069:7(4442-4465)Online publication date: 1-Jul-2023
      • (2023)Deadline-Aware Coded Computation Across Homogeneous WorkersIEEE Signal Processing Letters10.1109/LSP.2023.329953130(982-986)Online publication date: 2023
      • (2022)Adaptive Private Distributed Matrix MultiplicationIEEE Transactions on Information Theory10.1109/TIT.2022.314319968:4(2653-2673)Online publication date: Apr-2022
      • Show More Cited By

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Digital Edition

      View this article in digital edition.

      Digital Edition

      Magazine Site

      View this article on the magazine site (external)

      Magazine Site

      Login options

      Full Access

      Figures

      Tables

      Media

      Share

      Share

      Share this Publication link

      Share on social media