Research Article · Public Access

Trade-Offs Between Synchronization, Communication, and Computation in Parallel Linear Algebra Computations

Published: 17 January 2017

Abstract

This article derives trade-offs between three basic costs of a parallel algorithm: synchronization, data movement, and computational cost. These trade-offs are lower bounds on the execution time of the algorithm that are independent of the number of processors but dependent on the problem size. Therefore, they provide lower bounds on the execution time of any parallel schedule of an algorithm computed by a system composed of any number of homogeneous processors, each with associated computational, communication, and synchronization costs. We employ a theoretical model that measures the amount of work and data movement as a maximum over that incurred along any execution path during the parallel computation. By considering this metric rather than the total communication volume over the whole machine, we obtain new insights into the characteristics of parallel schedules for algorithms with nontrivial dependency structures. We also present reductions from BSP and LogGP algorithms to our execution model, extending our lower bounds to these two models of parallel computation. We first develop our results for general dependency graphs and hypergraphs based on their expansion properties, and then we apply these results to a number of specific algorithms in numerical linear algebra, namely triangular substitution, Cholesky factorization, and stencil computations. We represent some of these algorithms as families of dependency graphs. We derive their communication lower bounds by studying the communication requirements of the hypergraph structures shared by these dependency graphs. In addition to these lower bounds, we introduce a new communication-efficient parallelization for stencil computation algorithms, motivated by the results of our lower-bound analysis and the properties of previously existing parallelizations of these algorithms.
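To make the dependency structures discussed above concrete, the sketch below (not taken from the article; the function name and NumPy-based implementation are our own) shows forward substitution for a lower-triangular solve. Each solution entry depends on all earlier entries, so every parallel schedule must traverse a dependency chain of length n; chains of this kind are what drive the synchronization and communication trade-offs the article analyzes for triangular substitution.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L x = b for a nonsingular lower-triangular matrix L.

    Each x[i] depends on every x[j] with j < i, so the computation
    contains a length-n chain of dependent operations: a parallel
    schedule can distribute the dot products, but the chain itself
    cannot be shortened without algorithmic reformulation.
    """
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        # The inner product L[i, :i] @ x[:i] consumes all previously
        # computed entries of x, forming the sequential dependency chain.
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

L = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([4.0, 5.0])
x = forward_substitution(L, b)
# x satisfies L @ x = b
```

This is a sequential reference implementation; the article's point is precisely that any parallelization of this recurrence pays for the dependency chain in synchronization cost, regardless of processor count.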



Published in

ACM Transactions on Parallel Computing, Volume 3, Issue 1
Special Issue for SPAA 2014
June 2016, 192 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/2965648

          Copyright © 2017 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 August 2014
• Revised: 1 December 2014
• Accepted: 1 January 2016
• Published: 17 January 2017


          Qualifiers

          • research-article
          • Research
          • Refereed
