Research Article · Public Access

Trade-Offs Between Synchronization, Communication, and Computation in Parallel Linear Algebra Computations

Published: 17 January 2017

Abstract

This article derives trade-offs between three basic costs of a parallel algorithm: synchronization, data movement, and computational cost. These trade-offs are lower bounds on the execution time of the algorithm that are independent of the number of processors but dependent on the problem size. Therefore, they provide lower bounds on the execution time of any parallel schedule of an algorithm computed by a system composed of any number of homogeneous processors, each with associated computational, communication, and synchronization costs. We employ a theoretical model that measures the amount of work and data movement as a maximum over that incurred along any execution path during the parallel computation. By considering this metric rather than the total communication volume over the whole machine, we obtain new insights into the characteristics of parallel schedules for algorithms with nontrivial dependency structures. We also present reductions from BSP and LogGP algorithms to our execution model, extending our lower bounds to these two models of parallel computation. We first develop our results for general dependency graphs and hypergraphs based on their expansion properties, and then we apply these results to a number of specific algorithms in numerical linear algebra, namely triangular substitution, Cholesky factorization, and stencil computations. We represent some of these algorithms as families of dependency graphs. We derive their communication lower bounds by studying the communication requirements of the hypergraph structures shared by these dependency graphs. In addition to these lower bounds, we introduce a new communication-efficient parallelization for stencil computation algorithms, motivated by the results of our lower-bound analysis and the properties of previously existing parallelizations of these algorithms.
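To make the dependency structures discussed above concrete, the sketch below (not taken from the article; the function name and NumPy-based implementation are our own) shows forward substitution for a lower-triangular solve. Each solution entry depends on all earlier entries, so every parallel schedule must traverse a dependency chain of length n; chains of this kind are what drive the synchronization and communication trade-offs the article analyzes for triangular substitution.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L x = b for a nonsingular lower-triangular matrix L.

    Each x[i] depends on every x[j] with j < i, so the computation
    contains a length-n chain of dependent operations: a parallel
    schedule can distribute the dot products, but the chain itself
    cannot be shortened without algorithmic reformulation.
    """
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        # The inner product L[i, :i] @ x[:i] consumes all previously
        # computed entries of x, forming the sequential dependency chain.
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

L = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([4.0, 5.0])
x = forward_substitution(L, b)
# x satisfies L @ x = b
```

This is a sequential reference implementation; the article's point is precisely that any parallelization of this recurrence pays for the dependency chain in synchronization cost, regardless of processor count.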



Published in

ACM Transactions on Parallel Computing, Volume 3, Issue 1
Special Issue for SPAA 2014
June 2016, 192 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/2965648

          Copyright © 2017 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 August 2014
• Revised: 1 December 2014
• Accepted: 1 January 2016
• Published: 17 January 2017


          Qualifiers

          • research-article
          • Research
          • Refereed
