Massively Parallel Polar Decomposition on Distributed-memory Systems

Abstract

We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building upon the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation for the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate QDWH convergence. Based on the Zolotarev rational functions, introduced by Zolotarev (ZOLO) in 1877, the new PD algorithm, ZOLO-PD, converges within two iterations even for ill-conditioned matrices, instead of the six iterations needed by QDWH. ZOLO-PD exploits the property of Zolotarev functions that optimality is preserved when two such functions are composed in an appropriate manner. The resulting algorithm achieves a convergence order of up to 17, in contrast to the cubic convergence of QDWH, at the price of a higher arithmetic cost and a larger memory footprint. The extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102,400 cores on two supercomputers and show that, given a large number of processing units, ZOLO-PD outperforms QDWH by up to 2.3×, especially in situations where QDWH runs out of work, for instance, in the strong scaling mode of operation.
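To connect the abstract to a concrete computation, the following is a minimal single-node NumPy sketch of the QR-based QDWH iteration that ZOLO-PD accelerates, following the dynamically weighted Halley formulation of Nakatsukasa, Bai, and Gygi (reference 15). The function name, tolerance, and the dense norm/SVD estimates are illustrative only; they do not reflect the authors' distributed-memory implementation.

```python
import numpy as np

def qdwh_polar(A, tol=1e-14, max_it=6):
    """Sketch of QDWH: compute the polar factor U of A (A = U @ H).

    Dense estimates below (2-norm, smallest singular value) are used for
    simplicity; production codes replace them with cheap bounds.
    """
    m, n = A.shape
    alpha = np.linalg.norm(A, 2)                 # upper bound on ||A||_2
    X = A / alpha                                # X_0 has singular values in (0, 1]
    l = np.linalg.svd(X, compute_uv=False)[-1]   # lower bound on sigma_min(X_0)

    for _ in range(max_it):
        # Dynamically weighted Halley coefficients a(l), b(l), c(l).
        d = (4.0 * (1.0 - l**2) / l**4) ** (1.0 / 3.0)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l**2) / (l**2 * np.sqrt(1.0 + d)))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0

        # Inverse-free, QR-based evaluation of the rational iteration.
        Z = np.vstack([np.sqrt(c) * X, np.eye(n)])
        Q, _ = np.linalg.qr(Z)                   # thin QR of the stacked matrix
        Q1, Q2 = Q[:m, :], Q[m:, :]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)

        # Update the lower bound on the smallest singular value.
        l = l * (a + b * l**2) / (1.0 + c * l**2)
        converged = (np.linalg.norm(X_new - X, 'fro')
                     <= tol * np.linalg.norm(X_new, 'fro'))
        X = X_new
        if converged:
            break

    U = X
    H = U.T @ A
    H = 0.5 * (H + H.T)                          # symmetrize the Hermitian factor
    return U, H
```

Each iteration costs one tall QR factorization of a stacked (m+n)-by-n matrix, and QDWH needs at most six such iterations in double precision; this is the baseline that ZOLO-PD reduces to two iterations at the cost of more, but independent, factorizations per iteration.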

References

  1. TOP500 Supercomputing Sites. 2017. Retrieved from http://www.top500.org/.
  2. Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput.: Pract. Exper. 23, 2 (2011), 187--198. https://hal.inria.fr/inria-00550877
  3. L. Suzan Blackford, J. Choi, Andy Cleary, Eduardo F. D’Azevedo, James W. Demmel, Inderjit S. Dhillon, Jack J. Dongarra, Sven Hammarling, Greg Henry, Antoine Petitet, Ken Stanley, David W. Walker, and R. Clint Whaley. 1997. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA.
  4. Ralph Byers and Hongguo Xu. 2008. A new scaling for Newton’s iteration for the polar decomposition and its backward stability. SIAM J. Matrix Anal. Appl. 30, 2 (2008), 822--843.
  5. Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Key, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E. Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad Van Der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. 2011. The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25, 1 (Feb. 2011), 3--60.
  6. Walter Gander. 1985. On Halley’s iteration method. Amer. Math. Monthly 92, 2 (1985), 131--134.
  7. Walter Gander. 1990. Algorithms for the polar decomposition. SIAM J. Sci. Comput. 11, 6 (1990), 1102--1115.
  8. Gene H. Golub and C. Reinsch. 1970. Singular value decomposition and least squares solutions. Numerische Mathematik 14 (1970), 403--420.
  9. Gene H. Golub and Charles F. Van Loan. 1996. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD.
  10. Nicholas J. Higham. 2008. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA.
  11. Nicholas J. Higham and Pythagoras Papadimitriou. 1994. A parallel algorithm for computing the polar decomposition. Parallel Comput. 20, 8 (Aug. 1994), 1161--1173.
  12. C. S. Kenney and A. J. Laub. 1992. On scaling Newton’s method for polar decomposition and the matrix sign function. SIAM J. Matrix Anal. Appl. 13 (1992), 688--706.
  13. Andrzej Kielbasinski and Krystyna Zietak. 2003. Numerical behaviour of Higham’s scaled method for polar decomposition. Numer. Algor. 32, 2--4 (2003), 105--140.
  14. B. Laszkiewicz and K. Zietak. 2006. Approximation of matrices and a family of Gander methods for polar decomposition. BIT Numer. Math. 46, 2 (2006), 345--366.
  15. Yuji Nakatsukasa, Zhaojun Bai, and François Gygi. 2010. Optimizing Halley’s iteration for computing the matrix polar decomposition. SIAM J. Matrix Anal. Appl. 31, 5 (2010), 2700--2720.
  16. Yuji Nakatsukasa and Roland W. Freund. 2016. Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev’s functions. SIAM Rev. 58, 3 (2016), 461--493.
  17. Yuji Nakatsukasa and Nicholas J. Higham. 2012. Backward stability of iterations for computing the polar decomposition. SIAM J. Matrix Anal. Appl. 33, 2 (2012), 460--479.
  18. Yuji Nakatsukasa and Nicholas J. Higham. 2013. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput. 35, 3 (2013), A1325--A1349.
  19. Jack Poulson, Bryan Marker, Robert A. van de Geijn, Jeff R. Hammond, and Nichols A. Romero. 2013. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw. 39, 2 (2013), 13.
  20. Dalal Sukkari, Hatem Ltaief, Aniello Esposito, and David Keyes. 2017. A QDWH-based SVD software framework on distributed-memory manycore systems. Submitted to ACM Trans. Math. Softw. (under revision). KAUST technical report available at http://hdl.handle.net/10754/626212.
  21. Dalal Sukkari, Hatem Ltaief, Mathieu Faverge, and David E. Keyes. 2018. Asynchronous task-based polar decomposition on single node manycore architectures. IEEE Trans. Parallel Distrib. Syst. 29, 2 (2018), 312--323.
  22. Dalal Sukkari, Hatem Ltaief, and David E. Keyes. 2016. A high-performance QDWH-SVD solver using hardware accelerators. ACM Trans. Math. Softw. 43, 1 (2016), 6:1--6:25.
  23. Dalal Sukkari, Hatem Ltaief, and David E. Keyes. 2016. High-performance polar decomposition on distributed memory systems. In Proceedings of the 22nd International Conference on Parallel and Distributed Computing (Euro-Par’16), Lecture Notes in Computer Science, Vol. 9833, Pierre-François Dutot and Denis Trystram (Eds.). Springer, 605--616.
  24. Lloyd N. Trefethen and David Bau. 1997. Numerical Linear Algebra. SIAM, Philadelphia, PA. Retrieved from http://www.siam.org/books/OT50/Index.htm.
  25. D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger, F. Hannig, E. Jeannot, A. Kamil, J. Keasler, P. H. J. Kelly, V. Leung, H. Ltaief, N. Maruyama, C. J. Newburn, and M. Pericás. 2017. Trends in data locality abstractions for HPC systems. IEEE Trans. Parallel Distrib. Syst. 28, 10 (Oct. 2017), 3007--3020.
  26. E. I. Zolotarev. 1877. Application of elliptic functions to questions of functions deviating least and most from zero. Zap. Imp. Akad. Nauk. St. Petersburg 30, 5 (1877). (Reprinted in his Collected Works, Vol. II, Izdat. Akad. Nauk SSSR, Moscow, 1932, pp. 1--59. In Russian.)

Reviews

Wolfgang Schreiner

For solving problems with high computational demands, it is nowadays essential to apply parallel algorithms that scale effectively to a large number of computational cores: microprocessors may integrate dozens of cores, and compute clusters may combine thousands of processors. One such problem is the polar decomposition (PD) of a matrix, a linear algebra operation "required for a broad class of scientific applications." Until now, this problem has been most effectively solved by the QR-factorization-based dynamically weighted Halley (QDWH) algorithm, which converges for double-precision matrices in at most six iterations. This paper improves QDWH by exploiting special properties of the Zolotarev (ZOLO) functions underlying the iteration. The resulting ZOLO-PD algorithm derives the (2r+1)-th approximation of the solution from r independent QR factorizations; for r = 8, convergence is achieved within only two iterations. Thus, ZOLO-PD exposes a much higher degree of parallelism than QDWH, but at the price of a higher total arithmetic complexity and a larger memory footprint. After a thorough description of the message passing interface (MPI) implementation of the algorithm, the paper evaluates its actual performance: indeed, ZOLO-PD outperforms QDWH by a factor of two, but only given a sufficiently large amount of resources (800 nodes with 32 cores each). All in all, this is a very informative case study on massively parallel algorithm design and the tradeoff between algorithmic complexity and inherent parallelism. The thorough evaluation of the algorithm is a model for other researchers. Finally, the paper opens up new research opportunities with respect to data movement optimization and dynamic scheduling to improve load balancing.
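To make the parallelism highlighted above concrete, here is a minimal NumPy sketch of the structure of one ZOLO-PD update. It assumes the Zolotarev coefficients c_1, c_3, ..., c_{2r-1}, the partial-fraction weights a_1, ..., a_r, and the overall scaling factor of the rational function have already been computed, for example from elliptic functions as in Nakatsukasa and Freund (reference 16). The helper name zolo_update and its argument names are hypothetical; the point is only that the r QR factorizations are mutually independent.

```python
import numpy as np

def zolo_update(X, c_odd, weights, scale):
    """One ZOLO-PD-style update built from r independent QR factorizations.

    X       : current iterate (m x n), scaled so its singular values lie in (0, 1]
    c_odd   : Zolotarev coefficients c_1, c_3, ..., c_{2r-1} (assumed precomputed)
    weights : partial-fraction residues a_1, ..., a_r (assumed precomputed)
    scale   : scaling factor of the Zolotarev rational function (assumed precomputed)

    The r factorizations in the loop share no data, so a distributed
    implementation can dispatch each one to a disjoint group of processes.
    """
    m, n = X.shape
    terms = []
    for cj, aj in zip(c_odd, weights):        # embarrassingly parallel loop
        Z = np.vstack([X, np.sqrt(cj) * np.eye(n)])
        Q, _ = np.linalg.qr(Z)
        Q1, Q2 = Q[:m, :], Q[m:, :]
        # Identity used here: Q1 @ Q2.T == sqrt(cj) * X @ inv(X.T @ X + cj * I)
        terms.append((aj / np.sqrt(cj)) * (Q1 @ Q2.T))
    return scale * (X + sum(terms))
```

Because the r terms are independent, the extra floating-point work of ZOLO-PD can be spread across disjoint subsets of processes and combined with a single reduction, which is the source of the additional parallelism over QDWH that the paper exploits.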


Published in

ACM Transactions on Parallel Computing, Volume 6, Issue 1 (March 2019), 118 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3331062
Editor: David Bader

        Copyright © 2019 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 7 June 2019
        • Accepted: 1 March 2019
        • Revised: 1 November 2018
        • Received: 1 January 2018
Published in TOPC Volume 6, Issue 1


        Qualifiers

        • research-article
        • Research
        • Refereed
