Massively Parallel Polar Decomposition on Distributed-memory Systems

Abstract

We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building upon the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation for the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate QDWH convergence. Based on the Zolotarev rational functions, introduced by Zolotarev (ZOLO) in 1877, the new PD algorithm, ZOLO-PD, converges within two iterations even for ill-conditioned matrices, instead of the six iterations needed by QDWH. ZOLO-PD exploits the property of Zolotarev functions that optimality is preserved when two such functions are composed in an appropriate manner. The resulting algorithm achieves a convergence order of up to 17, in contrast to the cubic convergence of QDWH, at the price of a higher arithmetic cost and a larger memory footprint. The extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102,400 cores on two supercomputers and show that, given a large number of processing units, ZOLO-PD outperforms QDWH by up to 2.3×, especially in situations where QDWH runs out of work, for instance, in the strong scaling mode of operation.
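To connect the abstract to a concrete computation, the following is a minimal single-node NumPy sketch of the QR-based QDWH iteration that ZOLO-PD accelerates, following the dynamically weighted Halley formulation of Nakatsukasa, Bai, and Gygi (reference 15). The function name, tolerance, and the dense norm/SVD estimates are illustrative only; they do not reflect the authors' distributed-memory implementation.

```python
import numpy as np

def qdwh_polar(A, tol=1e-14, max_it=6):
    """Sketch of QDWH: compute the polar factor U of A (A = U @ H).

    Dense estimates below (2-norm, smallest singular value) are used for
    simplicity; production codes replace them with cheap bounds.
    """
    m, n = A.shape
    alpha = np.linalg.norm(A, 2)                 # upper bound on ||A||_2
    X = A / alpha                                # X_0 has singular values in (0, 1]
    l = np.linalg.svd(X, compute_uv=False)[-1]   # lower bound on sigma_min(X_0)

    for _ in range(max_it):
        # Dynamically weighted Halley coefficients a(l), b(l), c(l).
        d = (4.0 * (1.0 - l**2) / l**4) ** (1.0 / 3.0)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l**2) / (l**2 * np.sqrt(1.0 + d)))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0

        # Inverse-free, QR-based evaluation of the rational iteration.
        Z = np.vstack([np.sqrt(c) * X, np.eye(n)])
        Q, _ = np.linalg.qr(Z)                   # thin QR of the stacked matrix
        Q1, Q2 = Q[:m, :], Q[m:, :]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)

        # Update the lower bound on the smallest singular value.
        l = l * (a + b * l**2) / (1.0 + c * l**2)
        converged = (np.linalg.norm(X_new - X, 'fro')
                     <= tol * np.linalg.norm(X_new, 'fro'))
        X = X_new
        if converged:
            break

    U = X
    H = U.T @ A
    H = 0.5 * (H + H.T)                          # symmetrize the Hermitian factor
    return U, H
```

Each iteration costs one tall QR factorization of a stacked (m+n)-by-n matrix, and QDWH needs at most six such iterations in double precision; this is the baseline that ZOLO-PD reduces to two iterations at the cost of more, but independent, factorizations per iteration.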

References

  1. TOP500 Supercomputing Sites. 2017. Retrieved from http://www.top500.org/.
  2. Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput.: Pract. Exper. 23, 2 (2011), 187--198. https://hal.inria.fr/inria-00550877
  3. L. Suzan Blackford, J. Choi, Andy Cleary, Eduardo F. D’Azevedo, James W. Demmel, Inderjit S. Dhillon, Jack J. Dongarra, Sven Hammarling, Greg Henry, Antoine Petitet, Ken Stanley, David W. Walker, and R. Clint Whaley. 1997. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA.
  4. Ralph Byers and Hongguo Xu. 2008. A new scaling for Newton’s iteration for the polar decomposition and its backward stability. SIAM J. Matrix Anal. Appl. 30, 2 (2008), 822--843.
  5. Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Key, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E. Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad Van Der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. 2011. The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25, 1 (Feb. 2011), 3--60.
  6. Walter Gander. 1985. On Halley’s iteration method. Amer. Math. Monthly 92, 2 (1985), 131--134.
  7. Walter Gander. 1990. Algorithms for the polar decomposition. SIAM J. Sci. Comput. 11, 6 (1990), 1102--1115.
  8. Gene H. Golub and C. Reinsch. 1970. Singular value decomposition and least squares solutions. Numerische Mathematik 14 (1970), 403--420.
  9. Gene H. Golub and Charles F. Van Loan. 1996. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD.
  10. Nicholas J. Higham. 2008. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA.
  11. Nicholas J. Higham and Pythagoras Papadimitriou. 1994. A parallel algorithm for computing the polar decomposition. Parallel Comput. 20, 8 (Aug. 1994), 1161--1173.
  12. C. S. Kenney and A. J. Laub. 1992. On scaling Newton’s method for polar decomposition and the matrix sign function. SIAM J. Matrix Anal. Appl. 13 (1992), 688--706.
  13. Andrzej Kielbasinski and Krystyna Zietak. 2003. Numerical behaviour of Higham’s scaled method for polar decomposition. Numer. Algor. 32, 2--4 (2003), 105--140.
  14. B. Laszkiewicz and K. Zietak. 2006. Approximation of matrices and a family of Gander methods for polar decomposition. BIT Numer. Math. 46, 2 (2006), 345--366.
  15. Yuji Nakatsukasa, Zhaojun Bai, and François Gygi. 2010. Optimizing Halley’s iteration for computing the matrix polar decomposition. SIAM J. Matrix Anal. Appl. 31, 5 (2010), 2700--2720.
  16. Yuji Nakatsukasa and Roland W. Freund. 2016. Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev’s functions. SIAM Rev. 58, 3 (2016), 461--493.
  17. Yuji Nakatsukasa and Nicholas J. Higham. 2012. Backward stability of iterations for computing the polar decomposition. SIAM J. Matrix Anal. Appl. 33, 2 (2012), 460--479.
  18. Yuji Nakatsukasa and Nicholas J. Higham. 2013. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput. 35, 3 (2013), A1325--A1349.
  19. Jack Poulson, Bryan Marker, Robert A. van de Geijn, Jeff R. Hammond, and Nichols A. Romero. 2013. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw. 39, 2 (2013), 13.
  20. Dalal Sukkari, Hatem Ltaief, Aniello Esposito, and David Keyes. 2017. A QDWH-based SVD software framework on distributed-memory manycore systems. Submitted to ACM Trans. Math. Softw. (under revision). KAUST technical report available at http://hdl.handle.net/10754/626212.
  21. Dalal Sukkari, Hatem Ltaief, Mathieu Faverge, and David E. Keyes. 2018. Asynchronous task-based polar decomposition on single node manycore architectures. IEEE Trans. Parallel Distrib. Syst. 29, 2 (2018), 312--323.
  22. Dalal Sukkari, Hatem Ltaief, and David E. Keyes. 2016. A high-performance QDWH-SVD solver using hardware accelerators. ACM Trans. Math. Softw. 43, 1 (2016), 6:1--6:25.
  23. Dalal Sukkari, Hatem Ltaief, and David E. Keyes. 2016. High-performance polar decomposition on distributed memory systems. In Proceedings of the 22nd International Conference on Parallel and Distributed Computing (Euro-Par’16), Lecture Notes in Computer Science, Vol. 9833, Pierre-François Dutot and Denis Trystram (Eds.). Springer, 605--616.
  24. Lloyd N. Trefethen and David Bau. 1997. Numerical Linear Algebra. SIAM, Philadelphia, PA. Retrieved from http://www.siam.org/books/OT50/Index.htm.
  25. D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger, F. Hannig, E. Jeannot, A. Kamil, J. Keasler, P. H. J. Kelly, V. Leung, H. Ltaief, N. Maruyama, C. J. Newburn, and M. Pericás. 2017. Trends in data locality abstractions for HPC systems. IEEE Trans. Parallel Distrib. Syst. 28, 10 (Oct. 2017), 3007--3020.
  26. E. I. Zolotarev. 1877. Application of elliptic functions to questions of functions deviating least and most from zero. Zap. Imp. Akad. Nauk. St. Petersburg 30, 5 (1877). (Reprinted in his Collected Works, Vol. II, Izdat. Akad. Nauk SSSR, Moscow, 1932, pp. 1--59. In Russian.)

Reviews

Wolfgang Schreiner

For solving problems with high computational demands, it is nowadays essential to apply parallel algorithms that scale effectively to a large number of computational cores: microprocessors may integrate dozens of cores, and compute clusters may combine thousands of processors. One such problem is the polar decomposition (PD) of a matrix, a linear algebra operation "required for a broad class of scientific applications." Until now, this problem has been most effectively solved by the QR-factorization-based dynamically weighted Halley (QDWH) algorithm, which converges for double-precision matrices in at most six iterations. This paper improves QDWH by exploiting special properties of the Zolotarev (ZOLO) functions underlying the iteration. The resulting ZOLO-PD algorithm derives the (2r+1)-th approximation of the solution from r independent QR factorizations; for r = 8, convergence is achieved within only two iterations. Thus, ZOLO-PD exposes a much higher degree of parallelism than QDWH, but at the price of a higher total arithmetic complexity and a larger memory footprint. After a thorough description of the message passing interface (MPI) implementation of the algorithm, the paper evaluates its actual performance: indeed, ZOLO-PD outperforms QDWH by a factor of two, but only given a sufficiently large amount of resources (800 nodes with 32 cores each). All in all, this is a very informative case study on massively parallel algorithm design and the tradeoff between algorithmic complexity and inherent parallelism. The thorough evaluation of the algorithm is a model for other researchers. Finally, the paper opens up new research opportunities with respect to data movement optimization and dynamic scheduling to improve load balancing.
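To make the parallelism highlighted above concrete, here is a minimal NumPy sketch of the structure of one ZOLO-PD update. It assumes the Zolotarev coefficients c_1, c_3, ..., c_{2r-1}, the partial-fraction weights a_1, ..., a_r, and the overall scaling factor of the rational function have already been computed, for example from elliptic functions as in Nakatsukasa and Freund (reference 16). The helper name zolo_update and its argument names are hypothetical; the point is only that the r QR factorizations are mutually independent.

```python
import numpy as np

def zolo_update(X, c_odd, weights, scale):
    """One ZOLO-PD-style update built from r independent QR factorizations.

    X       : current iterate (m x n), scaled so its singular values lie in (0, 1]
    c_odd   : Zolotarev coefficients c_1, c_3, ..., c_{2r-1} (assumed precomputed)
    weights : partial-fraction residues a_1, ..., a_r (assumed precomputed)
    scale   : scaling factor of the Zolotarev rational function (assumed precomputed)

    The r factorizations in the loop share no data, so a distributed
    implementation can dispatch each one to a disjoint group of processes.
    """
    m, n = X.shape
    terms = []
    for cj, aj in zip(c_odd, weights):        # embarrassingly parallel loop
        Z = np.vstack([X, np.sqrt(cj) * np.eye(n)])
        Q, _ = np.linalg.qr(Z)
        Q1, Q2 = Q[:m, :], Q[m:, :]
        # Identity used here: Q1 @ Q2.T == sqrt(cj) * X @ inv(X.T @ X + cj * I)
        terms.append((aj / np.sqrt(cj)) * (Q1 @ Q2.T))
    return scale * (X + sum(terms))
```

Because the r terms are independent, the extra floating-point work of ZOLO-PD can be spread across disjoint subsets of processes and combined with a single reduction, which is the source of the additional parallelism over QDWH that the paper exploits.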


Published in

ACM Transactions on Parallel Computing, Volume 6, Issue 1 (March 2019), 118 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3331062
Editor: David Bader

        Copyright © 2019 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 7 June 2019
        • Accepted: 1 March 2019
        • Revised: 1 November 2018
        • Received: 1 January 2018
Published in TOPC Volume 6, Issue 1


        Qualifiers

        • research-article
        • Research
        • Refereed
