Abstract
We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building upon on the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation for the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate the QDWH convergence. Based on the Zolotarev rational functions—introduced by Zolotarev (ZOLO) in 1877—this new PD algorithm ZOLO-PD converges within two iterations even for ill-conditioned matrices, instead of the original six iterations needed for QDWH. ZOLO-PD uses the property of Zolotarev functions that optimality is maintained when two functions are composed in an appropriate manner. The resulting ZOLO-PD has a convergence rate up to 17, in contrast to the cubic convergence rate for QDWH. This comes at the price of higher arithmetic costs and memory footprint. These extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102,400 cores on two supercomputers. We demonstrate that, in the presence of a large number of processing units, ZOLO-PD is able to outperform QDWH by up to 2.3× speedup, especially in situations where QDWH runs out of work, for instance, in the strong scaling mode of operation.
- TOP500 Supercomputing Sites. 2017. Retrieved from http://www.top500.org/.Google Scholar
- Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput.: Pract. Exper. 23, 2 (2011), 187--198. https://hal.inria.fr/inria-00550877 Google ScholarDigital Library
- L. Suzan Blackford, J. Choi, Andy Cleary, Eduardo F. D’Azevedo, James W. Demmel, Inderjit S. Dhillon, Jack J. Dongarra, Sven Hammarling, Greg Henry, Antoine Petitet, Ken Stanley, David W. Walker, and R. Clint Whaley. 1997. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia.Google Scholar
- Ralph Byers and Hongguo Xu. 2008. A new scaling for Newton’s iteration for the polar decomposition and its backward stability. SIAM J. Matrix Anal. Appl. 30, 2 (2008), 822--843. Google ScholarDigital Library
- Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Key, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad Van Der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. 2011. The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25, 1 (Feb. 2011), 3--60. Google ScholarDigital Library
- Walter Gander. 1985. On Halley’s iteration method. Amer. Math. Monthly 92, 2 (1985), 131--134.Google ScholarCross Ref
- Walter Gander. 1990. Algorithms for the polar decomposition. SIAM J. Sci. Comput. 11, 6 (1990), 1102--1115. Google ScholarDigital Library
- Gene H. Golub and C. Reinsch. 1970. Singular value decomposition and least squares solutions. Numerische Mathematik 14 (1970), 403--420. Google ScholarDigital Library
- Gene H. Golub and Charles F. Van Loan. 1996. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD. Google ScholarDigital Library
- Nicholas J. Higham. 2008. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA. Google ScholarDigital Library
- Nicholas J. Higham and Pythagoras Papadimitriou. 1994. A parallel algorithm for computing the polar decomposition. Parallel Comput. 20, 8 (Aug. 1994), 1161--1173. Google ScholarDigital Library
- C. S. Kenney and A. J. Laub. 1992. On scaling Newton’s method for polar decomposition and the matrix sign function. SIAM J. Matr. Anal. Appl. 13 (1992), 688--706. Cited in a personal communication by Alan Laub. Google ScholarDigital Library
- Andrzej Kielbasinski and Krystyna Zietak. 2003. Numerical behaviour of Higham’s scaled method for polar decomposition. Numer. Algor. 32, 2--4 (2003), 105--140.Google ScholarCross Ref
- B. Laszkiewicz and K. Zietak. 2006. Approximation of matrices and a family of gander methods for polar decomposition. BIT Numer. Math. 46, 2 (2006), 345--366.Google ScholarCross Ref
- Yuji Nakatsukasa, Zhaojun Bai, and François Gygi. 2010. Optimizing Halley’s iteration for computing the matrix polar decomposition. SIAM J. Matrix Anal. Appl. 31, 5 (2010), 2700--2720. Google ScholarDigital Library
- Yuji Nakatsukasa and Roland W. Freund. 2016. Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev’s functions. SIAM Rev. 58, 3 (2016), 461--493.Google ScholarDigital Library
- Yuji Nakatsukasa and Nicholas J. Higham. 2012. Backward stability of iterations for computing the polar decomposition. SIAM J. Matr. Anal. Appl. 33, 2 (2012), 460--479.Google ScholarCross Ref
- Yuji Nakatsukasa and Nicholas J. Higham. 2013. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput. 35, 3 (2013), A1325--A1349.Google ScholarCross Ref
- Jack Poulson, Bryan Marker, Robert A. van de Geijn, Jeff R. Hammond, and Nichols A. Romero. 2013. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw 39, 2 (2013), 13. Google ScholarDigital Library
- Jack Poulson, Bryan Marker, Robert A. van de Geijn, Jeff R. Hammond, and Nichols A. Romero. 2013. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw 39, 2 (2013), 13. Google ScholarDigital Library
- Dalal Sukkari, Hatem Ltaief, Aniello Esposito, and David Keyes. 2017. A QDWH-based SVD software framework on distributed-memory manycore systems. Submitted to ACM Trans. Math. Softw. (under revision). KAUST technical report available at http://hdl.handle.net/10754/626212. Google ScholarDigital Library
- Dalal Sukkari, Hatem Ltaief, Mathieu Faverge, and David E. Keyes. 2018. Asynchronous task-based polar decomposition on single node manycore architectures. IEEE Trans. Parallel Distrib. Syst 29, 2 (2018), 312--323.Google ScholarCross Ref
- Dalal Sukkari, Hatem Ltaief, and David E. Keyes. 2016. A high-performance QDWH-SVD solver using hardware accelerators. ACM Trans. Math. Softw. 43, 1 (2016), 6:1--6:25. Google ScholarDigital Library
- Dalal Sukkari, Hatem Ltaief, and David E. Keyes. 2016. High-performance polar decomposition on distributed memory systems. In Proceedings of the 22nd International Conference on Parallel and Distributed Computing (EuroPar’16) (Lecture Notes in Computer Science), Pierre-François Dutot and Denis Trystram (Eds.), Vol. 9833. Springer, 605--616.Google Scholar
- Lloyd N. Trefethen and David Bau. 1997. Numerical Linear Algebra. SIAM, Philadelphia, PA. Retrieved from http://www.siam.org/books/OT50/Index.htm.Google Scholar
- D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger, F. Hannig, E. Jeannot, A. Kamil, J. Keasler, P. H. J. Kelly, V. Leung, H. Ltaief, N. Maruyama, C. J. Newburn, and M. Pericás. 2017. Trends in data locality abstractions for HPC systems. IEEE Trans. Parallel Distrib. Syst. 28, 10 (Oct. 2017), 3007--3020.Google ScholarDigital Library
- E. I. Zolotarev. 1877. Application of elliptic functions to questions of functions deviating least and most from zero. Zap. Imp. Akad. Nauk. St. Petersburg 30, 5 (1877). (Reprinted in his Collected Works, Vol. II, Izdat. Akad. Nauk SSSR, Moscow, 1932, pp. 1--59. In Russian).Google Scholar
Index Terms
- Massively Parallel Polar Decomposition on Distributed-memory Systems
Recommendations
Fast Polar Decomposition of an Arbitrary Matrix
The polar decomposition of an $m \times n$ matrix A of full rank, where $m \geqq n$, can be computed using a quadratically convergent algorithm of Higham [SIAM J. Sci. Statist. Comput., 7(1986), pp. 1160–1174]. The algorithm is based on a Newton ...
Parallel $${\mathcal {H}}$$H-matrix arithmetic on distributed-memory systems
In the last decade, the hierarchical matrix technique was introduced to deal with dense matrices in an efficient way. It provides a data-sparse format and allows an approximate matrix algebra of nearly optimal complexity. This paper is concerned with ...
Algorithms for the Polar Decomposition
For the polar decomposition of a square nonsingular matrix, Higham [SIAM J. Sci. Statist. Comput., 7 (1986), pp. 1160–1174] has given a reliable quadratically convergent algorithm that is based on Newton's iteration. Motivated by Halley's iteration, the ...
Comments