ABSTRACT
The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance.
In this article, we describe a successful methodology to address these challenges---starting with our algorithm design, through kernel optimization and tuning, and finishing with our programming model. All these lead to development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grain and cache-contained kernels, a task based approach, and hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver.
Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.
- E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. J. Phys.: Conf. Ser., 180(1), 2009. DOI: 10.1088/1742-6596/180/1/012037.Google Scholar
- E. Agullo, B. Hadri, H. Ltaief, and J. Dongarrra. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1--12, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, 1992. http://www.netlib.org/lapack/lug/. Google ScholarDigital Library
- H. C. Andrews and C. L. Patterson. Singular value decompositions and digital image processing. IEEE Trans Acoust., Speech, Signal Processing ASSP-24, 1, February 1976.Google Scholar
- R. H. Bartels, G. H. Golub, and M. Saunders. Numerical techniques in mathematical programming. In Nonhnear Programming, pages 123--176, New York, 1971. Academm Press.Google Scholar
- M. Bečka, G. Okša, and M. Vajteršic. Parallel Block-Jacobi SVD. In M. W. Berry, K. A. Gallivan, E. Gallopoulos, A. Grama, B. Philippe, Y. Saad, and F. Saied, editors, Methods High-Performance Scientific Computing, pages 185--197. Springer, London Dordrecht Heidelberg New York, 2012. ISBN 978-1-4471-2436-8 e-ISBN 978-1-4471-2437-5 DOI 10.1007/978-1-4471-2437-5.Google Scholar
- C. Bischof, B. Lang, and X. Sun. Parallel tridiagonalization through two-step band reduction. In In Proceedings of the Scalable High-Performance Computing Conference, pages 23--27. IEEE Computer Society Press, 1994.Google ScholarCross Ref
- C. H. Bischof, B. Lang, and X. Sun. Algorithm 807: The SBR Toolbox---software for successive band reduction. ACM TOMS, 26(4):602--616, 2000. Google ScholarDigital Library
- C. H. Bischof and C. V. Loan. The WY representation for products of Householder matrices. SIAM J. Sci. Statist. Comput., 8:s2--s13, 1987. Google ScholarDigital Library
- N. Bregman, R. Bailey, and C. Chapman. Ghosts in tomography: the effects of poor angular coverage in 2-D seismic traveltime inversion. Can. J. Explor. Geophys, 25(1):7--27, 1989.Google Scholar
- A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. The impact of multicore on math software. In B. Kågström, E. Elmroth, J. Dongarra, and J. Wasniewski, editors, Applied Parallel Computing. State of the Art in Scientific Computing, 8th International Workshop, PARA, volume 4699 of Lecture Notes in Computer Science, pages 1--10. Springer, 2006. Google ScholarDigital Library
- A. Buttari, J. J. Dongarra, P. Husbands, J. Kurzak, and K. Yelick. Multithreading for synchronization tolerance in matrix factorization. In Scientific Discovery through Advanced Computing, SciDAC 2007, Boston, MA, June 24--28 2007. Journal of Physics: Conference Series 78:012028, IOP Publishing. DOI: 10.1088/1742-6596/78/1/012028.Google Scholar
- A. Buttari, J. Langou, J. Kurzak, and J. J. Dongarra. Parallel tiled QR factorization for multicore architectures. Concurrency Computat.: Pract. Exper., 20(13):1573--1590, 2008. DOI: 10.1002/cpe.1301. Google ScholarCross Ref
- A. Buttari, J. Langou, J. Kurzak, and J. J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parellel Comput. Syst. Appl., 35:38--53, 2009. DOI: 10.1016/j.parco.2008.10.002. Google ScholarDigital Library
- E. Chan, E. S. Quintana-Orti, G. Quintana-Orti, and R. van de Geijn. Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pages 116--125, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- T. A. Davis and S. Rajamanickam. Algorithm 8xx: PIRO BAND, Pipelined Plane Rotations for Blocked Band Reduction. Submitted to ACM TOMS, Available at www.cise.ufl.edu/~srajaman/genband.pdf, 2010.Google Scholar
- P. Deift, J. W. Demmel, L.-C. Li, and C. Tomei. The bidiagonal singular value decomposition and Hamiltonian mechanics. SIAM J. Numer. Anal., 28(5):1463--1516, October 1991. (LAPACK Working Note #11). Google ScholarDigital Library
- J. W. Demmel and W. Kahan. Accurate singular values of bidiagonal matrices. SIAM J. Sci. Stat. Comput., 11(5):873--912, September 1990. (Also LAPACK LAWN #3). Google ScholarDigital Library
- J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek. Exploiting fine-grain parallelism in recursive LU factorization. In ParCo 2011 -- International Conference on Parallel Computing, Ghent, Belgium, August 30-September 2 2011.Google Scholar
- J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek. High performance matrix inversion based on LU factorization for multicore architectures. In Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers, MTAGS '11, pages 33--42, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek. Exploiting fine-grain parallelism in recursive LU factorization. Advances in Parallel Computing, Special Issue, 22:429--436, 2012. ISBN 978-1-61499-040-6 (print); ISBN 978-1-61499-041-3 (online).Google Scholar
- J. J. Dongarra, D. C. Sorensen, and S. J. Hammarling. Block reduction of matrices to condensed forms for eigenvalue computations. Journal of Computational and Applied Mathematics, 27(1-2):215--227, 1989.Google ScholarCross Ref
- V. Farra and R. Madariaga. Seismic waveform modeling in heterogeneous media by ray perturbation theory. Journal of Geophysical Research: Solid Earth, 92(B3):2697--2712, 1987.Google Scholar
- V. Fernando and B. Parlett. Accurate singular values and differential qd algorithms. Numerisch Math., 67:191--229, 1994.Google ScholarCross Ref
- W. Gansterer, D. Kvasnicka, and C. Ueberhuber. Multi-sweep algorithms for the symmetric eigenproblem. In Vector and Parallel Processing - VECPAR'98, volume 1573 of Lecture Notes in Computer Science, pages 20--28. Springer, 1999. Google ScholarDigital Library
- G. H. Golub and W. Kahan. Calculating the singular values and pseudoinverse of a matrix. SIAM J. Numer. Anal., 2(3):205--224, 1965.Google Scholar
- G. H. Golub and C. F. V. Loan. Matrix Computations. The John Hopkins University Press, 4th edition, December 27 2012. ISBN-10: 1421407949, ISBN-13: 978-1421407944.Google Scholar
- G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. In J. Wilkinson and C. Reinsch, editors, Handbook for Automattc Computation, II, Linear Algebra. Springer-Verlag, New York, 1971.Google Scholar
- G. H. Golub and J. H. Wilkinson. Ill-conditioned eigensystems and the computation of the Jordan canonical form. SIAM Rev., 18(4), October 1976.Google ScholarDigital Library
- R. Grimes, H. Krakauer, J. Lewis, H. Simon, and S.-H. Wei. The solution of large dense generalized eigenvalue problems on the cray X-MP/24 with SSD. J. Comput. Phys., 69:471--481, April 1987. Google ScholarDigital Library
- R. G. Grimes and H. D. Simon. Solution of large, dense symmetric generalized eigenvalue problems using secondary storage. ACM Transactions on Mathematical Software, 14:241--256, September 1988. Google ScholarDigital Library
- M. Gu and S. Eisenstat. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Mat. Anal. Appl., 16:79--92, 1995. Google ScholarDigital Library
- F. G. Gustavson, L. Karlsson, and B. Kågström. Parallel and cache-efficient in-place matrix storage format conversion. ACM Trans. Math. Soft., 38(3):article 17, 2012. DOI: 10.1145/2168773.2168775. Google ScholarDigital Library
- A. Haidar, H. Ltaief, and J. Dongarra. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proceedings of SC '11, pages 8:1--8:11, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- A. Haidar, H. Ltaief, P. Luszczek, and J. Dongarra. A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, Shanghai, China, May 21-25 2012. ISBN 978-1-4673-0975-2. Google ScholarDigital Library
- A. Haidar, H. Ltaief, A. YarKhan, and J. J. Dongarra. Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures. Concurrency Computat.: Pract. Exper., 2011. DOI: 10.1002/cpe.1829. Google ScholarDigital Library
- A. Haidar, P. Luszczek, and J. Dongarra. New multi-stage algorithm for symmetric eigenvalues and eigenvectors achieves two-fold speedup. In Euro-Par 2013, Aachen, Germany, August 26--30 2013. (submitted).Google Scholar
- A. Haidar, S. Tomov, J. Dongarra, R. Solca, and T. Schulthess. A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks. International Journal of High Performance Computing Applications, September 2012. (accepted).Google Scholar
- R. J. Hanson. A numerical method for solving Fredholm integral equations of the first kind using singular values. SIAM J. Numer. Anal., 8(3):616--626, 1971.Google ScholarCross Ref
- H. Hotelling. Analysis of a complex of statistical variables into principal components. J. Educ. Psych., 24:417--441, 498--520, 1933.Google ScholarCross Ref
- H. Hotelling. Simplified calculation of principal components. Psychometrica, 1:27--35, 1935.Google ScholarCross Ref
- A. S. Householder. Unitary triangularization of a nonsymmetric matrix. Journal of the ACM (JACM), 5(4), October 1958. DOI 10.1145/320941.320947. Google ScholarDigital Library
- Intel. Math Kernel Library.Google Scholar
- E. R. Jessup and D. Sorensen. A Parallel Algorithm for Computing the Singular Value Decomposition of a Matrix. SIAM J. Matrix Anal. Appl., 15:530--548, 1994. Google ScholarDigital Library
- E. P. Jiang and M. W. Berry. Information filtering using the Riemannian SVD (R-SVD). In A. Ferreira, J. D. P. Rolim, H. D. Simon, and S.-H. Teng, editors, Solving Irregularly Structured Problems in Parallel, 5th International Symposium, IRREGULAR 98, Berkeley, California, USA, August 9--11, 1998, Proceedings, volume 1457 of Lecture Notes in Computer Science, pages 386--395. Springer, 1998. Google ScholarDigital Library
- J. Kurzak, A. Buttari, and J. J. Dongarra. Solving systems of linear equation on the CELL processor using Cholesky factorization. Trans. Parallel Distrib. Syst., 19(9):1175--1186, 2008. DOI: TPDS.2007.70813. Google ScholarDigital Library
- J. Kurzak and J. J. Dongarra. QR factorization for the CELL processor. Scientific Programming, 00:1--12, 2008. DOI: 10.3233/SPR-2008-0268.Google Scholar
- J. Kurzak, H. Ltaief, J. J. Dongarra, and R. M. Badia. Scheduling dense linear algebra operations on multicore processors. Concurrency Computat.: Pract. Exper., 21(1):15--44, 2009. DOI: 10.1002/cpe.1467. Google ScholarCross Ref
- B. Lang. A parallel algorithm for reducing symmetric banded matrices to tridiagonal form. SIAM J. Sci. Comput., 14:1320--1338, November 1993. Google ScholarDigital Library
- B. Lang. Efficient eigenvalue and singular value computations on shared memory machines. Parallel Computing, 25(7):845--860, 1999. Google ScholarDigital Library
- C. Lawson and R. Hanson. Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, N.J., 1974.Google Scholar
- H. Ltaief, J. Kurzak, and J. Dongarra. Parallel band two-sided matrix bidiagonalization for multicore architectures. IEEE Transactions on Parallel and Distributed Systems, 21(4), April 2010. Google ScholarDigital Library
- H. Ltaief, P. Luszczek, and J. Dongarra. High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures. ACM TOMS, 39(3), 2013. In publication. Google ScholarDigital Library
- H. Ltaief, P. Luszczek, A. Haidar, and J. Dongarra. Enhancing parallelism of tile bidiagonal transformation on multicore architectures using tree reduction. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, Proceedings of 9th International Conference, PPAM 2011, volume 7203, pages 661--670, Torun, Poland, 2012. Google ScholarDigital Library
- P. Luszczek, H. Ltaief, and J. Dongarra. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In IPDPS 2011: IEEE International Parallel and Distributed Processing Symposium, Anchorage, Alaska, USA, May 16--20 2011. Google ScholarDigital Library
- B. P. M. Berry, D. Mezher and A. Sameh. Parallel methods for the singular value decomposition. In E. Kontoghiorghes, editor, Parallel Computing and Statistics, pages 117--164. Chapman & Hall/CRC, 2006.Google Scholar
- B. C. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control, AC-26(1), February 1981.Google ScholarCross Ref
- G. W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.Google Scholar
- G. W. Stewart. The decompositional approach to matrix computation. Computing in Science & Engineering, 2(1):50--59, Jan/Feb 2000. ISSN: 1521-9615; DOI 10.1109/5992.814658. Google ScholarDigital Library
- University of Tennessee Knoxville. PLASMA Users' Guide, Parallel Linear Algebra Software for Multicore Architectures, Version 2.3, November 2010.Google Scholar
- L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8), Aug. 1990. DOI 10.1145/79173.79181. Google ScholarDigital Library
- P. R. Willems, B. Lang, and C. Vömel. Computing the bidiagonal SVD using multiple relatively robust representations. Technical Report CSD-05-1376, University of California Berkeley, Computer Science Division, 2005. Also available as the LAPACK Working Note 166.Google Scholar
- A. YarKhan, J. Kurzak, and J. Dongarra. QUARK users' guide: QUeueing And Runtime for Kernels. Technical Report ICL-UT-11-02, Innovative Computing Laboratory, University of Tennessee, April 2011.Google Scholar
Index Terms
- An improved parallel singular value algorithm and its implementation for multicore hardware
Recommendations
Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputingThe enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on ...
Downdating the Singular Value Decomposition
Let $A$ be a matrix with known singular values and left and/or right singular vectors, and let $A'$ be the matrix obtained by deleting a row from $A$. We present efficient and stable algorithms for computing the singular values and left and/or right ...
Analysis of a QR Algorithm for Computing Singular Values
We extend the Golub--Kahan algorithm for computing the singular value decomposition of bidiagonal matrices to triangular matrices $R$. Our algorithm avoids the explicit formation of $R^TR$ or $RR^T$.
We derive a relation between left and right ...
Comments