DOI: 10.1145/2503210.2503292

An improved parallel singular value algorithm and its implementation for multicore hardware

Published: 17 November 2013

ABSTRACT

The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance.

In this article, we describe a successful methodology to address these challenges---starting with our algorithm design, through kernel optimization and tuning, and finishing with our programming model. All of these lead to the development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques featuring fine-grain and cache-contained kernels, a task-based approach, and a hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver.

Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.
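The contrast between values-only and full-vector runs that these results highlight can be reproduced, at least qualitatively, with standard tools. Below is a minimal, illustrative benchmark, not the authors' solver: it calls SciPy's LAPACK-backed svd routine on a random dense matrix (the size n = 2000 is an arbitrary placeholder) once for singular values only and once for the full decomposition.

```python
# Illustrative benchmark sketch (not the paper's code): contrast the cost of
# computing singular values only vs. the full SVD with singular vectors.
import time
import numpy as np
from scipy import linalg

n = 2000                                  # placeholder size; the paper targets large dense matrices
A = np.random.default_rng(0).standard_normal((n, n))

t0 = time.perf_counter()
s_only = linalg.svd(A, compute_uv=False)  # singular values only
t1 = time.perf_counter()
U, s, Vt = linalg.svd(A)                  # singular values and both sets of vectors
t2 = time.perf_counter()

print(f"values only: {t1 - t0:.2f} s   full SVD: {t2 - t1:.2f} s")
assert np.allclose(s_only, s)             # same spectrum either way
```

Skipping the vectors avoids accumulating and applying the orthogonal transformations, which is the regime where the paper reports its largest (up to 10-fold) advantage.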

References

  1. E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. J. Phys.: Conf. Ser., 180(1), 2009. DOI: 10.1088/1742-6596/180/1/012037.
  2. E. Agullo, B. Hadri, H. Ltaief, and J. Dongarra. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1--12, New York, NY, USA, 2009. ACM.
  3. E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, 1992. http://www.netlib.org/lapack/lug/.
  4. H. C. Andrews and C. L. Patterson. Singular value decompositions and digital image processing. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24(1), February 1976.
  5. R. H. Bartels, G. H. Golub, and M. Saunders. Numerical techniques in mathematical programming. In Nonlinear Programming, pages 123--176, New York, 1971. Academic Press.
  6. M. Bečka, G. Okša, and M. Vajteršic. Parallel Block-Jacobi SVD. In M. W. Berry, K. A. Gallivan, E. Gallopoulos, A. Grama, B. Philippe, Y. Saad, and F. Saied, editors, High-Performance Scientific Computing: Algorithms and Applications, pages 185--197. Springer, London, 2012. DOI: 10.1007/978-1-4471-2437-5.
  7. C. Bischof, B. Lang, and X. Sun. Parallel tridiagonalization through two-step band reduction. In Proceedings of the Scalable High-Performance Computing Conference, pages 23--27. IEEE Computer Society Press, 1994.
  8. C. H. Bischof, B. Lang, and X. Sun. Algorithm 807: The SBR Toolbox---software for successive band reduction. ACM TOMS, 26(4):602--616, 2000.
  9. C. H. Bischof and C. Van Loan. The WY representation for products of Householder matrices. SIAM J. Sci. Statist. Comput., 8:s2--s13, 1987.
  10. N. Bregman, R. Bailey, and C. Chapman. Ghosts in tomography: the effects of poor angular coverage in 2-D seismic traveltime inversion. Can. J. Explor. Geophys., 25(1):7--27, 1989.
  11. A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. The impact of multicore on math software. In B. Kågström, E. Elmroth, J. Dongarra, and J. Wasniewski, editors, Applied Parallel Computing. State of the Art in Scientific Computing, 8th International Workshop, PARA, volume 4699 of Lecture Notes in Computer Science, pages 1--10. Springer, 2006.
  12. A. Buttari, J. J. Dongarra, P. Husbands, J. Kurzak, and K. Yelick. Multithreading for synchronization tolerance in matrix factorization. In Scientific Discovery through Advanced Computing, SciDAC 2007, Boston, MA, June 24--28, 2007. Journal of Physics: Conference Series 78:012028, IOP Publishing. DOI: 10.1088/1742-6596/78/1/012028.
  13. A. Buttari, J. Langou, J. Kurzak, and J. J. Dongarra. Parallel tiled QR factorization for multicore architectures. Concurrency Computat.: Pract. Exper., 20(13):1573--1590, 2008. DOI: 10.1002/cpe.1301.
  14. A. Buttari, J. Langou, J. Kurzak, and J. J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Comput., 35:38--53, 2009. DOI: 10.1016/j.parco.2008.10.002.
  15. E. Chan, E. S. Quintana-Orti, G. Quintana-Orti, and R. van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 116--125, New York, NY, USA, 2007. ACM.
  16. T. A. Davis and S. Rajamanickam. Algorithm 8xx: PIRO BAND, pipelined plane rotations for blocked band reduction. Submitted to ACM TOMS. Available at www.cise.ufl.edu/~srajaman/genband.pdf, 2010.
  17. P. Deift, J. W. Demmel, L.-C. Li, and C. Tomei. The bidiagonal singular value decomposition and Hamiltonian mechanics. SIAM J. Numer. Anal., 28(5):1463--1516, October 1991. (LAPACK Working Note #11).
  18. J. W. Demmel and W. Kahan. Accurate singular values of bidiagonal matrices. SIAM J. Sci. Stat. Comput., 11(5):873--912, September 1990. (Also LAPACK Working Note #3).
  19. J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek. Exploiting fine-grain parallelism in recursive LU factorization. In ParCo 2011 -- International Conference on Parallel Computing, Ghent, Belgium, August 30--September 2, 2011.
  20. J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek. High performance matrix inversion based on LU factorization for multicore architectures. In Proceedings of the 2011 ACM International Workshop on Many Task Computing on Grids and Supercomputers, MTAGS '11, pages 33--42, New York, NY, USA, 2011. ACM.
  21. J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek. Exploiting fine-grain parallelism in recursive LU factorization. Advances in Parallel Computing, Special Issue, 22:429--436, 2012.
  22. J. J. Dongarra, D. C. Sorensen, and S. J. Hammarling. Block reduction of matrices to condensed forms for eigenvalue computations. Journal of Computational and Applied Mathematics, 27(1--2):215--227, 1989.
  23. V. Farra and R. Madariaga. Seismic waveform modeling in heterogeneous media by ray perturbation theory. Journal of Geophysical Research: Solid Earth, 92(B3):2697--2712, 1987.
  24. V. Fernando and B. Parlett. Accurate singular values and differential qd algorithms. Numer. Math., 67:191--229, 1994.
  25. W. Gansterer, D. Kvasnicka, and C. Ueberhuber. Multi-sweep algorithms for the symmetric eigenproblem. In Vector and Parallel Processing -- VECPAR '98, volume 1573 of Lecture Notes in Computer Science, pages 20--28. Springer, 1999.
  26. G. H. Golub and W. Kahan. Calculating the singular values and pseudoinverse of a matrix. SIAM J. Numer. Anal., 2(3):205--224, 1965.
  27. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 4th edition, 2012.
  28. G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. In J. Wilkinson and C. Reinsch, editors, Handbook for Automatic Computation, II, Linear Algebra. Springer-Verlag, New York, 1971.
  29. G. H. Golub and J. H. Wilkinson. Ill-conditioned eigensystems and the computation of the Jordan canonical form. SIAM Rev., 18(4), October 1976.
  30. R. Grimes, H. Krakauer, J. Lewis, H. Simon, and S.-H. Wei. The solution of large dense generalized eigenvalue problems on the Cray X-MP/24 with SSD. J. Comput. Phys., 69:471--481, April 1987.
  31. R. G. Grimes and H. D. Simon. Solution of large, dense symmetric generalized eigenvalue problems using secondary storage. ACM Transactions on Mathematical Software, 14:241--256, September 1988.
  32. M. Gu and S. Eisenstat. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Mat. Anal. Appl., 16:79--92, 1995.
  33. F. G. Gustavson, L. Karlsson, and B. Kågström. Parallel and cache-efficient in-place matrix storage format conversion. ACM Trans. Math. Soft., 38(3):article 17, 2012. DOI: 10.1145/2168773.2168775.
  34. A. Haidar, H. Ltaief, and J. Dongarra. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proceedings of SC '11, pages 8:1--8:11, New York, NY, USA, 2011. ACM.
  35. A. Haidar, H. Ltaief, P. Luszczek, and J. Dongarra. A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, Shanghai, China, May 21--25, 2012.
  36. A. Haidar, H. Ltaief, A. YarKhan, and J. J. Dongarra. Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures. Concurrency Computat.: Pract. Exper., 2011. DOI: 10.1002/cpe.1829.
  37. A. Haidar, P. Luszczek, and J. Dongarra. New multi-stage algorithm for symmetric eigenvalues and eigenvectors achieves two-fold speedup. In Euro-Par 2013, Aachen, Germany, August 26--30, 2013. (submitted).
  38. A. Haidar, S. Tomov, J. Dongarra, R. Solca, and T. Schulthess. A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory-aware tasks. International Journal of High Performance Computing Applications, September 2012. (accepted).
  39. R. J. Hanson. A numerical method for solving Fredholm integral equations of the first kind using singular values. SIAM J. Numer. Anal., 8(3):616--626, 1971.
  40. H. Hotelling. Analysis of a complex of statistical variables into principal components. J. Educ. Psych., 24:417--441, 498--520, 1933.
  41. H. Hotelling. Simplified calculation of principal components. Psychometrika, 1:27--35, 1935.
  42. A. S. Householder. Unitary triangularization of a nonsymmetric matrix. Journal of the ACM, 5(4), October 1958. DOI: 10.1145/320941.320947.
  43. Intel. Math Kernel Library.
  44. E. R. Jessup and D. Sorensen. A parallel algorithm for computing the singular value decomposition of a matrix. SIAM J. Matrix Anal. Appl., 15:530--548, 1994.
  45. E. P. Jiang and M. W. Berry. Information filtering using the Riemannian SVD (R-SVD). In A. Ferreira, J. D. P. Rolim, H. D. Simon, and S.-H. Teng, editors, Solving Irregularly Structured Problems in Parallel, 5th International Symposium, IRREGULAR '98, Berkeley, California, USA, August 9--11, 1998, Proceedings, volume 1457 of Lecture Notes in Computer Science, pages 386--395. Springer, 1998.
  46. J. Kurzak, A. Buttari, and J. J. Dongarra. Solving systems of linear equations on the CELL processor using Cholesky factorization. IEEE Trans. Parallel Distrib. Syst., 19(9):1175--1186, 2008. DOI: 10.1109/TPDS.2007.70813.
  47. J. Kurzak and J. J. Dongarra. QR factorization for the CELL processor. Scientific Programming, 00:1--12, 2008. DOI: 10.3233/SPR-2008-0268.
  48. J. Kurzak, H. Ltaief, J. J. Dongarra, and R. M. Badia. Scheduling dense linear algebra operations on multicore processors. Concurrency Computat.: Pract. Exper., 21(1):15--44, 2009. DOI: 10.1002/cpe.1467.
  49. B. Lang. A parallel algorithm for reducing symmetric banded matrices to tridiagonal form. SIAM J. Sci. Comput., 14:1320--1338, November 1993.
  50. B. Lang. Efficient eigenvalue and singular value computations on shared memory machines. Parallel Computing, 25(7):845--860, 1999.
  51. C. Lawson and R. Hanson. Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, NJ, 1974.
  52. H. Ltaief, J. Kurzak, and J. Dongarra. Parallel band two-sided matrix bidiagonalization for multicore architectures. IEEE Transactions on Parallel and Distributed Systems, 21(4), April 2010.
  53. H. Ltaief, P. Luszczek, and J. Dongarra. High performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures. ACM TOMS, 39(3), 2013. In publication.
  54. H. Ltaief, P. Luszczek, A. Haidar, and J. Dongarra. Enhancing parallelism of tile bidiagonal transformation on multicore architectures using tree reduction. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics, PPAM 2011, volume 7203 of Lecture Notes in Computer Science, pages 661--670, Torun, Poland, 2012.
  55. P. Luszczek, H. Ltaief, and J. Dongarra. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In IPDPS 2011: IEEE International Parallel and Distributed Processing Symposium, Anchorage, AK, USA, May 16--20, 2011.
  56. M. W. Berry, D. Mezher, B. Philippe, and A. Sameh. Parallel methods for the singular value decomposition. In E. Kontoghiorghes, editor, Parallel Computing and Statistics, pages 117--164. Chapman & Hall/CRC, 2006.
  57. B. C. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control, AC-26(1), February 1981.
  58. G. W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.
  59. G. W. Stewart. The decompositional approach to matrix computation. Computing in Science & Engineering, 2(1):50--59, Jan./Feb. 2000. DOI: 10.1109/5992.814658.
  60. University of Tennessee Knoxville. PLASMA Users' Guide, Parallel Linear Algebra Software for Multicore Architectures, Version 2.3, November 2010.
  61. L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8), August 1990. DOI: 10.1145/79173.79181.
  62. P. R. Willems, B. Lang, and C. Vömel. Computing the bidiagonal SVD using multiple relatively robust representations. Technical Report CSD-05-1376, University of California, Berkeley, Computer Science Division, 2005. Also available as LAPACK Working Note 166.
  63. A. YarKhan, J. Kurzak, and J. Dongarra. QUARK users' guide: QUeueing And Runtime for Kernels. Technical Report ICL-UT-11-02, Innovative Computing Laboratory, University of Tennessee, April 2011.


      Reviews

Reviewer: Sanzheng Qiao

The singular value decomposition (SVD) has numerous applications, including signal processing, data compression, principal component analysis (PCA), and pattern recognition. Many applications involve large data matrices, and the SVD is computationally intensive. This paper presents a high-performance SVD algorithm for large matrices. Its high performance is achieved by utilizing the multicore architecture for parallelism and by exploiting locality to reduce the traffic between fast memory (caches) and slow memory.

Typically, the computation of the SVD of a matrix consists of two parts: the matrix is reduced to an upper or lower bidiagonal matrix, and then the SVD of the bidiagonal matrix is computed. The bidiagonalization part dominates the total computation, and this paper focuses on it. The authors adopt a two-stage method for bidiagonalization: in the first stage, the matrix is reduced to a band matrix, which is further reduced to the bidiagonal form in the second stage. Two techniques are applied: the tile algorithm, where a matrix is partitioned into blocks called tiles, and hybrid scheduling, a combination of a directed acyclic graph (DAG) used to express data dependencies and a superscalar scheduler called QUeueing And Runtime for Kernels (QUARK) used for dynamic scheduling.

The first stage is computation bound and relatively simple. The second stage, however, is memory bound. Due to the data dependencies introduced by the bulge chasing in the second stage, it is quite involved, especially when the accumulation of the orthogonal/unitary transformations is required for the computation of the singular vectors. It requires a careful design of the data-dependency DAG and is prone to errors. As an example, figure 6(a) shows a lower triangular matrix partitioned into block columns, each of which consists of alternating red and green diamond-shaped tiles. Since the tiles are diamond shaped, a tile shares rows with those immediately above or below it in the same block column; consequently, a tile depends on the tile immediately above it. In particular, tile 13 (red) in the figure depends on tile 12 (the green diamond above 13). However, the corresponding dependency graph in figure 6(c) has no edge from node 12 into node 13, which means that node 13 can run concurrently with node 12, violating the data dependency shown in figure 6(a). In section 3, the authors claim that section 5.4 provides further information about the combination of computation splitting and hybrid scheduling; unfortunately, section 5.4 is very sketchy. Some details can be found in Ltaief et al.'s paper, but there is no discussion of the accumulation of the orthogonal/unitary transformations.

The authors report impressive performance for the algorithm. The reader should be aware that it is applicable to large (on the order of at least a couple of thousand), dense, square or near-square matrices; different approaches are necessary for sparse or rectangular matrices. Moreover, the tile size is crucial to performance. The first stage, which is computation bound, favors large tile sizes (around 200, but not more than about 300), whereas the second stage, which is memory bound, favors small tile sizes (under 200), so fine-tuning of the tile size is necessary for high performance. Typically, a tile size around 150 is a good compromise.

Online Computing Reviews Service
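To make the reviewer's dependency argument concrete, here is a toy sketch in the spirit of QUARK's task-superscalar model (not its actual API; all names are illustrative): each task declares the tiles it reads and writes, and the runtime derives the DAG edges from those declarations, so a task that touches a tile written by an earlier task, as tile 13 touches the output of tile 12, is automatically serialized after it.

```python
# Toy task-superscalar scheduler sketch: dependencies are inferred from
# declared tile accesses. RAW and WAW hazards are handled; anti-dependencies
# (WAR) are omitted to keep the sketch short.
from collections import defaultdict

class ToyScheduler:
    def __init__(self):
        self.tasks = []                  # (name, ins, outs)
        self.last_writer = {}            # tile -> index of the task that last wrote it
        self.edges = defaultdict(set)    # task index -> successor task indices

    def insert_task(self, name, ins=(), outs=()):
        idx = len(self.tasks)
        self.tasks.append((name, ins, outs))
        for tile in tuple(ins) + tuple(outs):
            if tile in self.last_writer:           # must run after the tile's last writer
                self.edges[self.last_writer[tile]].add(idx)
        for tile in outs:
            self.last_writer[tile] = idx

    def run(self):
        # Execute the tasks in a topological order of the inferred DAG.
        indegree = [0] * len(self.tasks)
        for succs in self.edges.values():
            for s in succs:
                indegree[s] += 1
        ready = [i for i, d in enumerate(indegree) if d == 0]
        while ready:
            i = ready.pop()
            print("executing", self.tasks[i][0])   # a real runtime invokes the kernel here
            for s in self.edges[i]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    ready.append(s)

sched = ToyScheduler()
# Hypothetical bulge-chasing tasks: task13 reads and updates a tile that
# task12 writes, so the runtime inserts the edge 12 -> 13 that the reviewer
# found missing in figure 6(c).
sched.insert_task("task12", ins=("T(3,1)",), outs=("T(3,1)", "T(4,1)"))
sched.insert_task("task13", ins=("T(4,1)",), outs=("T(4,1)",))
sched.run()
```

With correct IN/OUT declarations such an edge is generated automatically, which is the point of the superscalar scheduling approach the paper adopts.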

Published in

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013, 1123 pages
ISBN: 9781450323789
DOI: 10.1145/2503210
General Chair: William Gropp; Program Chair: Satoshi Matsuoka
        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • research-article

        Acceptance Rates

SC '13 Paper Acceptance Rate: 91 of 449 submissions (20%). Overall Acceptance Rate: 1,516 of 6,373 submissions (24%).
