DOI: 10.1145/2503210.2503292

An improved parallel singular value algorithm and its implementation for multicore hardware

Published: 17 November 2013

ABSTRACT

The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance.

In this article, we describe a successful methodology to address these challenges---starting with our algorithm design, through kernel optimization and tuning, and finishing with our programming model. All of these lead to the development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques featuring fine-grain and cache-contained kernels, a task-based approach, and a hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver.

Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.
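The contrast between values-only and full-vector runs that these results highlight can be reproduced, at least qualitatively, with standard tools. Below is a minimal, illustrative benchmark, not the authors' solver: it calls SciPy's LAPACK-backed svd routine on a random dense matrix (the size n = 2000 is an arbitrary placeholder) once for singular values only and once for the full decomposition.

```python
# Illustrative benchmark sketch (not the paper's code): contrast the cost of
# computing singular values only vs. the full SVD with singular vectors.
import time
import numpy as np
from scipy import linalg

n = 2000                                  # placeholder size; the paper targets large dense matrices
A = np.random.default_rng(0).standard_normal((n, n))

t0 = time.perf_counter()
s_only = linalg.svd(A, compute_uv=False)  # singular values only
t1 = time.perf_counter()
U, s, Vt = linalg.svd(A)                  # singular values and both sets of vectors
t2 = time.perf_counter()

print(f"values only: {t1 - t0:.2f} s   full SVD: {t2 - t1:.2f} s")
assert np.allclose(s_only, s)             # same spectrum either way
```

Skipping the vectors avoids accumulating and applying the orthogonal transformations, which is the regime where the paper reports its largest (up to 10-fold) advantage.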

References

  1. E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. J. Phys.: Conf. Ser., 180(1), 2009. DOI: 10.1088/1742-6596/180/1/012037.
  2. E. Agullo, B. Hadri, H. Ltaief, and J. Dongarra. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1--12, New York, NY, USA, 2009. ACM.
  3. E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, 1992. http://www.netlib.org/lapack/lug/.
  4. H. C. Andrews and C. L. Patterson. Singular value decompositions and digital image processing. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24(1), February 1976.
  5. R. H. Bartels, G. H. Golub, and M. Saunders. Numerical techniques in mathematical programming. In Nonlinear Programming, pages 123--176, New York, 1971. Academic Press.
  6. M. Bečka, G. Okša, and M. Vajteršic. Parallel Block-Jacobi SVD. In M. W. Berry, K. A. Gallivan, E. Gallopoulos, A. Grama, B. Philippe, Y. Saad, and F. Saied, editors, High-Performance Scientific Computing: Algorithms and Applications, pages 185--197. Springer, London, 2012. DOI: 10.1007/978-1-4471-2437-5.
  7. C. Bischof, B. Lang, and X. Sun. Parallel tridiagonalization through two-step band reduction. In Proceedings of the Scalable High-Performance Computing Conference, pages 23--27. IEEE Computer Society Press, 1994.
  8. C. H. Bischof, B. Lang, and X. Sun. Algorithm 807: The SBR Toolbox---software for successive band reduction. ACM TOMS, 26(4):602--616, 2000.
  9. C. H. Bischof and C. Van Loan. The WY representation for products of Householder matrices. SIAM J. Sci. Statist. Comput., 8:s2--s13, 1987.
  10. N. Bregman, R. Bailey, and C. Chapman. Ghosts in tomography: the effects of poor angular coverage in 2-D seismic traveltime inversion. Can. J. Explor. Geophys., 25(1):7--27, 1989.
  11. A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. The impact of multicore on math software. In B. Kågström, E. Elmroth, J. Dongarra, and J. Wasniewski, editors, Applied Parallel Computing. State of the Art in Scientific Computing, 8th International Workshop, PARA, volume 4699 of Lecture Notes in Computer Science, pages 1--10. Springer, 2006.
  12. A. Buttari, J. J. Dongarra, P. Husbands, J. Kurzak, and K. Yelick. Multithreading for synchronization tolerance in matrix factorization. In Scientific Discovery through Advanced Computing, SciDAC 2007, Boston, MA, June 24--28, 2007. Journal of Physics: Conference Series 78:012028, IOP Publishing. DOI: 10.1088/1742-6596/78/1/012028.
  13. A. Buttari, J. Langou, J. Kurzak, and J. J. Dongarra. Parallel tiled QR factorization for multicore architectures. Concurrency Computat.: Pract. Exper., 20(13):1573--1590, 2008. DOI: 10.1002/cpe.1301.
  14. A. Buttari, J. Langou, J. Kurzak, and J. J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Comput., 35:38--53, 2009. DOI: 10.1016/j.parco.2008.10.002.
  15. E. Chan, E. S. Quintana-Orti, G. Quintana-Orti, and R. van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 116--125, New York, NY, USA, 2007. ACM.
  16. T. A. Davis and S. Rajamanickam. Algorithm 8xx: PIRO BAND, pipelined plane rotations for blocked band reduction. Submitted to ACM TOMS. Available at www.cise.ufl.edu/~srajaman/genband.pdf, 2010.
  17. P. Deift, J. W. Demmel, L.-C. Li, and C. Tomei. The bidiagonal singular value decomposition and Hamiltonian mechanics. SIAM J. Numer. Anal., 28(5):1463--1516, October 1991. (LAPACK Working Note #11).
  18. J. W. Demmel and W. Kahan. Accurate singular values of bidiagonal matrices. SIAM J. Sci. Stat. Comput., 11(5):873--912, September 1990. (Also LAPACK Working Note #3).
  19. J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek. Exploiting fine-grain parallelism in recursive LU factorization. In ParCo 2011 -- International Conference on Parallel Computing, Ghent, Belgium, August 30--September 2, 2011.
  20. J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek. High performance matrix inversion based on LU factorization for multicore architectures. In Proceedings of the 2011 ACM International Workshop on Many Task Computing on Grids and Supercomputers, MTAGS '11, pages 33--42, New York, NY, USA, 2011. ACM.
  21. J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek. Exploiting fine-grain parallelism in recursive LU factorization. Advances in Parallel Computing, Special Issue, 22:429--436, 2012.
  22. J. J. Dongarra, D. C. Sorensen, and S. J. Hammarling. Block reduction of matrices to condensed forms for eigenvalue computations. Journal of Computational and Applied Mathematics, 27(1--2):215--227, 1989.
  23. V. Farra and R. Madariaga. Seismic waveform modeling in heterogeneous media by ray perturbation theory. Journal of Geophysical Research: Solid Earth, 92(B3):2697--2712, 1987.
  24. V. Fernando and B. Parlett. Accurate singular values and differential qd algorithms. Numer. Math., 67:191--229, 1994.
  25. W. Gansterer, D. Kvasnicka, and C. Ueberhuber. Multi-sweep algorithms for the symmetric eigenproblem. In Vector and Parallel Processing -- VECPAR '98, volume 1573 of Lecture Notes in Computer Science, pages 20--28. Springer, 1999.
  26. G. H. Golub and W. Kahan. Calculating the singular values and pseudoinverse of a matrix. SIAM J. Numer. Anal., 2(3):205--224, 1965.
  27. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 4th edition, 2012.
  28. G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. In J. Wilkinson and C. Reinsch, editors, Handbook for Automatic Computation, II, Linear Algebra. Springer-Verlag, New York, 1971.
  29. G. H. Golub and J. H. Wilkinson. Ill-conditioned eigensystems and the computation of the Jordan canonical form. SIAM Rev., 18(4), October 1976.
  30. R. Grimes, H. Krakauer, J. Lewis, H. Simon, and S.-H. Wei. The solution of large dense generalized eigenvalue problems on the Cray X-MP/24 with SSD. J. Comput. Phys., 69:471--481, April 1987.
  31. R. G. Grimes and H. D. Simon. Solution of large, dense symmetric generalized eigenvalue problems using secondary storage. ACM Transactions on Mathematical Software, 14:241--256, September 1988.
  32. M. Gu and S. Eisenstat. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Mat. Anal. Appl., 16:79--92, 1995.
  33. F. G. Gustavson, L. Karlsson, and B. Kågström. Parallel and cache-efficient in-place matrix storage format conversion. ACM Trans. Math. Soft., 38(3):article 17, 2012. DOI: 10.1145/2168773.2168775.
  34. A. Haidar, H. Ltaief, and J. Dongarra. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proceedings of SC '11, pages 8:1--8:11, New York, NY, USA, 2011. ACM.
  35. A. Haidar, H. Ltaief, P. Luszczek, and J. Dongarra. A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, Shanghai, China, May 21--25, 2012.
  36. A. Haidar, H. Ltaief, A. YarKhan, and J. J. Dongarra. Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures. Concurrency Computat.: Pract. Exper., 2011. DOI: 10.1002/cpe.1829.
  37. A. Haidar, P. Luszczek, and J. Dongarra. New multi-stage algorithm for symmetric eigenvalues and eigenvectors achieves two-fold speedup. In Euro-Par 2013, Aachen, Germany, August 26--30, 2013. (submitted).
  38. A. Haidar, S. Tomov, J. Dongarra, R. Solca, and T. Schulthess. A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory-aware tasks. International Journal of High Performance Computing Applications, September 2012. (accepted).
  39. R. J. Hanson. A numerical method for solving Fredholm integral equations of the first kind using singular values. SIAM J. Numer. Anal., 8(3):616--626, 1971.
  40. H. Hotelling. Analysis of a complex of statistical variables into principal components. J. Educ. Psych., 24:417--441, 498--520, 1933.
  41. H. Hotelling. Simplified calculation of principal components. Psychometrika, 1:27--35, 1935.
  42. A. S. Householder. Unitary triangularization of a nonsymmetric matrix. Journal of the ACM, 5(4), October 1958. DOI: 10.1145/320941.320947.
  43. Intel. Math Kernel Library.
  44. E. R. Jessup and D. Sorensen. A parallel algorithm for computing the singular value decomposition of a matrix. SIAM J. Matrix Anal. Appl., 15:530--548, 1994.
  45. E. P. Jiang and M. W. Berry. Information filtering using the Riemannian SVD (R-SVD). In A. Ferreira, J. D. P. Rolim, H. D. Simon, and S.-H. Teng, editors, Solving Irregularly Structured Problems in Parallel, 5th International Symposium, IRREGULAR '98, Berkeley, California, USA, August 9--11, 1998, Proceedings, volume 1457 of Lecture Notes in Computer Science, pages 386--395. Springer, 1998.
  46. J. Kurzak, A. Buttari, and J. J. Dongarra. Solving systems of linear equations on the CELL processor using Cholesky factorization. IEEE Trans. Parallel Distrib. Syst., 19(9):1175--1186, 2008. DOI: 10.1109/TPDS.2007.70813.
  47. J. Kurzak and J. J. Dongarra. QR factorization for the CELL processor. Scientific Programming, 00:1--12, 2008. DOI: 10.3233/SPR-2008-0268.
  48. J. Kurzak, H. Ltaief, J. J. Dongarra, and R. M. Badia. Scheduling dense linear algebra operations on multicore processors. Concurrency Computat.: Pract. Exper., 21(1):15--44, 2009. DOI: 10.1002/cpe.1467.
  49. B. Lang. A parallel algorithm for reducing symmetric banded matrices to tridiagonal form. SIAM J. Sci. Comput., 14:1320--1338, November 1993.
  50. B. Lang. Efficient eigenvalue and singular value computations on shared memory machines. Parallel Computing, 25(7):845--860, 1999.
  51. C. Lawson and R. Hanson. Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, NJ, 1974.
  52. H. Ltaief, J. Kurzak, and J. Dongarra. Parallel band two-sided matrix bidiagonalization for multicore architectures. IEEE Transactions on Parallel and Distributed Systems, 21(4), April 2010.
  53. H. Ltaief, P. Luszczek, and J. Dongarra. High performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures. ACM TOMS, 39(3), 2013. In publication.
  54. H. Ltaief, P. Luszczek, A. Haidar, and J. Dongarra. Enhancing parallelism of tile bidiagonal transformation on multicore architectures using tree reduction. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics, PPAM 2011, volume 7203 of Lecture Notes in Computer Science, pages 661--670, Torun, Poland, 2012.
  55. P. Luszczek, H. Ltaief, and J. Dongarra. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In IPDPS 2011: IEEE International Parallel and Distributed Processing Symposium, Anchorage, AK, USA, May 16--20, 2011.
  56. M. W. Berry, D. Mezher, B. Philippe, and A. Sameh. Parallel methods for the singular value decomposition. In E. Kontoghiorghes, editor, Parallel Computing and Statistics, pages 117--164. Chapman & Hall/CRC, 2006.
  57. B. C. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control, AC-26(1), February 1981.
  58. G. W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.
  59. G. W. Stewart. The decompositional approach to matrix computation. Computing in Science & Engineering, 2(1):50--59, Jan./Feb. 2000. DOI: 10.1109/5992.814658.
  60. University of Tennessee Knoxville. PLASMA Users' Guide, Parallel Linear Algebra Software for Multicore Architectures, Version 2.3, November 2010.
  61. L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8), August 1990. DOI: 10.1145/79173.79181.
  62. P. R. Willems, B. Lang, and C. Vömel. Computing the bidiagonal SVD using multiple relatively robust representations. Technical Report CSD-05-1376, University of California, Berkeley, Computer Science Division, 2005. Also available as LAPACK Working Note 166.
  63. A. YarKhan, J. Kurzak, and J. Dongarra. QUARK users' guide: QUeueing And Runtime for Kernels. Technical Report ICL-UT-11-02, Innovative Computing Laboratory, University of Tennessee, April 2011.


      Reviews

Reviewer: Sanzheng Qiao

The singular value decomposition (SVD) has numerous applications, including signal processing, data compression, principal component analysis (PCA), and pattern recognition. Many applications involve large data matrices, and the SVD is computationally intensive. This paper presents a high-performance SVD algorithm for large matrices. Its high performance is achieved by utilizing the multicore architecture for parallelism and by exploiting locality to reduce the traffic between fast memory (caches) and slow memory.

Typically, the computation of the SVD of a matrix consists of two parts: the matrix is reduced to an upper or lower bidiagonal matrix, and then the SVD of the bidiagonal matrix is computed. The bidiagonalization part dominates the total computation, and this paper focuses on it. The authors adopt a two-stage method for bidiagonalization: in the first stage, the matrix is reduced to a band matrix, which is further reduced to the bidiagonal form in the second stage. Two techniques are applied: the tile algorithm, where a matrix is partitioned into blocks called tiles, and hybrid scheduling, a combination of a directed acyclic graph (DAG) used to express data dependencies and a superscalar scheduler called QUeueing And Runtime for Kernels (QUARK) used for dynamic scheduling.

The first stage is computation bound and relatively simple. The second stage, however, is memory bound. Due to the data dependencies introduced by the bulge chasing in the second stage, it is quite involved, especially when the accumulation of the orthogonal/unitary transformations is required for the computation of the singular vectors. It requires a careful design of the data-dependency DAG and is prone to errors. As an example, figure 6(a) shows a lower triangular matrix partitioned into block columns, each of which consists of alternating red and green diamond-shaped tiles. Since the tiles are diamond shaped, a tile shares rows with those immediately above or below it in the same block column; consequently, a tile depends on the tile immediately above it. In particular, tile 13 (red) in the figure depends on tile 12 (the green diamond above 13). However, the corresponding dependency graph in figure 6(c) has no edge from node 12 into node 13, which means that node 13 can run concurrently with node 12, violating the data dependency shown in figure 6(a). In section 3, the authors claim that section 5.4 provides further information about the combination of computation splitting and hybrid scheduling; unfortunately, section 5.4 is very sketchy. Some details can be found in Ltaief et al.'s paper, but there is no discussion of the accumulation of the orthogonal/unitary transformations.

The authors report impressive performance for the algorithm. The reader should be aware that it is applicable to large (on the order of at least a couple of thousand), dense, square or near-square matrices; different approaches are necessary for sparse or rectangular matrices. Moreover, the tile size is crucial to performance. The first stage, which is computation bound, favors large tile sizes (around 200, but not more than about 300), whereas the second stage, which is memory bound, favors small tile sizes (under 200), so fine-tuning of the tile size is necessary for high performance. Typically, a tile size around 150 is a good compromise.

Online Computing Reviews Service
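To make the reviewer's dependency argument concrete, here is a toy sketch in the spirit of QUARK's task-superscalar model (not its actual API; all names are illustrative): each task declares the tiles it reads and writes, and the runtime derives the DAG edges from those declarations, so a task that touches a tile written by an earlier task, as tile 13 touches the output of tile 12, is automatically serialized after it.

```python
# Toy task-superscalar scheduler sketch: dependencies are inferred from
# declared tile accesses. RAW and WAW hazards are handled; anti-dependencies
# (WAR) are omitted to keep the sketch short.
from collections import defaultdict

class ToyScheduler:
    def __init__(self):
        self.tasks = []                  # (name, ins, outs)
        self.last_writer = {}            # tile -> index of the task that last wrote it
        self.edges = defaultdict(set)    # task index -> successor task indices

    def insert_task(self, name, ins=(), outs=()):
        idx = len(self.tasks)
        self.tasks.append((name, ins, outs))
        for tile in tuple(ins) + tuple(outs):
            if tile in self.last_writer:           # must run after the tile's last writer
                self.edges[self.last_writer[tile]].add(idx)
        for tile in outs:
            self.last_writer[tile] = idx

    def run(self):
        # Execute the tasks in a topological order of the inferred DAG.
        indegree = [0] * len(self.tasks)
        for succs in self.edges.values():
            for s in succs:
                indegree[s] += 1
        ready = [i for i, d in enumerate(indegree) if d == 0]
        while ready:
            i = ready.pop()
            print("executing", self.tasks[i][0])   # a real runtime invokes the kernel here
            for s in self.edges[i]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    ready.append(s)

sched = ToyScheduler()
# Hypothetical bulge-chasing tasks: task13 reads and updates a tile that
# task12 writes, so the runtime inserts the edge 12 -> 13 that the reviewer
# found missing in figure 6(c).
sched.insert_task("task12", ins=("T(3,1)",), outs=("T(3,1)", "T(4,1)"))
sched.insert_task("task13", ins=("T(4,1)",), outs=("T(4,1)",))
sched.run()
```

With correct IN/OUT declarations such an edge is generated automatically, which is the point of the superscalar scheduling approach the paper adopts.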

Published in

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013, 1123 pages
ISBN: 9781450323789
DOI: 10.1145/2503210
General Chair: William Gropp; Program Chair: Satoshi Matsuoka
        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • research-article

        Acceptance Rates

SC '13 Paper Acceptance Rate: 91 of 449 submissions (20%). Overall Acceptance Rate: 1,516 of 6,373 submissions (24%).
