skip to main content
10.1145/1504176.1504196acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

Solving dense linear systems on platforms with multiple hardware accelerators

Published: 14 February 2009 Publication History

Abstract

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques in computer architecture to improve the performance. Our experimental evaluation on a Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performances around 550 and 450 (single-precision) GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.

References

[1]
Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[2]
S. Barrachina, M. Castillo, F. Igual, and R. Mayo adn E. S. Quintana-Ortí. Solving dense linear systems on graphics processors. In Eds. E. Luque et al., editor, Proceedings of Euro-Par 2008, number 5168 in LNCS, pages 739--748. Springer-Verlag Berlin Heidelberg, 2008.
[3]
S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, and E. S. Quintana-Ortí. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In 9th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing -- PDSEC'08, 2008. (CD-ROM).
[4]
Pieter Bellens, Josep M. Pérez, Rosa M. Badía, and Jesús Labarta. Cells: a programming model for the Cell BE architecture. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing -- SC2006, page 86, New York, NY, USA, 2006. ACM Press.
[5]
Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms. PhD thesis, 2006.
[6]
Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Soft., 31(1):1--26, March 2005.
[7]
Paolo Bientinesi, Brian Gunter, and Robert A. van de Geijn. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software, 35(1):3, July 2008. Article 3, 22 pages.
[8]
Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. LAPACK Working Note 190 UT-CS-07-600, University of Tennessee, September 2007.
[9]
Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. LAPACK Working Note 190 UT-CS-07-598, University of Tennessee, July 2007.
[10]
Maribel Castillo, Ernie Chan, Francisco D. Igual, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Robert van de Geijn, and Field G. Van Zee. Making parallel programming synonymous with programming for linear algebra libraries. FLAME Working Note #31 TR-08-20, The University of Texas at Austin, Department of Computer Sciences, April 2009.
[11]
Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth ACM Symposium on Parallelism in Algorithms and Architectures, pages 116--125, San Diego, CA, USA, June 9-11 2007. ACM.
[12]
Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In ACM SIGPLAN 2008 symposium on Principles and Practices of Parallel Programming -- PPoPP 2008, pages 123--132, 2008.
[13]
Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. Satisfying your dependencies with SuperMatrix. In Proceedings of IEEE Cluster Computing 2007, pages 91--99, September 2007.
[14]
J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 120--127. IEEE Comput. Soc. Press, 1992.
[15]
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1--17, March 1990.
[16]
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14(1):1--17, March 1988.
[17]
Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3--45, 2004.
[18]
Nico Galoppo, Naga K. Govindaraju, Michael Henson, and Dinesh Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing -- SC2005, page 3, Washington, DC, USA, 2005. IEEE Computer Society.
[19]
David Geer. Chip makers turn to multicore processors. Computer, 38(5):11--13, 2005.
[20]
John A. Gunnels. A Systematic Approach to the Design and Analysis of Parallel Dense Linear Algebra Algorithms. PhD thesis, Department of Computer Sciences, The University of Texas, December 2001.
[21]
John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Trans. Math. Soft., 27(4):422--455, December 2001.
[22]
Jia Guo, Ganesh Bikshandi, Basilio Fraguela, Maria Garzaran, and David Padua. Programming with tiles. In PPoPP '08: The 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, 2008.
[23]
J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman, 3rd edition, 2003.
[24]
Jin Hyuk Junk and Dianne P. O'Leary. Cholesky decomposition and linear programming on a GPU. Master's thesis, University of Maryland, College Park.
[25]
Jakub Kurzak, Alfredo Buttari, and Jack Dongarra. Solving systems of linear equations on the CELL processor using Cholesky factorization. LAPACK Working Note 184 UT-CS-07-596, University of Tennessee, May 2007.
[26]
C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft., 5(3):308--323, Sept. 1979.
[27]
C. Leiserson and A. Plaat. Programming parallel applications in Cilk. SINEWS: SIAM News, 1998.
[28]
Bryan A. Marker, Field G. Van Zee, Kazushige Goto, Gregorio Quintana-Ortí, and Robert A. van de Geijn. Toward scalable matrix multiply on multithreaded architectures. In European Conference on Parallel and Distributed Computing -- Euro-Par 2007, pages 748--757, February 2007.
[29]
Enrique S. Quintana-Ortí and Robert A. van de Geijn. Formal derivation of algorithms: The triangular Sylvester equation. ACM Trans. Math. Soft., 29(2):218--243, June 2003.
[30]
Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design and scheduling of an algorithm-by-blocks for LU factorization on multithreaded architectures. FLAME Working Note #26 TR-07-50, The University of Texas at Austin, Department of Computer Sciences, September 2007.
[31]
Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization. In Workshop on Multithreaded Architectures and Applications -- MTAAP 2008, 2008. CD-ROM.
[32]
Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Field G. Van Zee, and Robert A. van de Geijn. Scheduling of QR factorization algorithms on SMP and multi-core architectures.
[33]
In F. Spies D. El Baz, J. Bourgeois, editor, 16th Euromicro International Conference on Parallel, Distributed and Network-based Processing -- PDP 2008, pages 301--310, 2008.
[34]
Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Alfredo Remón, and Robert van de Geijn. Supermatrix for the factorization of band matrices. FLAME Working Note #27 TR-07-51, The University of Texas at Austin, Department of Computer Sciences, September 2007.
[35]
Peter Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Technical Report TR-CS-98-07, Department of Computer Science, The Australian National University, Canberra 0200 ACT, Australia, 1998.
[36]
Vinod Valsalam and Anthony Skjellum. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurrency and Computation: Practice and Experience, 14(10):805--840, 2002.
[37]
Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997.
[38]
Vasily Volkov and James Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical Report UCB/EECS-2008-49, EECS Department, University of California, Berkeley, May 2008.
[39]
Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1--11, Piscataway, NJ, USA, 2008. IEEE Press.
[40]
David S. Wise, Jeremy D. Frens, Yuhong Gu, and Gregory A. Alexander. Language support for Morton-order matrices. In Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming -- PPoPP 2001, pages 24--33, New York, NY, USA, 2001. ACM Press.

Cited By

View all
  • (2019)Introducing: The Libflame Library for Dense Matrix ComputationsComputing in Science & Engineering10.1109/MCSE.2009.154(1-1)Online publication date: 2019
  • (2018)Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple DevicesAccelerator Programming Using Directives10.1007/978-3-319-74896-2_9(160-182)Online publication date: 31-Jan-2018
  • (2016)Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources2016 IEEE 23rd International Conference on High Performance Computing (HiPC)10.1109/HiPC.2016.045(321-330)Online publication date: Dec-2016
  • Show More Cited By

Index Terms

  1. Solving dense linear systems on platforms with multiple hardware accelerators

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
      February 2009
      322 pages
      ISBN:9781605583976
      DOI:10.1145/1504176
      • cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 44, Issue 4
        PPoPP '09
        April 2009
        294 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/1594835
        Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 14 February 2009

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. algorithms-by-blocks
      2. depencency analysis
      3. dynamic scheduling
      4. gpus
      5. out-of-order execution

      Qualifiers

      • Research-article

      Conference

      PPoPP09
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 230 of 1,014 submissions, 23%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)15
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 08 Mar 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2019)Introducing: The Libflame Library for Dense Matrix ComputationsComputing in Science & Engineering10.1109/MCSE.2009.154(1-1)Online publication date: 2019
      • (2018)Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple DevicesAccelerator Programming Using Directives10.1007/978-3-319-74896-2_9(160-182)Online publication date: 31-Jan-2018
      • (2016)Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources2016 IEEE 23rd International Conference on High Performance Computing (HiPC)10.1109/HiPC.2016.045(321-330)Online publication date: Dec-2016
      • (2015)SemCache++: semantics-aware caching for efficient multi-GPU offloadingACM SIGPLAN Notices10.1145/2858788.268852750:8(255-256)Online publication date: 24-Jan-2015
      • (2015)SemCache++Proceedings of the 29th ACM on International Conference on Supercomputing10.1145/2751205.2751210(79-88)Online publication date: 8-Jun-2015
      • (2015)Supporting multiple accelerators in high-level programming modelsProceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/2712386.2712405(170-180)Online publication date: 7-Feb-2015
      • (2015)SemCache++: semantics-aware caching for efficient multi-GPU offloadingProceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/2688500.2688527(255-256)Online publication date: 24-Jan-2015
      • (2015)Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU NodesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2014.232174226:7(1814-1825)Online publication date: 10-Jun-2015
      • (2015)Balancing task- and data-level parallelism to improve performance and energy consumption of matrix computations on the Intel Xeon PhiComputers and Electrical Engineering10.1016/j.compeleceng.2015.06.00946:C(95-111)Online publication date: 1-Aug-2015
      • (2015)A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systemsConcurrency and Computation: Practice & Experience10.1002/cpe.340327:14(3702-3723)Online publication date: 25-Sep-2015
      • Show More Cited By

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Figures

      Tables

      Media

      Share

      Share

      Share this Publication link

      Share on social media