research-article

Solving dense linear systems on platforms with multiple hardware accelerators

Authors:

Gregorio Quintana-Ortí,

Francisco D. Igual,

Enrique S. Quintana-Ortí,

Robert A. van de GeijnAuthors Info & Claims

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 121 - 130

https://doi.org/10.1145/1504176.1504196

Published: 14 February 2009 Publication History

Abstract

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques in computer architecture to improve the performance. Our experimental evaluation on a Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performances around 550 and 450 (single-precision) GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.

References

[1]

Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[2]

S. Barrachina, M. Castillo, F. Igual, and R. Mayo adn E. S. Quintana-Ortí. Solving dense linear systems on graphics processors. In Eds. E. Luque et al., editor, Proceedings of Euro-Par 2008, number 5168 in LNCS, pages 739--748. Springer-Verlag Berlin Heidelberg, 2008.

Digital Library

[3]

S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, and E. S. Quintana-Ortí. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In 9th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing -- PDSEC'08, 2008. (CD-ROM).

[4]

Pieter Bellens, Josep M. Pérez, Rosa M. Badía, and Jesús Labarta. Cells: a programming model for the Cell BE architecture. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing -- SC2006, page 86, New York, NY, USA, 2006. ACM Press.

Digital Library

[5]

Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms. PhD thesis, 2006.

Digital Library

[6]

Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Soft., 31(1):1--26, March 2005.

Digital Library

[7]

Paolo Bientinesi, Brian Gunter, and Robert A. van de Geijn. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software, 35(1):3, July 2008. Article 3, 22 pages.

Digital Library

[8]

Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. LAPACK Working Note 190 UT-CS-07-600, University of Tennessee, September 2007.

[9]

Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. LAPACK Working Note 190 UT-CS-07-598, University of Tennessee, July 2007.

[10]

Maribel Castillo, Ernie Chan, Francisco D. Igual, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Robert van de Geijn, and Field G. Van Zee. Making parallel programming synonymous with programming for linear algebra libraries. FLAME Working Note #31 TR-08-20, The University of Texas at Austin, Department of Computer Sciences, April 2009.

[11]

Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth ACM Symposium on Parallelism in Algorithms and Architectures, pages 116--125, San Diego, CA, USA, June 9-11 2007. ACM.

Digital Library

[12]

Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In ACM SIGPLAN 2008 symposium on Principles and Practices of Parallel Programming -- PPoPP 2008, pages 123--132, 2008.

Digital Library

[13]

Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. Satisfying your dependencies with SuperMatrix. In Proceedings of IEEE Cluster Computing 2007, pages 91--99, September 2007.

Digital Library

[14]

J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 120--127. IEEE Comput. Soc. Press, 1992.

[15]

Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1--17, March 1990.

Digital Library

[16]

Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14(1):1--17, March 1988.

Digital Library

[17]

Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3--45, 2004.

[18]

Nico Galoppo, Naga K. Govindaraju, Michael Henson, and Dinesh Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing -- SC2005, page 3, Washington, DC, USA, 2005. IEEE Computer Society.

Digital Library

[19]

David Geer. Chip makers turn to multicore processors. Computer, 38(5):11--13, 2005.

Digital Library

[20]

John A. Gunnels. A Systematic Approach to the Design and Analysis of Parallel Dense Linear Algebra Algorithms. PhD thesis, Department of Computer Sciences, The University of Texas, December 2001.

Digital Library

[21]

John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Trans. Math. Soft., 27(4):422--455, December 2001.

Digital Library

[22]

Jia Guo, Ganesh Bikshandi, Basilio Fraguela, Maria Garzaran, and David Padua. Programming with tiles. In PPoPP '08: The 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, 2008.

Digital Library

[23]

J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman, 3rd edition, 2003.

Digital Library

[24]

Jin Hyuk Junk and Dianne P. O'Leary. Cholesky decomposition and linear programming on a GPU. Master's thesis, University of Maryland, College Park.

[25]

Jakub Kurzak, Alfredo Buttari, and Jack Dongarra. Solving systems of linear equations on the CELL processor using Cholesky factorization. LAPACK Working Note 184 UT-CS-07-596, University of Tennessee, May 2007.

[26]

C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft., 5(3):308--323, Sept. 1979.

Digital Library

[27]

C. Leiserson and A. Plaat. Programming parallel applications in Cilk. SINEWS: SIAM News, 1998.

[28]

Bryan A. Marker, Field G. Van Zee, Kazushige Goto, Gregorio Quintana-Ortí, and Robert A. van de Geijn. Toward scalable matrix multiply on multithreaded architectures. In European Conference on Parallel and Distributed Computing -- Euro-Par 2007, pages 748--757, February 2007.

Digital Library

[29]

Enrique S. Quintana-Ortí and Robert A. van de Geijn. Formal derivation of algorithms: The triangular Sylvester equation. ACM Trans. Math. Soft., 29(2):218--243, June 2003.

Digital Library

[30]

Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design and scheduling of an algorithm-by-blocks for LU factorization on multithreaded architectures. FLAME Working Note #26 TR-07-50, The University of Texas at Austin, Department of Computer Sciences, September 2007.

[31]

Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization. In Workshop on Multithreaded Architectures and Applications -- MTAAP 2008, 2008. CD-ROM.

[32]

Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Field G. Van Zee, and Robert A. van de Geijn. Scheduling of QR factorization algorithms on SMP and multi-core architectures.

[33]

In F. Spies D. El Baz, J. Bourgeois, editor, 16th Euromicro International Conference on Parallel, Distributed and Network-based Processing -- PDP 2008, pages 301--310, 2008.

[34]

Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Alfredo Remón, and Robert van de Geijn. Supermatrix for the factorization of band matrices. FLAME Working Note #27 TR-07-51, The University of Texas at Austin, Department of Computer Sciences, September 2007.

[35]

Peter Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Technical Report TR-CS-98-07, Department of Computer Science, The Australian National University, Canberra 0200 ACT, Australia, 1998.

[36]

Vinod Valsalam and Anthony Skjellum. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurrency and Computation: Practice and Experience, 14(10):805--840, 2002.

[37]

Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997.

Digital Library

[38]

Vasily Volkov and James Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical Report UCB/EECS-2008-49, EECS Department, University of California, Berkeley, May 2008.

[39]

Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1--11, Piscataway, NJ, USA, 2008. IEEE Press.

Digital Library

[40]

David S. Wise, Jeremy D. Frens, Yuhong Gu, and Gregory A. Alexander. Language support for Morton-order matrices. In Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming -- PPoPP 2001, pages 24--33, New York, NY, USA, 2001. ACM Press.

Digital Library

Cited By

Van Zee FChan Evan de Geijn RQuintana EQuintana-Orti G(2019)Introducing: The Libflame Library for Dense Matrix ComputationsComputing in Science & Engineering10.1109/MCSE.2009.154(1-1)Online publication date: 2019
https://doi.org/10.1109/MCSE.2009.154
Hahnfeld JTerboven CPrice JPflug HMüller M(2018)Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple DevicesAccelerator Programming Using Directives10.1007/978-3-319-74896-2_9(160-182)Online publication date: 31-Jan-2018
https://doi.org/10.1007/978-3-319-74896-2_9
Beaumont OCojean TEyraud-Dubois LGuermouche AKumar S(2016)Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources2016 IEEE 23rd International Conference on High Performance Computing (HiPC)10.1109/HiPC.2016.045(321-330)Online publication date: Dec-2016
https://doi.org/10.1109/HiPC.2016.045
Show More Cited By

Index Terms

Solving dense linear systems on platforms with multiple hardware accelerators
1. Mathematics of computing
  1. Mathematical software
2. Software and its engineering

Recommendations

Solving dense linear systems on platforms with multiple hardware accelerators
PPoPP '09

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further ...
Optimized two-level parallelization for GPU accelerators using the polyhedral model
CC 2017: Proceedings of the 26th International Conference on Compiler Construction

While GPUs play an increasingly important role in today's high-performance computers, optimizing GPU performance continues to impose large burdens upon programmers. A major challenge in optimizing codes for GPUs stems from the two levels of hardware ...
Comparing Hardware Accelerators in Scientific Applications: A Case Study

Multicore processors and a variety of accelerators have allowed scientific applications to scale to larger problem sizes. We present a performance, design methodology, platform, and architectural comparison of several application accelerators executing ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

February 2009

322 pages

ISBN:9781605583976

DOI:10.1145/1504176

General Chair:
Daniel Reed
Microsoft Research, USA
,
Program Chair:
Vivek Sarkar
Rice University, USA

ACM SIGPLAN Notices Volume 44, Issue 4
PPoPP '09
April 2009
294 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1594835
Issue’s Table of Contents

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 February 2009

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

PPoPP09

Sponsor:

PPoPP09: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 14 - 18, 2009

NC, Raleigh, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

80
Total Citations
View Citations
735
Total Downloads

Downloads (Last 12 months)15
Downloads (Last 6 weeks)0

Reflects downloads up to 08 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Van Zee FChan Evan de Geijn RQuintana EQuintana-Orti G(2019)Introducing: The Libflame Library for Dense Matrix ComputationsComputing in Science & Engineering10.1109/MCSE.2009.154(1-1)Online publication date: 2019
https://doi.org/10.1109/MCSE.2009.154
Hahnfeld JTerboven CPrice JPflug HMüller M(2018)Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple DevicesAccelerator Programming Using Directives10.1007/978-3-319-74896-2_9(160-182)Online publication date: 31-Jan-2018
https://doi.org/10.1007/978-3-319-74896-2_9
Beaumont OCojean TEyraud-Dubois LGuermouche AKumar S(2016)Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources2016 IEEE 23rd International Conference on High Performance Computing (HiPC)10.1109/HiPC.2016.045(321-330)Online publication date: Dec-2016
https://doi.org/10.1109/HiPC.2016.045
Al-Saber NKulkarni M(2015)SemCache++: semantics-aware caching for efficient multi-GPU offloadingACM SIGPLAN Notices10.1145/2858788.268852750:8(255-256)Online publication date: 24-Jan-2015
https://dl.acm.org/doi/10.1145/2858788.2688527
Al-Saber NKulkarni MBhuyan LChong FSarkar V(2015)SemCache++Proceedings of the 29th ACM on International Conference on Supercomputing10.1145/2751205.2751210(79-88)Online publication date: 8-Jun-2015
https://dl.acm.org/doi/10.1145/2751205.2751210
Yan YLin PLiao Cde Supinski BQuinlan D(2015)Supporting multiple accelerators in high-level programming modelsProceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/2712386.2712405(170-180)Online publication date: 7-Feb-2015
https://dl.acm.org/doi/10.1145/2712386.2712405
Al-Saber NKulkarni MCohen AGrove D(2015)SemCache++: semantics-aware caching for efficient multi-GPU offloadingProceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/2688500.2688527(255-256)Online publication date: 24-Jan-2015
https://dl.acm.org/doi/10.1145/2688500.2688527
Gangwon Jo Jeongho Nah Jun Lee Jungwon Kim Jaejin Lee (2015)Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU NodesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2014.232174226:7(1814-1825)Online publication date: 10-Jun-2015
https://dl.acm.org/doi/10.1109/TPDS.2014.2321742
Dolz MIgual FLudwig TPiñuel LQuintana-Ortí E(2015)Balancing task- and data-level parallelism to improve performance and energy consumption of matrix computations on the Intel Xeon PhiComputers and Electrical Engineering10.1016/j.compeleceng.2015.06.00946:C(95-111)Online publication date: 1-Aug-2015
https://dl.acm.org/doi/10.1016/j.compeleceng.2015.06.009
Song FDongarra J(2015)A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systemsConcurrency and Computation: Practice & Experience10.1002/cpe.340327:14(3702-3723)Online publication date: 25-Sep-2015
https://dl.acm.org/doi/10.1002/cpe.3403
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten