Scalable parallelization of FLAME code via the workqueuing model

Authors: Field G. Van Zee, Paolo Bientinesi, Robert A. van de Geijn

ACM Transactions on Mathematical Software (TOMS), Volume 34, Issue 2

Article No.: 10, Pages 1 - 29

https://doi.org/10.1145/1326548.1326552

Published: 19 March 2008

Abstract

We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands, much as a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.
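
The abstract's point about sorting tasks by computational cost can be illustrated with a minimal sketch. It is not the authors' implementation and uses neither the FLAME API nor OpenMP. The function names `syrk_tasks` and `makespan`, the block size `nb`, the approximate flop counts, and the "earliest free worker takes the next task" model of a work queue are all assumptions made for illustration. The sketch cuts the lower triangle of C in the symmetric rank-k update C := A A^T + C into a two-dimensional grid of blocks, gives each block an approximate cost, and compares the simulated finishing time with and without sorting the tasks by decreasing cost.

```python
import heapq


def syrk_tasks(n, k, nb):
    """List block tasks for the lower triangle of C := A * A^T + C.

    Hypothetical model: C is n x n, A is n x k, and C is cut into
    nb x nb blocks (2D partitioning). Costs are approximate flop counts.
    """
    starts = list(range(0, n, nb))
    tasks = []
    for bi, i0 in enumerate(starts):
        mi = min(nb, n - i0)              # rows in block row bi
        for bj, j0 in enumerate(starts[: bi + 1]):
            mj = min(nb, n - j0)          # columns in block column bj
            if bi == bj:
                cost = mi * (mi + 1) * k  # diagonal block: small rank-k update
            else:
                cost = 2 * mi * mj * k    # off-diagonal block: general multiply
            tasks.append(((bi, bj), cost))
    return tasks


def makespan(tasks, workers, sort_first):
    """Simulate a work queue: each task goes to the worker that frees up first.

    Returns the finishing time of the busiest worker, in cost units.
    """
    order = list(tasks)
    if sort_first:
        # Most expensive tasks first (the "longest processing time" rule).
        order.sort(key=lambda t: t[1], reverse=True)
    heap = [(0, w) for w in range(workers)]  # (accumulated load, worker id)
    heapq.heapify(heap)
    for _, cost in order:
        load, w = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, w))
    return max(load for load, _ in heap)


if __name__ == "__main__":
    # Illustrative sizes only; 14 workers mirrors the processor count in the abstract.
    tasks = syrk_tasks(n=2000, k=2000, nb=300)
    for flag in (False, True):
        print("sorted" if flag else "queue order", makespan(tasks, 14, flag))
```

Sorting does not guarantee a shorter schedule for every input, but it avoids the case in which one expensive task is dequeued last while the other workers sit idle. In this model the diagonal blocks cost roughly half as much as the off-diagonal ones, which is why the task costs are uneven in the first place.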

Index Terms

Scalable parallelization of FLAME code via the workqueuing model
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        1. Parallel programming languages

Published In

ACM Transactions on Mathematical Software Volume 34, Issue 2

March 2008

143 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/1326548

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2008

Accepted: 01 April 2007

Revised: 01 April 2007

Received: 01 October 2006

Published in TOMS Volume 34, Issue 2

Qualifiers

Research-article
Research
Refereed

Article Metrics

Total citations: 13
Total downloads: 331 (last 12 months: 0; last 6 weeks: 0)
Reflects downloads up to 02 Mar 2025

Cited By

Šinkarovs A, Koopman T, Scholz S, Keller G, Westrick S (2023). Rank-Polymorphism for Shape-Guided Blocking. Proceedings of the 11th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing, 1-14. DOI: 10.1145/3609024.3609410. Online publication date: 30-Aug-2023
https://dl.acm.org/doi/10.1145/3609024.3609410
Alaejos G, Castelló A, Martínez H, Alonso-Jordá P, Igual F, Quintana-Ortí E (2022). Micro-kernels for portable and efficient matrix multiplication in deep learning. The Journal of Supercomputing 79:7, 8124-8147. DOI: 10.1007/s11227-022-05003-3. Online publication date: 14-Dec-2022
https://dl.acm.org/doi/10.1007/s11227-022-05003-3
San Juan P, Rodríguez-Sánchez R, Igual F, Alonso-Jordá P, Quintana-Ortí E (2021). Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors. The Journal of Supercomputing 77:10, 11257-11269. DOI: 10.1007/s11227-021-03636-4. Online publication date: 1-Oct-2021
https://dl.acm.org/doi/10.1007/s11227-021-03636-4
Hasan M, Whaley R (2014). Effectively Exploiting Parallel Scale for All Problem Sizes in LU Factorization. Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 1039-1048. DOI: 10.1109/IPDPS.2014.109. Online publication date: 19-May-2014
https://dl.acm.org/doi/10.1109/IPDPS.2014.109
Castaldo A, Whaley R, Samuel S (2013). Scaling LAPACK panel operations using parallel cache assignment. ACM Transactions on Mathematical Software 39:4, 1-30. DOI: 10.1145/2491491.2491492. Online publication date: 23-Jul-2013
https://dl.acm.org/doi/10.1145/2491491.2491492
Petschow M, Bientinesi P (2011). MR3-SMP. Parallel Computing 37:12, 795-805. DOI: 10.1016/j.parco.2011.10.001. Online publication date: 1-Dec-2011
https://dl.acm.org/doi/10.1016/j.parco.2011.10.001
Castaldo A, Whaley R (2010). Scaling LAPACK panel operations using parallel cache assignment. ACM SIGPLAN Notices 45:5, 223-232. DOI: 10.1145/1837853.1693484. Online publication date: 9-Jan-2010
https://dl.acm.org/doi/10.1145/1837853.1693484
Castaldo A, Whaley R, Govindarajan R, Padua D, Hall M (2010). Scaling LAPACK panel operations using parallel cache assignment. Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 223-232. DOI: 10.1145/1693453.1693484. Online publication date: 9-Jan-2010
https://dl.acm.org/doi/10.1145/1693453.1693484
Chan E, Van Zee F, Bientinesi P, Quintana-Orti E, Quintana-Orti G, van de Geijn R, Chatterjee S, Scott M (2008). SuperMatrix. Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, 123-132. DOI: 10.1145/1345206.1345227. Online publication date: 20-Feb-2008
https://dl.acm.org/doi/10.1145/1345206.1345227
Jin H, Chapman B, Huang L, an Mey D, Reichstein T (2008). Performance evaluation of a multi-zone application in different OpenMP approaches. International Journal of Parallel Programming 36:3, 312-325. DOI: 10.1007/s10766-008-0074-5. Online publication date: 1-Jun-2008
https://dl.acm.org/doi/10.1007/s10766-008-0074-5