skip to main content
research-article

Scalable parallelization of FLAME code via the workqueuing model

Published: 19 March 2008 Publication History

Abstract

We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands much like a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.

References

[1]
Anderson, E., Bai, Z., Demmel, J., Dongarra, J. E., DuCroz, J., Greenbaum, A., Hammarling, S., McKenney, A. E., Ostrouchov, S., and Sorensen, D. 1992. LAPACK Users' Guide. SIAM, Philadelphia.
[2]
Bientinesi, P. 2006. Mechanical derivation and systematic analysis of correct linear algebra algorithms. Ph.D. dissertation, Department of Computer Sciences, The University of Texas at Austin.
[3]
Bientinesi, P., Gunnels, J. A., Myers, M. E., Quintana-Ortí, E. S., and van de Geijn, R. A. 2005. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Softw. 31, 1 (March), 1--26.
[4]
Bientinesi, P., Quintana-Ortí, E. S., and van de Geijn, R. A. 2005. Representing linear algebra algorithms in code: The FLAME APIs. ACM Trans. Math. Softw. 31, 1 (March), 27--59.
[5]
Dongarra, J. J., Du Croz, J., Hammarling, S., and Duff, I. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1 (March), 1--17.
[6]
Dongarra, J. J., Du Croz, J., Hammarling, S., and Hanson, R. J. 1988. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Softw. 14, 1 (March), 1--17.
[7]
Dow, E. 2005. Take charge of processor affinity. IBM developerWorks. http://www.ibm.com/developerworks/linux/library/l-affinity.html.
[8]
Garey, M. R., Graham, R. L., and Ullman, J. D. 1973. An analysis of some packing algorithms. In Combinatorial Algorithms, R. Rustin, Ed. Algorithmics Press, New York, 39--47.
[9]
Goto, K. 2006. http://www.cs.utexas.edu/users/kgoto.
[10]
Goto, K. and van de Geijn, R. A. 2008. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw. 34, 3.
[11]
Johnson, D. S. 1973. Approximation algorithms for combinatorial problems. In Fifth Annual ACM Symposium on Theory of Computing. ACM, New York, 38--49.
[12]
Lawson, C. L., Hanson, R. J., Kincaid, D. R., and Krogh, F. T. 1979. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw. 5, 3 (Sept.), 308--323.
[13]
Low, T. M., Milfeld, K. F., van de Geijn, R. A., and Van Zee, F. G. 2004. Parallelizing FLAME code with OpenMP task queues. Department of Computer Sciences Tech. Rep. TR-04-05, The University of Texas at Austin (December).
[14]
Low, T. M., van de Geijn, R. A., and Van Zee, F. G. 2005. Extracting SMP parallelism for dense linear algebra algorithms from high-level specifications. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'05). ACM Press, New York, NY, 153--163.
[15]
OpenMP Architecture Review Board. 2006. http://www.openmp.org/.
[16]
Quintana-Ortí, E. S. and van de Geijn, R. A. 2003. Formal derivation of algorithms: The triangular Sylvester equation. ACM Trans. Math. Softw. 29, 2 (June), 218--243.
[17]
Shah, S., Haab, G., Peterson, P., and Throop, J. 1999. Flexible control structures for parallelism in OpenMP. In P2, 6, 7 European Workshop on OpenMP (EWOMP).
[18]
Su, E., Tian, X., Girkar, M., Haab, G., Shah, S., and Peterson, P. 2002. Compiler support of the workqueuing execution model for Intel SMP architectures. In European Workshop on OpenMP (EWOMP).
[19]
van de Geijn, R. A. 1997. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press.
[20]
Weisstein, E. W. 2006. Bin-Packing Problem. From MathWorld---A Wolfram Web Resource. http://mathworld.wolfram.com/Bin-PackingProblem.html.

Cited By

View all
  • (2023)Rank-Polymorphism for Shape-Guided BlockingProceedings of the 11th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing10.1145/3609024.3609410(1-14)Online publication date: 30-Aug-2023
  • (2022)Micro-kernels for portable and efficient matrix multiplication in deep learningThe Journal of Supercomputing10.1007/s11227-022-05003-379:7(8124-8147)Online publication date: 14-Dec-2022
  • (2021)Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processorsThe Journal of Supercomputing10.1007/s11227-021-03636-477:10(11257-11269)Online publication date: 1-Oct-2021
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Mathematical Software
ACM Transactions on Mathematical Software  Volume 34, Issue 2
March 2008
143 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/1326548
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2008
Accepted: 01 April 2007
Revised: 01 April 2007
Received: 01 October 2006
Published in TOMS Volume 34, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. FLAME
  2. OpenMP
  3. SMP
  4. parallel
  5. scalability
  6. workqueuing

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Rank-Polymorphism for Shape-Guided BlockingProceedings of the 11th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing10.1145/3609024.3609410(1-14)Online publication date: 30-Aug-2023
  • (2022)Micro-kernels for portable and efficient matrix multiplication in deep learningThe Journal of Supercomputing10.1007/s11227-022-05003-379:7(8124-8147)Online publication date: 14-Dec-2022
  • (2021)Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processorsThe Journal of Supercomputing10.1007/s11227-021-03636-477:10(11257-11269)Online publication date: 1-Oct-2021
  • (2014)Effectively Exploiting Parallel Scale for All Problem Sizes in LU FactorizationProceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium10.1109/IPDPS.2014.109(1039-1048)Online publication date: 19-May-2014
  • (2013)Scaling LAPACK panel operations using parallel cache assignmentACM Transactions on Mathematical Software10.1145/2491491.249149239:4(1-30)Online publication date: 23-Jul-2013
  • (2011)MR3-SMPParallel Computing10.1016/j.parco.2011.10.00137:12(795-805)Online publication date: 1-Dec-2011
  • (2010)Scaling LAPACK panel operations using parallel cache assignmentACM SIGPLAN Notices10.1145/1837853.169348445:5(223-232)Online publication date: 9-Jan-2010
  • (2010)Scaling LAPACK panel operations using parallel cache assignmentProceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/1693453.1693484(223-232)Online publication date: 9-Jan-2010
  • (2008)SuperMatrixProceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming10.1145/1345206.1345227(123-132)Online publication date: 20-Feb-2008
  • (2008)Performance evaluation of a multi-zone application in different OpenMP approachesInternational Journal of Parallel Programming10.1007/s10766-008-0074-536:3(312-325)Online publication date: 1-Jun-2008
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media