Article

Automatic blocking of QR and LU factorizations for locality

Authors:

Jack DongarraAuthors Info & Claims

MSP '04: Proceedings of the 2004 workshop on Memory system performance

Pages 12 - 22

https://doi.org/10.1145/1065895.1065898

Published: 08 June 2004 Publication History

Abstract

QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provides manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, more benefit can be gained such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures.

References

[1]

N. Ahmed, N. Mateev, and K. Pingali. Synthesizing transformations for locality enhancement of imperfectly-nested loop nests. In Proceedings of the 2000 ACM International Conference on Supercomputing, Santa Fe, New Mexico, May 2000.

Digital Library

[2]

R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, San Francisco, October 2001.

Digital Library

[3]

E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. The Society for Industrial and Applied Mathematics, 1999.

Digital Library

[4]

S. Carr and K. Kennedy. Compiler blockability of numerical algorithms. In Proceedings of Supercomputing, Minneapolis, Nov. 1992.

Digital Library

[5]

S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768--1810, 1994.

Digital Library

[6]

S. Carr and R. Lehoucq. Compiler blockability of dense matrix factorizations. ACM Transactions on Mathematical Software, 23(3), 1997.

Digital Library

[7]

L. Carter, J. Ferrante, and S. F. Hummel. Hierarchical Tiling for Improved Superscalar Performance. In Proc. 9th International Parallel Processing Symposium, Santa Barbara, CA, Apr. 1995.

Digital Library

[8]

S. Coleman and K. S. McKinley. Tile size selection using cache organization. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.

Digital Library

[9]

J. Dongarra, J. Bunch, C. Moler, and G. Stewart. LINPACK Users' Guide. Society for Industrial and Applied Mathematics, 1979.

[10]

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, Jan. 1984.

Digital Library

[11]

D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5(5):587--616, Oct. 1988.

Digital Library

[12]

W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Library Interface Guide. Technical report, Dept. of Computer Science, Univ. of Maryland, College Park, Apr. 1996.

Digital Library

[13]

K. Kennedy. Fast greedy weighted fusion. In Proceedings of the International Conference on Supercomputing, Santa Fe, NM, May 2000.

Digital Library

[14]

K. Kennedy and K. S. McKinley. Typed fusion with applications to parallel and sequential code generation. Technical Report TR93-208, Dept. of Computer Science, Rice University, Aug. 1993. (also available as CRPC-TR94370).

[15]

I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, Las Vegas, NV, June 1997.

Digital Library

[16]

M. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), Santa Clara, Apr. 1991.

Digital Library

[17]

A. W. Lim, G. I. Cheong, and M. S. Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing, Rhodes, Greece, June 1999.

Digital Library

[18]

K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424--453, July 1996.

Digital Library

[19]

N. Mitchell, L. Carter, J. Ferrante, and K. Hgstedt. Quantifying the multi-level nature of tiling interactions. In 10th International Workshop on Languages and Compilers for Parallel Computing, August 1997.

Digital Library

[20]

W. Pugh. Uniform techniques for loop optimization. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, Germany, June 1991.

Digital Library

[21]

G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998.

Digital Library

[22]

William Pugh and Evan Rosser. Iteration Space Slicing For Locality. In LCPC 99, July 1999.

Digital Library

[23]

M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Toronto, June 1991.

Digital Library

[24]

M. J. Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, MA, 1989.

Digital Library

[25]

Q. Yi, K. Kennedy, and V. Adve. Transforming complex loop nests for locality. The Journal Of Supercomputing, 27:219--264, 2004.

Digital Library

Cited By

Dongarra JGates MKurzak JLuszczek PTsai Y(2018)Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware AcceleratorsProceedings of the IEEE10.1109/JPROC.2018.2868961106:11(2040-2055)Online publication date: Nov-2018
https://doi.org/10.1109/JPROC.2018.2868961
Dongarra JFaverge MLtaief HLuszczek P(2014)Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivotingConcurrency and Computation: Practice & Experience10.1002/cpe.311026:7(1408-1431)Online publication date: 1-May-2014
https://dl.acm.org/doi/10.1002/cpe.3110
GUO LTANG YDOU YLEI YMA MZHOU J(2013)Window Memory Layout Scheme for Alternate Row-Wise/Column-Wise Matrix AccessIEICE Transactions on Information and Systems10.1587/transinf.E96.D.2765E96.D:12(2765-2775)Online publication date: 2013
https://doi.org/10.1587/transinf.E96.D.2765
Show More Cited By

Index Terms

Automatic blocking of QR and LU factorizations for locality
1. Mathematics of computing
  1. Mathematical software
2. Software and its engineering
  1. Software creation and management
    1. Software development techniques

Recommendations

Performance Predictions of Multilevel Communication Optimal LU and QR Factorizations on Hierarchical Platforms
ISC 2014: Proceedings of the 29th International Conference on Supercomputing - Volume 8488

In this paper we study the performance of two classical dense linear algebra algorithms, the LU and the QR factorizations, on multilevel hierarchical platforms. We note that we focus on multilevel QR factorization, and give a brief description of the ...
Implementing a parallel matrix factorization library on the cell broadband engine
High Performance Computing with the Cell Broadband Engine

Matrix factorization (or often called decomposition) is a frequently used kernel in a large number of applications ranging from linear solvers to data clustering and machine learning. The central contribution of this paper is a thorough performance ...
Scaling LAPACK panel operations using parallel cache assignment

In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent scaling, but panel ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

MSP '04: Proceedings of the 2004 workshop on Memory system performance

June 2004

70 pages

ISBN:1581139411

DOI:10.1145/1065895

General Chair:
Chen Ding
University of Rochester
,
Program Chair:
Steve Blackburn

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2004

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Acceptance Rates

Overall Acceptance Rate 6 of 20 submissions, 30%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

12
Total Citations
View Citations
189
Total Downloads

Downloads (Last 12 months)3
Downloads (Last 6 weeks)0

Reflects downloads up to 09 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Dongarra JGates MKurzak JLuszczek PTsai Y(2018)Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware AcceleratorsProceedings of the IEEE10.1109/JPROC.2018.2868961106:11(2040-2055)Online publication date: Nov-2018
https://doi.org/10.1109/JPROC.2018.2868961
Dongarra JFaverge MLtaief HLuszczek P(2014)Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivotingConcurrency and Computation: Practice & Experience10.1002/cpe.311026:7(1408-1431)Online publication date: 1-May-2014
https://dl.acm.org/doi/10.1002/cpe.3110
GUO LTANG YDOU YLEI YMA MZHOU J(2013)Window Memory Layout Scheme for Alternate Row-Wise/Column-Wise Matrix AccessIEICE Transactions on Information and Systems10.1587/transinf.E96.D.2765E96.D:12(2765-2775)Online publication date: 2013
https://doi.org/10.1587/transinf.E96.D.2765
Ltaief HLuszczek PDongarra J(2013)High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architecturesACM Transactions on Mathematical Software10.1145/2450153.245015439:3(1-22)Online publication date: 3-May-2013
https://dl.acm.org/doi/10.1145/2450153.2450154
Wu GDou YSun JPeterson G(2012)A High Performance and Memory Efficient LU Decomposer on FPGAsIEEE Transactions on Computers10.1109/TC.2010.27861:3(366-378)Online publication date: 1-Mar-2012
https://dl.acm.org/doi/10.1109/TC.2010.278
Luszczek PLtaief HDongarra J(2011)Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore ArchitecturesProceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium10.1109/IPDPS.2011.91(944-955)Online publication date: 16-May-2011
https://dl.acm.org/doi/10.1109/IPDPS.2011.91
Mao FShen X(2010)LU decomposition on cell broadband engineProceedings of the 2010 IFIP international conference on Network and parallel computing10.5555/1882011.1882020(61-75)Online publication date: 13-Sep-2010
https://dl.acm.org/doi/10.5555/1882011.1882020
Mao FShen X(2010)LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip MultiprocessorsNetwork and Parallel Computing10.1007/978-3-642-15672-4_7(61-75)Online publication date: 2010
https://doi.org/10.1007/978-3-642-15672-4_7
Seymour KYou HDongarra J(2008)A comparison of search heuristics for empirical code optimization2008 IEEE International Conference on Cluster Computing10.1109/CLUSTR.2008.4663803(421-429)Online publication date: Sep-2008
https://doi.org/10.1109/CLUSTR.2008.4663803
Tai YLo CPsarris K(2007)Applying Out-of-Core QR Decomposition Algorithms on FPGA-Based Systems2007 International Conference on Field Programmable Logic and Applications10.1109/FPL.2007.4380630(86-91)Online publication date: Aug-2007
https://doi.org/10.1109/FPL.2007.4380630
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten