skip to main content
10.1145/1065895.1065898acmconferencesArticle/Chapter ViewAbstractPublication PagesmspConference Proceedingsconference-collections
Article

Automatic blocking of QR and LU factorizations for locality

Published: 08 June 2004 Publication History

Abstract

QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provides manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, more benefit can be gained such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures.

References

[1]
N. Ahmed, N. Mateev, and K. Pingali. Synthesizing transformations for locality enhancement of imperfectly-nested loop nests. In Proceedings of the 2000 ACM International Conference on Supercomputing, Santa Fe, New Mexico, May 2000.
[2]
R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, San Francisco, October 2001.
[3]
E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. The Society for Industrial and Applied Mathematics, 1999.
[4]
S. Carr and K. Kennedy. Compiler blockability of numerical algorithms. In Proceedings of Supercomputing, Minneapolis, Nov. 1992.
[5]
S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768--1810, 1994.
[6]
S. Carr and R. Lehoucq. Compiler blockability of dense matrix factorizations. ACM Transactions on Mathematical Software, 23(3), 1997.
[7]
L. Carter, J. Ferrante, and S. F. Hummel. Hierarchical Tiling for Improved Superscalar Performance. In Proc. 9th International Parallel Processing Symposium, Santa Barbara, CA, Apr. 1995.
[8]
S. Coleman and K. S. McKinley. Tile size selection using cache organization. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.
[9]
J. Dongarra, J. Bunch, C. Moler, and G. Stewart. LINPACK Users' Guide. Society for Industrial and Applied Mathematics, 1979.
[10]
J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, Jan. 1984.
[11]
D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5(5):587--616, Oct. 1988.
[12]
W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Library Interface Guide. Technical report, Dept. of Computer Science, Univ. of Maryland, College Park, Apr. 1996.
[13]
K. Kennedy. Fast greedy weighted fusion. In Proceedings of the International Conference on Supercomputing, Santa Fe, NM, May 2000.
[14]
K. Kennedy and K. S. McKinley. Typed fusion with applications to parallel and sequential code generation. Technical Report TR93-208, Dept. of Computer Science, Rice University, Aug. 1993. (also available as CRPC-TR94370).
[15]
I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, Las Vegas, NV, June 1997.
[16]
M. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), Santa Clara, Apr. 1991.
[17]
A. W. Lim, G. I. Cheong, and M. S. Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing, Rhodes, Greece, June 1999.
[18]
K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424--453, July 1996.
[19]
N. Mitchell, L. Carter, J. Ferrante, and K. Hgstedt. Quantifying the multi-level nature of tiling interactions. In 10th International Workshop on Languages and Compilers for Parallel Computing, August 1997.
[20]
W. Pugh. Uniform techniques for loop optimization. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, Germany, June 1991.
[21]
G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998.
[22]
William Pugh and Evan Rosser. Iteration Space Slicing For Locality. In LCPC 99, July 1999.
[23]
M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Toronto, June 1991.
[24]
M. J. Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, MA, 1989.
[25]
Q. Yi, K. Kennedy, and V. Adve. Transforming complex loop nests for locality. The Journal Of Supercomputing, 27:219--264, 2004.

Cited By

View all
  • (2018)Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware AcceleratorsProceedings of the IEEE10.1109/JPROC.2018.2868961106:11(2040-2055)Online publication date: Nov-2018
  • (2014)Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivotingConcurrency and Computation: Practice & Experience10.1002/cpe.311026:7(1408-1431)Online publication date: 1-May-2014
  • (2013)Window Memory Layout Scheme for Alternate Row-Wise/Column-Wise Matrix AccessIEICE Transactions on Information and Systems10.1587/transinf.E96.D.2765E96.D:12(2765-2775)Online publication date: 2013
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
MSP '04: Proceedings of the 2004 workshop on Memory system performance
June 2004
70 pages
ISBN:1581139411
DOI:10.1145/1065895
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2004

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. LAPACK
  2. LU
  3. QR
  4. blocking
  5. locality
  6. loop optimizations

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 6 of 20 submissions, 30%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)0
Reflects downloads up to 09 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2018)Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware AcceleratorsProceedings of the IEEE10.1109/JPROC.2018.2868961106:11(2040-2055)Online publication date: Nov-2018
  • (2014)Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivotingConcurrency and Computation: Practice & Experience10.1002/cpe.311026:7(1408-1431)Online publication date: 1-May-2014
  • (2013)Window Memory Layout Scheme for Alternate Row-Wise/Column-Wise Matrix AccessIEICE Transactions on Information and Systems10.1587/transinf.E96.D.2765E96.D:12(2765-2775)Online publication date: 2013
  • (2013)High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architecturesACM Transactions on Mathematical Software10.1145/2450153.245015439:3(1-22)Online publication date: 3-May-2013
  • (2012)A High Performance and Memory Efficient LU Decomposer on FPGAsIEEE Transactions on Computers10.1109/TC.2010.27861:3(366-378)Online publication date: 1-Mar-2012
  • (2011)Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore ArchitecturesProceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium10.1109/IPDPS.2011.91(944-955)Online publication date: 16-May-2011
  • (2010)LU decomposition on cell broadband engineProceedings of the 2010 IFIP international conference on Network and parallel computing10.5555/1882011.1882020(61-75)Online publication date: 13-Sep-2010
  • (2010)LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip MultiprocessorsNetwork and Parallel Computing10.1007/978-3-642-15672-4_7(61-75)Online publication date: 2010
  • (2008)A comparison of search heuristics for empirical code optimization2008 IEEE International Conference on Cluster Computing10.1109/CLUSTR.2008.4663803(421-429)Online publication date: Sep-2008
  • (2007)Applying Out-of-Core QR Decomposition Algorithms on FPGA-Based Systems2007 International Conference on Field Programmable Logic and Applications10.1109/FPL.2007.4380630(86-91)Online publication date: Aug-2007
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media