Abstract
In scalable parallel machines, processors can make local memory accesses much faster than they can make remote memory accesses. Additionally, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data rather than to use many small messages. To run well on such machines, software must exploit these features. We believe it is too onerous for a programmer to do this by hand, so we have been exploring the use of restructuring compiler technology for this purpose. In this article, we start with a language like HPF-Fortran with user-specified data distribution and develop a systematic loop transformation strategy called access normalization that restructures loop nests to exploit locality and block transfers. We demonstrate the power of our techniques using routines from the BLAS (Basic Linear Algebra Subprograms) library. An important feature of our approach is that we model loop transformation using invertible matrices and integer lattice theory.
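To make the matrix view of loop restructuring concrete, here is a minimal sketch, not drawn from the paper itself: the loop nest, the array B, the bound N, and the particular matrix T are illustrative assumptions. The original nest indexes B with the compound subscript i+j, so neither loop walks a single row of B; applying the invertible integer matrix T = [[1, 1], [0, 1]] to the iteration vector gives new indices (u, v) = (i+j, j), after which the outer index directly addresses one row at a time, the access pattern a compiler needs in order to use block transfers.

```c
#include <stdio.h>

#define N 4

static int max(int a, int b) { return a > b ? a : b; }
static int min(int a, int b) { return a < b ? a : b; }

int main(void) {
    static int B1[2 * N - 1][N] = {0};
    static int B2[2 * N - 1][N] = {0};

    /* Original nest: the row index of B is the compound expression i+j,
       so consecutive iterations of either loop touch different rows. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B1[i + j][j] += 1;

    /* Access-normalized nest: applying T = [[1, 1], [0, 1]] to (i, j)
       gives u = i + j, v = j.  Inverting T yields i = u - v, j = v, and
       the bounds below follow from 0 <= i < N and 0 <= j < N.  The outer
       index u now addresses a single row of B, so the data touched by one
       u-iteration can be fetched in one block transfer. */
    for (int u = 0; u <= 2 * N - 2; u++)
        for (int v = max(0, u - N + 1); v <= min(u, N - 1); v++)
            B2[u][v] += 1;

    /* Both nests perform exactly the same set of updates. */
    int same = 1;
    for (int r = 0; r < 2 * N - 1; r++)
        for (int c = 0; c < N; c++)
            same &= (B1[r][c] == B2[r][c]);
    printf("nests equivalent: %s\n", same ? "yes" : "no");
    return 0;
}
```

Note that the new loop bounds are obtained mechanically by mapping the original bounds through the inverse of T (i = u - v, j = v); systematizing that bound computation over the integer lattice of iteration points is exactly what the invertible-matrix framework buys.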