Abstract
In scalable parallel machines, processors can make local memory accesses much faster than they can make remote memory accesses. Additionally, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data rather than to use many small messages. To run well on such machines, software must exploit these features. We believe it is too onerous for a programmer to do this by hand, so we have been exploring the use of restructuring compiler technology for this purpose. In this article, we start with a language like HPF-Fortran with user-specified data distribution and develop a systematic loop transformation strategy called access normalization that restructures loop nests to exploit locality and block transfers. We demonstrate the power of our techniques using routines from the BLAS (Basic Linear Algebra Subprograms) library. An important feature of our approach is that we model loop transformation using invertible matrices and integer lattice theory.
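To make the matrix view of loop restructuring concrete, here is a minimal sketch, not drawn from the paper itself: the loop nest, the array B, the bound N, and the particular matrix T are illustrative assumptions. The original nest indexes B with the compound subscript i+j, so neither loop walks a single row of B; applying the invertible integer matrix T = [[1, 1], [0, 1]] to the iteration vector gives new indices (u, v) = (i+j, j), after which the outer index directly addresses one row at a time, the access pattern a compiler needs in order to use block transfers.

```c
#include <stdio.h>

#define N 4

static int max(int a, int b) { return a > b ? a : b; }
static int min(int a, int b) { return a < b ? a : b; }

int main(void) {
    static int B1[2 * N - 1][N] = {0};
    static int B2[2 * N - 1][N] = {0};

    /* Original nest: the row index of B is the compound expression i+j,
       so consecutive iterations of either loop touch different rows. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B1[i + j][j] += 1;

    /* Access-normalized nest: applying T = [[1, 1], [0, 1]] to (i, j)
       gives u = i + j, v = j.  Inverting T yields i = u - v, j = v, and
       the bounds below follow from 0 <= i < N and 0 <= j < N.  The outer
       index u now addresses a single row of B, so the data touched by one
       u-iteration can be fetched in one block transfer. */
    for (int u = 0; u <= 2 * N - 2; u++)
        for (int v = max(0, u - N + 1); v <= min(u, N - 1); v++)
            B2[u][v] += 1;

    /* Both nests perform exactly the same set of updates. */
    int same = 1;
    for (int r = 0; r < 2 * N - 1; r++)
        for (int c = 0; c < N; c++)
            same &= (B1[r][c] == B2[r][c]);
    printf("nests equivalent: %s\n", same ? "yes" : "no");
    return 0;
}
```

Note that the new loop bounds are obtained mechanically by mapping the original bounds through the inverse of T (i = u - v, j = v); systematizing that bound computation over the integer lattice of iteration points is exactly what the invertible-matrix framework buys.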