Article

Optimizing irregular shared-memory applications for distributed-memory systems

Published: 29 March 2006

ABSTRACT

In prior work, we proposed techniques that extend the ease of shared-memory parallel programming to distributed-memory platforms through automatic translation of OpenMP programs to MPI. For irregular applications, the performance of this translation scheme is limited by the fact that accesses to shared data cannot be accurately resolved at compile time. Additionally, irregular applications with high communication-to-computation ratios pose challenges even for direct implementation on message-passing systems. In this paper, we present combined compile-time/run-time techniques for optimizing irregular shared-memory applications on message-passing systems in the context of automatic translation from OpenMP to MPI. Our transformations enable computation-communication overlap by restructuring irregular parallel loops. The compiler creates inspectors that analyze the actual data access patterns of irregular accesses at run time. This analysis is combined with compile-time analysis of regular data accesses to determine which iterations of irregular loops access non-local data. The iterations are then reordered to enable computation-communication overlap. Where the irregular access occurs inside nested loops, the loop nest is restructured. We evaluate our techniques by translating OpenMP versions of three benchmarks from two important classes of irregular applications: sparse matrix computations and molecular dynamics. We find that for these applications, on sixteen nodes, versions employing computation-communication overlap are almost twice as fast as baseline OpenMP-to-MPI versions, almost 30% faster than inspector-only versions, almost 25% faster than hand-coded versions on two applications, and about 9% slower on the third.


Published in

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
March 2006, 258 pages
ISBN: 1595931899
DOI: 10.1145/1122971

Copyright © 2006 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%
