DOI: 10.1145/223982.224442

Optimizing memory system performance for communication in parallel computers

Published: 01 May 1995

ABSTRACT

Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message passing systems. Depending on the application, we observe a variety of patterns as part of communication steps, e.g., regular (i.e., blocks of data), strided, or irregular (indexed) memory accesses. The effective speed of these communication steps is determined by the network bandwidth and the memory bandwidth, and measurements on current parallel supercomputers indicate that the performance is limited by the memory bandwidth rather than the network bandwidth.

Current systems provide a wealth of options to perform communication, and a compiler or user is faced with the difficulty of finding the communication operations that best use the available memory and network bandwidth. This paper provides a framework to evaluate different solutions for inter-node communication and presents the copy-transfer model; this model captures the contributions of the memory system to inter-node communication. We demonstrate the usefulness of this simple model by applying it to two commercial parallel systems, the Cray T3D and the Intel Paragon.

In particular we identify two methods to transfer data between nodes in these two machines. In buffer-packing transfers, a contiguous block of data is transferred across the network. If the data are not stored contiguously, they are copied to (gathering) or from (scattering) buffers in local memory before and after the transfer. Chained transfers perform gathering, transfer and scattering in one step, reading the data elements with some non-sequential pattern and immediately transferring them on to the destination.

Our model and measurements indicate that chaining of the gather, transfer, and scatter operations results in better performance than buffer packing for many important access patterns. Most standard message passing libraries (like MPI, PVM or NX) force the parallelizing compiler (or the programmer) to employ the buffer-packing communication operations. However, the addition of hardware support dedicated to communication (e.g., DMAs, line-transfer units) now gives the compiler a wider range of options.
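
The difference between the two transfer methods can be illustrated with a small single-process sketch. It is not taken from the paper: the two "nodes" are plain Python lists, the network transfer is simulated by a list copy, and the function names and the `moves` counter are hypothetical. The counter only tallies how often an element is read and written; it is a rough stand-in for memory traffic, not the paper's copy-transfer model.

```python
def buffer_packing_transfer(src, src_idx, dst, dst_idx):
    """Gather into a contiguous buffer, transfer the block, then scatter.

    Returns the number of element moves (hypothetical cost measure).
    """
    moves = 0

    # Gather: copy the non-contiguous source elements into a local send buffer.
    send_buf = []
    for i in src_idx:
        send_buf.append(src[i])
        moves += 1

    # Transfer: a list copy stands in for sending one contiguous block.
    recv_buf = list(send_buf)
    moves += len(send_buf)

    # Scatter: copy from the receive buffer to the final destination locations.
    for j, value in zip(dst_idx, recv_buf):
        dst[j] = value
        moves += 1

    return moves


def chained_transfer(src, src_idx, dst, dst_idx):
    """Read each element with the source pattern and store it directly
    at its destination location, with no intermediate buffers.

    Returns the number of element moves (hypothetical cost measure).
    """
    moves = 0
    for i, j in zip(src_idx, dst_idx):
        dst[j] = src[i]
        moves += 1
    return moves


# Example: strided read on the sender, indexed (irregular) write on the receiver.
src = list(range(16))
src_idx = list(range(0, 16, 4))   # stride-4 access: elements 0, 4, 8, 12
dst_idx = [3, 2, 1, 0]            # indexed access on the destination

dst_packed = [None] * 4
dst_chained = [None] * 4
packed_moves = buffer_packing_transfer(src, src_idx, dst_packed, dst_idx)
chained_moves = chained_transfer(src, src_idx, dst_chained, dst_idx)

print(dst_packed, packed_moves)
print(dst_chained, chained_moves)
```

Both functions leave the same data at the destination; in this toy accounting the buffer-packing variant touches each element three times and the chained variant once. On real hardware the relative cost depends on the memory system and the access pattern, which is what the abstract says the copy-transfer model is meant to capture.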


Published in

ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture
July 1995
426 pages
ISBN: 0897916980
DOI: 10.1145/223982

ACM SIGARCH Computer Architecture News, Volume 23, Issue 2
Special Issue: Proceedings of the 22nd annual international symposium on Computer architecture (ISCA '95)
May 1995
412 pages
ISSN: 0163-5964
DOI: 10.1145/225830

Copyright © 1995 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 543 of 3,203 submissions, 17%
