ABSTRACT
Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message-passing systems. Depending on the application, we observe a variety of patterns in these communication steps, e.g., regular (i.e., blocks of data), strided, or irregular (indexed) memory accesses. The effective speed of these communication steps is determined by the network bandwidth and the memory bandwidth, and measurements on current parallel supercomputers indicate that performance is limited by the memory bandwidth rather than the network bandwidth.

Current systems provide a wealth of options to perform communication, and a compiler or user is faced with the difficulty of finding the communication operations that best use the available memory and network bandwidth. This paper provides a framework to evaluate different solutions for inter-node communication and presents the copy-transfer model, which captures the contributions of the memory system to inter-node communication. We demonstrate the usefulness of this simple model by applying it to two commercial parallel systems, the Cray T3D and the Intel Paragon.

In particular, we identify two methods to transfer data between nodes on these machines. In buffer-packing transfers, a contiguous block of data is transferred across the network. If the data are not stored contiguously, they are copied to (gathering) or from (scattering) buffers in local memory before and after the transfer. Chained transfers perform gathering, transfer, and scattering in one step, reading the data elements with some non-sequential pattern and immediately forwarding them to the destination.

Our model and measurements indicate that chaining the gather, transfer, and scatter operations results in better performance than buffer packing for many important access patterns.
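The buffer-packing scheme described above can be sketched in a few lines. The following is a minimal simulation, not the machines' actual transfer code: the network transfer is reduced to a plain list copy, and all function names are illustrative.

```python
def gather(src, count, stride):
    """Pack `count` elements read at the given stride into a contiguous buffer."""
    return [src[i * stride] for i in range(count)]

def scatter(dst, buf, stride):
    """Unpack a contiguous buffer back out at the given stride."""
    for i, v in enumerate(buf):
        dst[i * stride] = v

def buffer_packing_transfer(src, dst, count, stride):
    send_buf = gather(src, count, stride)  # sender-side copy (packing)
    recv_buf = list(send_buf)              # contiguous block crosses the "network"
    scatter(dst, recv_buf, stride)         # receiver-side copy (unpacking)

src = list(range(12))
dst = [0] * 12
buffer_packing_transfer(src, dst, count=4, stride=3)
# dst is now [0, 0, 0, 3, 0, 0, 6, 0, 0, 9, 0, 0]: elements at offsets
# 0, 3, 6, 9 arrive at the same strided offsets on the receiver.
```

Note that every element is read and written three times (pack, transfer, unpack), which is why the memory system, not the network, tends to dominate the cost of such transfers.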
Most standard message-passing libraries (such as MPI, PVM, or NX) force the parallelizing compiler (or the programmer) to employ buffer-packing communication operations. However, the addition of hardware support dedicated to communication (e.g., DMA engines, line-transfer units) now gives the compiler a wider range of options.
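A chained transfer, by contrast, fuses gathering, transfer, and scattering into a single pass: each element is read with its non-sequential pattern and forwarded directly to its destination, with no staging buffers. The sketch below simulates this access pattern in plain Python; the direct assignment into the destination array stands in for the remote write performed by put/get or DMA hardware, and the names are illustrative.

```python
def chained_transfer(src, dst, count, stride):
    """Fused gather/transfer/scatter: each strided read is forwarded
    straight to its strided destination, touching every element once."""
    for i in range(count):
        dst[i * stride] = src[i * stride]  # one read, one "remote" write

src = list(range(12))
dst = [0] * 12
chained_transfer(src, dst, count=4, stride=3)
# Same result as the buffer-packing version, but each element incurs
# one memory read and one write instead of three of each.
```

This per-element saving in memory traffic is the intuition behind the measured advantage of chained transfers for many access patterns.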
Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95).