ABSTRACT
Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message-passing systems. Depending on the application, we observe a variety of patterns in these communication steps, e.g., regular (i.e., blocks of data), strided, or irregular (indexed) memory accesses. The effective speed of these communication steps is determined by the network bandwidth and the memory bandwidth, and measurements on current parallel supercomputers indicate that performance is limited by the memory bandwidth rather than the network bandwidth.

Current systems provide a wealth of options to perform communication, and a compiler or user is faced with the difficulty of finding the communication operations that best use the available memory and network bandwidth. This paper provides a framework to evaluate different solutions for inter-node communication and presents the copy-transfer model, which captures the contributions of the memory system to inter-node communication. We demonstrate the usefulness of this simple model by applying it to two commercial parallel systems, the Cray T3D and the Intel Paragon.

In particular, we identify two methods to transfer data between nodes on these machines. In buffer-packing transfers, a contiguous block of data is transferred across the network. If the data are not stored contiguously, they are copied to (gathering) or from (scattering) buffers in local memory before and after the transfer. Chained transfers perform gathering, transfer, and scattering in one step, reading the data elements with some non-sequential pattern and immediately forwarding them to the destination.

Our model and measurements indicate that chaining the gather, transfer, and scatter operations results in better performance than buffer packing for many important access patterns.
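The buffer-packing scheme described above can be sketched in a few lines. The following is a minimal simulation, not the machines' actual transfer code: the network transfer is reduced to a plain list copy, and all function names are illustrative.

```python
def gather(src, count, stride):
    """Pack `count` elements read at the given stride into a contiguous buffer."""
    return [src[i * stride] for i in range(count)]

def scatter(dst, buf, stride):
    """Unpack a contiguous buffer back out at the given stride."""
    for i, v in enumerate(buf):
        dst[i * stride] = v

def buffer_packing_transfer(src, dst, count, stride):
    send_buf = gather(src, count, stride)  # sender-side copy (packing)
    recv_buf = list(send_buf)              # contiguous block crosses the "network"
    scatter(dst, recv_buf, stride)         # receiver-side copy (unpacking)

src = list(range(12))
dst = [0] * 12
buffer_packing_transfer(src, dst, count=4, stride=3)
# dst is now [0, 0, 0, 3, 0, 0, 6, 0, 0, 9, 0, 0]: elements at offsets
# 0, 3, 6, 9 arrive at the same strided offsets on the receiver.
```

Note that every element is read and written three times (pack, transfer, unpack), which is why the memory system, not the network, tends to dominate the cost of such transfers.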
Most standard message-passing libraries (such as MPI, PVM, or NX) force the parallelizing compiler (or the programmer) to employ buffer-packing communication operations. However, the addition of hardware support dedicated to communication (e.g., DMA engines, line-transfer units) now gives the compiler a wider range of options.
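A chained transfer, by contrast, fuses gathering, transfer, and scattering into a single pass: each element is read with its non-sequential pattern and forwarded directly to its destination, with no staging buffers. The sketch below simulates this access pattern in plain Python; the direct assignment into the destination array stands in for the remote write performed by put/get or DMA hardware, and the names are illustrative.

```python
def chained_transfer(src, dst, count, stride):
    """Fused gather/transfer/scatter: each strided read is forwarded
    straight to its strided destination, touching every element once."""
    for i in range(count):
        dst[i * stride] = src[i * stride]  # one read, one "remote" write

src = list(range(12))
dst = [0] * 12
chained_transfer(src, dst, count=4, stride=3)
# Same result as the buffer-packing version, but each element incurs
# one memory read and one write instead of three of each.
```

This per-element saving in memory traffic is the intuition behind the measured advantage of chained transfers for many access patterns.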
Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95).