DOI: 10.1145/2907294.2907317
research-article

GPU-Aware Non-contiguous Data Movement In Open MPI

Published: 31 May 2016

ABSTRACT

Owing to their superior parallel density and power efficiency, GPUs have become increasingly popular in scientific applications. Many of these applications build on the ubiquitous Message Passing Interface (MPI) programming paradigm and exchange data between processes using non-contiguous memory layouts. However, support for efficient non-contiguous data movement of GPU-resident data is still in its infancy, negatively impacting overall application performance.
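To make such a layout concrete, here is a minimal illustration (ours, not taken from the paper) of a non-contiguous exchange: MPI_Type_vector describes one column of a row-major matrix as a strided derived datatype, so a single send moves the entire column.

```c
#include <mpi.h>

/* Illustrative sketch, not from the paper: exchange one column of a
 * row-major ROWS x COLS matrix as a single non-contiguous datatype. */
enum { ROWS = 4, COLS = 8 };

int main(int argc, char **argv) {
    int rank;
    double matrix[ROWS][COLS];
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ROWS blocks of 1 double each, separated by a stride of COLS
     * elements: exactly the memory layout of one matrix column. */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)        /* send column 2 from rank 0 to rank 1 */
        MPI_Send(&matrix[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&matrix[0][2], 1, column, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```

When the matrix resides in GPU memory, each of those strided blocks would otherwise require a separate copy, which is the inefficiency the paper targets.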

To address this shortcoming, we present a solution that exploits the inherent parallelism of the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture, and GPUDirect capabilities. In this design, the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movement between nodes. By incorporating our design into the Open MPI library, we demonstrate significantly better performance for non-contiguous GPU-resident data transfers on both shared- and distributed-memory machines.
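As a rough sketch of the offloading idea (our own simplification, not the paper's actual kernels), a CUDA kernel can pack a vector-like GPU-resident layout into a contiguous staging buffer with one thread per element; unpacking is the mirror image, and the CPU-driven transport then moves the contiguous buffer between nodes.

```cuda
/* Hypothetical pack kernel for a vector-like datatype: count blocks of
 * blocklen doubles, separated by stride elements (all on the GPU). */
__global__ void pack_vector(const double *src, double *packed,
                            int count, int blocklen, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < count * blocklen) {
        int block  = i / blocklen;                  /* which strided block */
        int offset = i % blocklen;                  /* position inside it */
        packed[i] = src[block * stride + offset];
    }
}

/* Host side: cover every element with threads, then hand the contiguous
 * buffer to the (CPU-driven) transport for the inter-node transfer. */
void pack_on_gpu(const double *d_src, double *d_packed, int count,
                 int blocklen, int stride, cudaStream_t stream) {
    int total   = count * blocklen;
    int threads = 256;
    int blocks  = (total + threads - 1) / threads;
    pack_vector<<<blocks, threads, 0, stream>>>(d_src, d_packed,
                                                count, blocklen, stride);
}
```

With GPUDirect, such a packed buffer can in principle be handed to the network without staging through host memory, matching the division of labor the abstract describes.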


Published in

HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
May 2016, 302 pages
ISBN: 978-1-4503-4314-5
DOI: 10.1145/2907294
Copyright © 2016 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

HPDC '16 paper acceptance rate: 20 of 129 submissions, 16%
Overall acceptance rate: 166 of 966 submissions, 17%
