ABSTRACT
Due to their superior parallel density and power efficiency, GPUs have become increasingly popular in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movement for GPU-resident data is still in its infancy, negatively impacting overall application performance.
To address this shortcoming, we present a solution that takes advantage of the inherent parallelism in the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture, and GPUDirect capabilities. In this design, the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library, we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.
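To make the offloaded pack operation concrete, below is a minimal CUDA sketch, not the paper's actual kernel. It packs an MPI_Type_vector-like layout (count blocks of blocklen elements, separated by stride elements) from strided GPU memory into a contiguous buffer. The kernel name, parameter names, and the use of Unified Memory for brevity are all assumptions for illustration; Open MPI's real datatype engine handles arbitrary layouts driven by its stack-based description.

```c
/* Minimal sketch (illustrative, not the paper's kernel) of a GPU pack
 * for a vector-style datatype: count blocks of blocklen doubles,
 * separated by stride doubles in device memory. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void pack_vector_kernel(const double *src, double *dst,
                                   size_t count, size_t blocklen,
                                   size_t stride)
{
    /* One thread per element of the packed buffer; each thread maps
     * its contiguous output index back to the strided source index. */
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t total = count * blocklen;
    if (i < total) {
        size_t block  = i / blocklen;   /* which strided block        */
        size_t offset = i % blocklen;   /* position inside the block  */
        dst[i] = src[block * stride + offset];
    }
}

int main(void)
{
    const size_t count = 1024, blocklen = 4, stride = 16;
    double *src, *dst;

    /* Unified Memory keeps the sketch short; the paper's design also
     * relies on GPUDirect for the inter-node transfer itself. */
    cudaMallocManaged(&src, count * stride * sizeof(double));
    cudaMallocManaged(&dst, count * blocklen * sizeof(double));
    for (size_t i = 0; i < count * stride; ++i)
        src[i] = (double)i;

    size_t total = count * blocklen;
    int threads = 256;
    int blocks = (int)((total + threads - 1) / threads);
    pack_vector_kernel<<<blocks, threads>>>(src, dst, count, blocklen, stride);
    cudaDeviceSynchronize();

    /* dst now holds contiguous packed data, ready for the CPU-driven
     * send path (e.g., an MPI_Send on the packed buffer). */
    printf("dst[0]=%.1f dst[%zu]=%.1f\n", dst[0], total - 1, dst[total - 1]);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

Because every packed element can be computed independently, the pack maps naturally onto one GPU thread per element, which is the inherent parallelism the abstract refers to; the unpack direction is the same kernel with source and destination index roles exchanged.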