ABSTRACT
Due to their superior parallel density and power efficiency, GPUs have become increasingly popular in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movement for GPU-resident data is still in its infancy, negatively impacting overall application performance.
To address this shortcoming, we present a solution that takes advantage of the inherent parallelism in the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture, and GPUDirect capabilities. In this design, the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library, we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.
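To make the offloaded pack operation concrete, below is a minimal CUDA sketch, not the paper's actual kernel. It packs an MPI_Type_vector-like layout (count blocks of blocklen elements, separated by stride elements) from strided GPU memory into a contiguous buffer. The kernel name, parameter names, and the use of Unified Memory for brevity are all assumptions for illustration; Open MPI's real datatype engine handles arbitrary layouts driven by its stack-based description.

```c
/* Minimal sketch (illustrative, not the paper's kernel) of a GPU pack
 * for a vector-style datatype: count blocks of blocklen doubles,
 * separated by stride doubles in device memory. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void pack_vector_kernel(const double *src, double *dst,
                                   size_t count, size_t blocklen,
                                   size_t stride)
{
    /* One thread per element of the packed buffer; each thread maps
     * its contiguous output index back to the strided source index. */
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t total = count * blocklen;
    if (i < total) {
        size_t block  = i / blocklen;   /* which strided block        */
        size_t offset = i % blocklen;   /* position inside the block  */
        dst[i] = src[block * stride + offset];
    }
}

int main(void)
{
    const size_t count = 1024, blocklen = 4, stride = 16;
    double *src, *dst;

    /* Unified Memory keeps the sketch short; the paper's design also
     * relies on GPUDirect for the inter-node transfer itself. */
    cudaMallocManaged(&src, count * stride * sizeof(double));
    cudaMallocManaged(&dst, count * blocklen * sizeof(double));
    for (size_t i = 0; i < count * stride; ++i)
        src[i] = (double)i;

    size_t total = count * blocklen;
    int threads = 256;
    int blocks = (int)((total + threads - 1) / threads);
    pack_vector_kernel<<<blocks, threads>>>(src, dst, count, blocklen, stride);
    cudaDeviceSynchronize();

    /* dst now holds contiguous packed data, ready for the CPU-driven
     * send path (e.g., an MPI_Send on the packed buffer). */
    printf("dst[0]=%.1f dst[%zu]=%.1f\n", dst[0], total - 1, dst[total - 1]);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

Because every packed element can be computed independently, the pack maps naturally onto one GPU thread per element, which is the inherent parallelism the abstract refers to; the unpack direction is the same kernel with source and destination index roles exchanged.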