ABSTRACT
Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability by including better GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed by programmers, and multi-GPU systems are still programmed like distributed systems in which each GPU only accesses its own memory. This increases the complexity of the host code, as programmers need to explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines to program multi-GPU systems like NUMA shared memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories, and data that resides on a different GPU is accessed remotely through the PCIe interconnect. The programmability benefits of the shared-memory model on GPUs are shown using finite difference and image filtering applications. We also present a detailed performance analysis of the PCIe interconnect and the impact of remote accesses on kernel performance. While PCIe imposes long latency and has limited bandwidth compared to local GPU memory, we show that the highly multithreaded GPU execution model can help hide these costs. Evaluation of finite difference and image filtering GPU-SM implementations shows close to linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g., a 40% SLOC reduction in the host code of finite difference).
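For readers unfamiliar with peer memory access, the minimal CUDA sketch below illustrates the mechanism the abstract describes: once peer access is enabled between two devices, a kernel running on one GPU can dereference a pointer into the other GPU's memory directly over PCIe, so no explicit halo copy is needed. This is only a sketch of the general technique, not the paper's implementation: the kernel name `stencil1D`, the two-GPU decomposition, and all buffer names are hypothetical, and the code assumes two peer-capable GPUs on the same PCIe root complex while omitting data initialization and most error checking.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 1D stencil: each GPU owns one chunk of the domain. The
// element just past a chunk's right edge is read directly from the peer
// GPU's memory over PCIe instead of being staged with cudaMemcpyPeer.
__global__ void stencil1D(const float *mine, const float *rightHalo,
                          float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float right = (i == n - 1) ? rightHalo[0]   // remote (peer) access
                               : mine[i + 1];   // local access
    float left  = (i == 0)     ? mine[i]        // clamped domain boundary
                               : mine[i - 1];
    out[i] = 0.5f * (left + right);
}

int main() {
    const int n = 1 << 20;
    float *chunk[2], *out[2];

    // Enable bidirectional peer access between GPUs 0 and 1 and allocate
    // each GPU's chunk of the decomposed data structure.
    for (int d = 0; d < 2; ++d) {
        int can = 0;
        cudaDeviceCanAccessPeer(&can, d, 1 - d);
        if (!can) { fprintf(stderr, "no peer access\n"); return 1; }
        cudaSetDevice(d);
        cudaDeviceEnablePeerAccess(1 - d, 0);
        cudaMalloc(&chunk[d], n * sizeof(float));
        cudaMalloc(&out[d],   n * sizeof(float));
    }

    // GPU 0 computes its chunk; its rightmost stencil point dereferences
    // GPU 1's memory directly, with no explicit host-side communication.
    cudaSetDevice(0);
    stencil1D<<<(n + 255) / 256, 256>>>(chunk[0], chunk[1], out[0], n);
    cudaDeviceSynchronize();
    return 0;
}
```

Note that the only host-side "communication" code is the one-time `cudaDeviceEnablePeerAccess` call per device pair; eliminating per-iteration halo exchanges from the host code is the kind of simplification the abstract's SLOC reduction refers to.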