ABSTRACT
Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability by including better GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed by programmers, and multi-GPU systems are still programmed like distributed systems in which each GPU only accesses its own memory. This increases the complexity of the host code, as programmers need to explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines to program multi-GPU systems like NUMA shared memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories, and data that resides on a different GPU is accessed remotely through the PCIe interconnect. The programmability benefits of the shared-memory model on GPUs are shown using finite difference and image filtering applications. We also present a detailed performance analysis of the PCIe interconnect and the impact of remote accesses on kernel performance. While PCIe imposes long latency and has limited bandwidth compared to local GPU memory, we show that the highly multithreaded GPU execution model can help hide these costs. Evaluation of finite difference and image filtering GPU-SM implementations shows close to linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g., a 40% SLOC reduction in the host code of finite difference).
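For readers unfamiliar with peer memory access, the minimal CUDA sketch below illustrates the mechanism the abstract describes: once peer access is enabled between two devices, a kernel running on one GPU can dereference a pointer into the other GPU's memory directly over PCIe, so no explicit halo copy is needed. This is only a sketch of the general technique, not the paper's implementation: the kernel name `stencil1D`, the two-GPU decomposition, and all buffer names are hypothetical, and the code assumes two peer-capable GPUs on the same PCIe root complex while omitting data initialization and most error checking.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 1D stencil: each GPU owns one chunk of the domain. The
// element just past a chunk's right edge is read directly from the peer
// GPU's memory over PCIe instead of being staged with cudaMemcpyPeer.
__global__ void stencil1D(const float *mine, const float *rightHalo,
                          float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float right = (i == n - 1) ? rightHalo[0]   // remote (peer) access
                               : mine[i + 1];   // local access
    float left  = (i == 0)     ? mine[i]        // clamped domain boundary
                               : mine[i - 1];
    out[i] = 0.5f * (left + right);
}

int main() {
    const int n = 1 << 20;
    float *chunk[2], *out[2];

    // Enable bidirectional peer access between GPUs 0 and 1 and allocate
    // each GPU's chunk of the decomposed data structure.
    for (int d = 0; d < 2; ++d) {
        int can = 0;
        cudaDeviceCanAccessPeer(&can, d, 1 - d);
        if (!can) { fprintf(stderr, "no peer access\n"); return 1; }
        cudaSetDevice(d);
        cudaDeviceEnablePeerAccess(1 - d, 0);
        cudaMalloc(&chunk[d], n * sizeof(float));
        cudaMalloc(&out[d],   n * sizeof(float));
    }

    // GPU 0 computes its chunk; its rightmost stencil point dereferences
    // GPU 1's memory directly, with no explicit host-side communication.
    cudaSetDevice(0);
    stencil1D<<<(n + 255) / 256, 256>>>(chunk[0], chunk[1], out[0], n);
    cudaDeviceSynchronize();
    return 0;
}
```

Note that the only host-side "communication" code is the one-time `cudaDeviceEnablePeerAccess` call per device pair; eliminating per-iteration halo exchanges from the host code is the kind of simplification the abstract's SLOC reduction refers to.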