DOI: 10.1145/2716282.2716286

GPU-SM: shared memory multi-GPU programming

Published: 07 February 2015

ABSTRACT

Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability with faster GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed among programmers, and multi-GPU systems are still programmed like distributed systems in which each GPU accesses only its own memory. This increases the complexity of the host code, as programmers must explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines for programming multi-GPU systems like NUMA shared-memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories, and data that resides on a different GPU is accessed remotely through the PCIe interconnect. We demonstrate the programmability benefits of the shared-memory model on GPUs using finite difference and image filtering applications. We also present a detailed performance analysis of the PCIe interconnect and of the impact of remote accesses on kernel performance. While PCIe imposes long latencies and offers limited bandwidth compared to local GPU memory, we show that the highly multithreaded GPU execution model can help reduce these costs. Evaluation of the GPU-SM implementations of finite difference and image filtering shows close-to-linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g., a 40% SLOC reduction in the host code of finite difference).
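
The remote-access mechanism the abstract relies on is exposed in CUDA as peer-to-peer (P2P) access. Below is a minimal sketch of that mechanism, assuming a system with at least two PCIe-connected GPUs; the scale kernel, array size, and device numbering are illustrative assumptions, and this is not the paper's GPU-SM code. Once peer access is enabled, a kernel launched on GPU 0 can directly dereference a pointer allocated on GPU 1, every load and store traveling over PCIe.

    // Sketch of CUDA peer-to-peer access, the hardware mechanism GPU-SM
    // builds on. Not the paper's code; names and sizes are illustrative.
    // Compile with, e.g.: nvcc -o p2p p2p.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *remote, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            remote[i] *= factor;   // load + store travel over PCIe to GPU 1
    }

    int main() {
        const int n = 1 << 20;

        cudaSetDevice(1);                  // allocate the array on GPU 1
        float *buf;
        cudaMalloc(&buf, n * sizeof(float));
        cudaMemset(buf, 0, n * sizeof(float));

        cudaSetDevice(0);                  // the kernel will run on GPU 0
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);
        if (!canAccess) {
            fprintf(stderr, "no P2P path between GPU 0 and GPU 1\n");
            return 1;
        }
        cudaDeviceEnablePeerAccess(1, 0);  // map GPU 1's memory into GPU 0

        // No explicit cudaMemcpy staging: GPU 0 dereferences GPU 1's
        // pointer directly.
        scale<<<(n + 255) / 256, 256>>>(buf, 2.0f, n);
        cudaDeviceSynchronize();

        cudaSetDevice(1);
        cudaFree(buf);
        return 0;
    }

In a GPU-SM-style decomposition, each GPU would own one partition of such an array and dereference its neighbors' partitions the same way; this is what eliminates the explicit host-side copy code and yields the SLOC reduction reported in the abstract.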


Published in
GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs
February 2015, 120 pages
ISBN: 9781450334075
DOI: 10.1145/2716282
Copyright © 2015 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article

Acceptance rate: 57 of 129 submissions (44%)
