Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading

Authors:

Barbara Chapman

LLVM-HPC'17: Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC

Article No.: 6, Pages 1 - 10

https://doi.org/10.1145/3148173.3148184

Published: 12 November 2017

Abstract

The latest OpenMP standard offers automatic device offloading capabilities that facilitate GPU programming. Despite this, many challenges remain. One of them is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for a unified memory space. In such systems, the CPU and GPU can access each other's memory transparently; that is, data movement is managed automatically by the underlying system software and hardware. Memory oversubscription is also possible in these systems. However, little is known about how this mechanism performs or how programmers should use it. We have modified several benchmark codes in the Rodinia benchmark suite to study the behavior of OpenMP accelerator extensions, and have used them to explore the impact of unified memory in an OpenMP context. We also modified the open source LLVM compiler to allow OpenMP programs to exploit unified memory. The results of our evaluation reveal that the performance of unified memory is comparable with that of normal GPU offloading for benchmarks with little data reuse, but that unified memory suffers from significant overhead when GPU memory is oversubscribed for benchmarks with a large amount of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.
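To make the interaction between data reuse and oversubscription concrete, the following toy model counts host-to-GPU page migrations for three access patterns. It is only an illustrative sketch and not the paper's methodology: the least-recently-used eviction policy, the page counts, and the names `count_migrations` and `sweep` are all assumptions introduced here.

```python
from collections import OrderedDict


def count_migrations(accesses, gpu_pages):
    """Count host-to-GPU page migrations for a sequence of page accesses.

    Assumption (not from the paper): the GPU holds at most `gpu_pages`
    pages and evicts the least recently used page when it is full.
    """
    resident = OrderedDict()  # page id -> True, ordered by recency of use
    migrations = 0
    for page in accesses:
        if page in resident:
            # Page is already on the GPU: refresh its recency, no transfer.
            resident.move_to_end(page)
        else:
            # Page fault: migrate the page, evicting the LRU page if full.
            migrations += 1
            if len(resident) >= gpu_pages:
                resident.popitem(last=False)
            resident[page] = True
    return migrations


def sweep(num_pages, passes):
    """Access pages 0..num_pages-1 in order, repeated `passes` times."""
    return [page for _ in range(passes) for page in range(num_pages)]


GPU_PAGES = 8  # hypothetical GPU capacity, in pages

# Little data reuse: one pass over a data set twice the GPU capacity.
print(count_migrations(sweep(16, 1), GPU_PAGES))   # 16
# Heavy reuse, data fits on the GPU: only the first pass migrates pages.
print(count_migrations(sweep(8, 10), GPU_PAGES))   # 8
# Heavy reuse, oversubscribed: every access migrates a page.
print(count_migrations(sweep(16, 10), GPU_PAGES))  # 160
```

In this model a single streaming pass costs one migration per page whether or not the data fits, whereas repeated sweeps over an oversubscribed data set evict each page before it is reused. That is qualitatively the pattern the abstract reports, although real drivers and hardware use more elaborate migration and eviction policies.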


Published In

LLVM-HPC'17: Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC

November 2017

106 pages

ISBN: 9781450355650

DOI: 10.1145/3148173

Copyright © 2017 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SC '17

Sponsor:

SIGHPC

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2017

Denver, CO, USA

Acceptance Rates

LLVM-HPC'17 Paper Acceptance Rate 9 of 10 submissions, 90%;

Overall Acceptance Rate 16 of 22 submissions, 73%
