A framework for dynamically instrumenting GPU compute applications within GPU Ocelot

Authors:

Naila Farooqui, Gregory Diamos, S. Yalamanchili, K. Schwan

GPGPU-4: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units

Article No.: 9, Pages 1 - 9

https://doi.org/10.1145/1964179.1964192

Published: 05 March 2011

Abstract

In this paper we present the design and implementation of a dynamic instrumentation infrastructure for PTX programs that procedurally transforms kernels and manages related data structures. We show how performing instrumentation within the GPU Ocelot dynamic compiler infrastructure provides unique capabilities not available to other profiling and instrumentation toolchains for GPU computing. We demonstrate the utility of this instrumentation capability with three example scenarios: (1) performing workload characterization accelerated by a GPU, (2) providing load imbalance information for use by a resource allocator, and (3) providing compute utilization feedback to be used online by a simulated process scheduler that might be found in a hypervisor. Additionally, we measure both (1) the compilation overheads of performing dynamic compilation and (2) the increases in runtimes when executing instrumented kernels. On average, compilation overheads due to instrumentation amounted to 69% of the time needed to parse a kernel module for the Parboil benchmark suite. Slowdowns for instrumenting each basic block ranged from 1.5x to 5.5x, with the largest slowdowns attributed to kernels with large numbers of short, compute-bound blocks.
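
As a rough illustration of what instrumenting each basic block involves, the sketch below rewrites a toy kernel so that every block bumps a counter when it runs, then reads the counts back. It is not GPU Ocelot's API and does not operate on real PTX: `Block`, `instrument`, `run`, `loop_next`, and `KERNEL` are hypothetical names introduced here, instruction strings are opaque, and threads are simulated one after another in plain Python.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Block:
    """A toy basic block (hypothetical stand-in for a PTX basic block)."""
    label: str
    body: List[object]  # opaque instructions; only ("count", label) is interpreted
    next_label: Callable[[dict], Optional[str]]  # thread state -> next block, None = exit


def instrument(blocks):
    """Return a transformed copy with a counter bump prepended to every block."""
    return [Block(b.label, [("count", b.label)] + list(b.body), b.next_label)
            for b in blocks]


def run(blocks, num_threads):
    """Run the toy kernel once per simulated thread and collect per-block counts."""
    by_label = {b.label: b for b in blocks}
    counts = Counter()
    for tid in range(num_threads):
        state = {"tid": tid, "i": 0}
        label = blocks[0].label
        while label is not None:
            block = by_label[label]
            for instr in block.body:
                # Only the inserted counter instruction has an effect in this sketch.
                if isinstance(instr, tuple) and instr[0] == "count":
                    counts[instr[1]] += 1
            label = block.next_label(state)
    return counts


def loop_next(state):
    """Loop back until thread `tid` has executed the loop body tid + 1 times."""
    state["i"] += 1
    return "loop" if state["i"] <= state["tid"] else "exit"


# Assumed example kernel: the loop trip count depends on the thread id, so
# threads do unequal amounts of work.
KERNEL = [
    Block("entry", ["mov r0, tid"], lambda s: "loop"),
    Block("loop", ["add r1, r1, 1"], loop_next),
    Block("exit", ["ret"], lambda s: None),
]

counts = run(instrument(KERNEL), num_threads=4)
print(dict(counts))  # {'entry': 4, 'loop': 10, 'exit': 4}
```

In this toy, the uneven `loop` count across threads is the kind of signal a load-imbalance report could be built from, and the extra work added to every block hints at why kernels made of many short blocks would see the largest relative slowdown.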

Published In

GPGPU-4: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units

March 2011

101 pages

ISBN:9781450305693

DOI:10.1145/1964179

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

GPGPU-4

GPGPU-4: Fourth Workshop on General Purpose Processing on Graphics Processing Units

March 5, 2011

Newport Beach, California, USA

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%
