Abstract
To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu. This paper presents SASSI (NVIDIA assembly code "SASS" Instrumentor), a low-level assembly-language instrumentation tool for GPUs. Like CPU binary instrumentation tools, SASSI allows a user to specify instructions at which to inject user-provided instrumentation code. These facilities allow strategic placement of counters and code into GPU assembly code to collect user-directed, fine-grained statistics at hardware speeds. SASSI instrumentation is inherently parallel, leveraging the concurrency of the underlying hardware. In addition to the details of SASSI, this paper provides four case studies that show how SASSI can be used to characterize applications and explore the architecture design space along the dimensions of instruction control flow, memory systems, value similarity, and resilience.