Flexible software profiling of GPU architectures

Published: 13 June 2015

Abstract

To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu. This paper presents SASSI (NVIDIA assembly code "SASS" Instrumentor), a low-level assembly-language instrumentation tool for GPUs. Like CPU binary instrumentation tools, SASSI allows a user to specify instructions at which to inject user-provided instrumentation code. These facilities allow strategic placement of counters and code into GPU assembly code to collect user-directed, fine-grained statistics at hardware speeds. SASSI instrumentation is inherently parallel, leveraging the concurrency of the underlying hardware. In addition to the details of SASSI, this paper provides four case studies that show how SASSI can be used to characterize applications and explore the architecture design space along the dimensions of instruction control flow, memory systems, value similarity, and resilience.

        • Published in

          ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S
          ISCA'15
          June 2015
          745 pages
          ISSN: 0163-5964
          DOI: 10.1145/2872887
          • ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
            June 2015
            768 pages
            ISBN: 9781450334020
            DOI: 10.1145/2749469

          Copyright © 2015 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • research-article
