skip to main content
10.1145/1513895.1513903acmotherconferencesArticle/Chapter ViewAbstractPublication PagesgpgpuConference Proceedingsconference-collections
research-article

Architecture-aware optimization targeting multithreaded stream computing

Published:08 March 2009Publication History

ABSTRACT

Optimizing program execution targeted for Graphics Processing Units (GPUs) can be very challenging. Our ability to efficiently map serial code to a GPU or stream processing platform is a time consuming task and is greatly hampered by a lack of detail about the underlying hardware. Programmers are left to attempt trial and error to produce optimized codes.

Recent publication of the underlying instruction set architecture (ISA) of the AMD/ATI GPU has allowed researchers to begin to propose aggressive optimizations. In this work, we present an optimization methodology that utilizes this information to accelerate programs on AMD/ATI GPUs. We start by defining optimization spaces that guide our work. We begin with disassembled machine code and collect program statistics provided by the AMD Graphics Shader Analyzer (GSA) profiling toolset. We explore optimizations targeting three different computing resources: 1) ALUs, 2) fetch bandwidth, and 3) thread usage, and present optimization techniques that consider how to better utilize each resource.

We demonstrate the effectiveness of our proposed optimization approach on an AMD Radeon HD3870 GPU using the Brook+ stream programming language. We describe our optimizations using two commonly-used GPGPU applications that present very different program characteristics and optimization spaces: matrix multiplication and back-projection for medical image reconstruction. Our results show that optimized code can improve performance by 1.45x--6.7x as compared to unoptimized code run on the same GPU platform. The speedup obtained with our optimized implementations are 882x (matrix multiply) and 19x (back-projection) faster as compared with serial implementations run on an Intel 2.66 GHz Core 2 Duo with a 2 GB main memory.

References

  1. AMD. Brook+ Programming Guide, V 1.1 Beta, Brook+ SDK.Google ScholarGoogle Scholar
  2. AMD. R600 Assembly Language Document, Brook+ SDK, 2007.Google ScholarGoogle Scholar
  3. AMD. R600-Family Instruction Set Architecture, Revision 0.31, 2007.Google ScholarGoogle Scholar
  4. AMD. HW Guide, Brook+ SDK, 2008.Google ScholarGoogle Scholar
  5. A. Andersen and A. Kak. Simultaneous algebraic reconstruction technique (SART): a superior implementation of the art algorithm. Ultrason Imaging, 6(1):81--94, 1984.Google ScholarGoogle ScholarCross RefCross Ref
  6. I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pages 777--786, New York, NY, USA, 2004. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Do, Z. Liang, W. Karl, T. Brady, and H. Pien. A projection-driven pre-correction technique for iterative reconstruction of helical cone-beam cardiac CT images. In Proceedings of SPIE, volume 6913, page 69132U. SPIE, 2008.Google ScholarGoogle Scholar
  8. K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication. In Graphics Hardware, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. GPGPU Website. www.gpgpu.org.Google ScholarGoogle Scholar
  10. K. Kennedy and J. R. Allen. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. D. Luebke, M. Harris, J. Krüger, T. Purcell, N. Govindaraju, I. Buck, C. Woolley, and A. Lefohn. GPGPU: general purpose computation on graphics hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Course Notes, page 33, New York, NY, USA, 2004. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU Computing. In Proceedings of the IEEE, volume 96, pages 879--899, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  13. S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W. mei W. Hwu. Program optimization space pruning for a multithreaded GPU. In CGO '08: Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, pages 195--204, New York, NY, USA, 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Silberstein, A. Schuster, D. Geiger, A. Patney, and J. D. Owens. Efficient computation of sum-products on GPUs through software-managed cache. In ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, pages 309--318, New York, NY, USA, 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. B. Thibault, K. D. Sauer, C. A. Bouman, and J. Hsieh. A Three-dimensional Statistical Approach to Improved Image Quality for Multislice Helical CT. Med. Physics, 34(11):4526--44, 2007.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Architecture-aware optimization targeting multithreaded stream computing

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
        March 2009
        107 pages
        ISBN:9781605585178
        DOI:10.1145/1513895

        Copyright © 2009 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 8 March 2009

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate57of129submissions,44%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader