Locality-Driven Dynamic GPU Cache Bypassing

Published: 08 June 2015

ABSTRACT

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures such as GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth, low-latency data access. However, the large number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that for many applications the memory access streams reaching the L1 D-cache contain a significant number of requests with low reuse, which greatly reduce cache efficacy. Existing GPU cache management schemes are either conditional/reactive solutions or hit-rate-based designs developed specifically for CPU last-level caches, which can limit overall performance.
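To make the low-reuse observation concrete, the following is a small, hypothetical trace analysis (a C++ sketch, not taken from the paper): it computes per-line reuse distances over a toy address stream. A line that is never re-referenced, or re-referenced only after many distinct intervening lines, is exactly the kind of request that pollutes a small L1 D-cache. The trace contents and the 128-byte line size are assumptions for illustration.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <unordered_set>
#include <vector>

int main() {
    // Hypothetical L1 access trace (byte addresses); 128-byte cache lines assumed.
    const std::vector<uint64_t> trace = {0x100, 0x180, 0x200, 0x100, 0x280, 0x180};

    std::unordered_map<uint64_t, std::size_t> last_use;  // line -> index of last access
    for (std::size_t i = 0; i < trace.size(); ++i) {
        const uint64_t line = trace[i] >> 7;  // 128 B line granularity
        const auto it = last_use.find(line);
        if (it == last_use.end()) {
            std::cout << "line 0x" << std::hex << line << std::dec
                      << ": first touch, no reuse observed yet\n";
        } else {
            // Reuse distance: number of distinct lines touched between two
            // consecutive accesses to the same line. A short distance means
            // the line would likely still be resident in a small L1; a long
            // one means caching it mostly wastes capacity.
            std::unordered_set<uint64_t> distinct;
            for (std::size_t j = it->second + 1; j < i; ++j)
                distinct.insert(trace[j] >> 7);
            std::cout << "line 0x" << std::hex << line << std::dec
                      << ": reuse distance " << distinct.size() << "\n";
        }
        last_use[line] = i;
    }
    return 0;
}
```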

To overcome these challenges, we propose an efficient locality monitoring mechanism that dynamically filters the access stream at cache insertion so that only data with high reuse and short reuse distances is stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering, based on the reuse characteristics of GPU workloads, into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that the proposed design dramatically reduces cache contention, improving performance by up to 56.8% (30.3% on average) over the baseline architecture for a range of highly optimized, cache-unfriendly applications, with minor area overhead and better energy efficiency. The design also significantly outperforms state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications) without generating extra L2- or DRAM-level contention.
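As a rough illustration of the insertion-time filtering idea, the sketch below models a tag store that is decoupled from the data store: a tag with a small saturating reuse counter is kept even for lines whose data is not resident, so reuse can be observed before a data-store entry is spent. This is a minimal C++ sketch under assumed parameters, not the paper's exact hardware design; the names (`BypassingL1`, `TagEntry`), the promotion threshold, the counter width, and the unbounded tag map are all illustrative assumptions.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

struct TagEntry {
    uint8_t reuse_count = 0;        // saturating count of observed re-references
    bool    data_resident = false;  // is a data-store line allocated for this tag?
};

class BypassingL1 {
    std::unordered_map<uint64_t, TagEntry> tags_;    // decoupled tag store
    static constexpr uint8_t kPromoteThreshold = 1;  // assumed: promote after 1 reuse

public:
    // Returns true if the access should be served/allocated in the L1 data
    // store, false if it should bypass the L1 and go directly to L2.
    bool access(uint64_t line_addr) {
        TagEntry& e = tags_[line_addr];
        if (e.data_resident) return true;                // ordinary L1 hit
        if (e.reuse_count >= kPromoteThreshold) {
            e.data_resident = true;                      // reuse proven: allocate
            return true;
        }
        if (e.reuse_count < UINT8_MAX) ++e.reuse_count;  // remember the touch
        return false;                                    // low reuse so far: bypass
    }
};

int main() {
    BypassingL1 l1;
    std::cout << l1.access(0x2) << '\n';  // 0: first touch bypasses to L2
    std::cout << l1.access(0x2) << '\n';  // 1: reuse observed, line allocated
    std::cout << l1.access(0x3) << '\n';  // 0: streaming line keeps bypassing
    return 0;
}
```

Under such a policy, a streaming line touched once never occupies a data-store entry, while a line re-referenced shortly after its first touch is promoted into the cache, matching the abstract's goal of retaining only data with high reuse and short reuse distances.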

Published in

ICS '15: Proceedings of the 29th ACM International Conference on Supercomputing
June 2015, 446 pages
ISBN: 9781450335591
DOI: 10.1145/2751205

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 8 June 2015

Qualifiers

research-article

Acceptance Rates

ICS '15 paper acceptance rate: 40 of 160 submissions, 25%. Overall acceptance rate: 584 of 2,055 submissions, 28%.
