ABSTRACT
This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures such as GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth, low-latency data access. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that for many applications the memory access streams to L1 D-caches contain a significant number of requests with low reuse, which greatly reduces cache efficacy. Existing GPU cache management schemes are either conditional/reactive solutions or hit-rate-based designs developed specifically for CPU last-level caches, which can limit overall performance.
To overcome these challenges, we propose an efficient locality monitoring mechanism that dynamically filters the access stream at cache insertion time, so that only data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering, based on the reuse characteristics of GPU workloads, into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design dramatically reduces cache contention, achieving up to 56.8% and an average of 30.3% performance improvement over the baseline architecture for a range of highly optimized, cache-unfriendly applications, with minor area overhead and better energy efficiency. Our design also significantly outperforms state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra contention at the L2 and DRAM levels.
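The insertion-time filtering idea described above can be sketched in software. The following toy model is an illustrative assumption, not the paper's actual hardware design: it tracks recently observed block addresses in a small, LRU-bounded monitor table and admits a block into the cache only if its observed reuse distance falls below a threshold; blocks with no observed reuse, or with long reuse distances, bypass the cache. The class name, table size, and threshold are all hypothetical.

```python
from collections import OrderedDict

class LocalityFilter:
    """Toy sketch of insertion-time bypass: admit a block into the cache
    only when its observed reuse distance is short. Illustrative only;
    the paper's design uses the L1 D-cache's decoupled tag store."""

    def __init__(self, monitor_size=64, max_reuse_distance=16):
        self.monitor = OrderedDict()      # block address -> last access time
        self.monitor_size = monitor_size  # entries in the monitor table
        self.max_reuse_distance = max_reuse_distance
        self.time = 0                     # global access counter

    def should_insert(self, block_addr):
        """Return True to cache the block, False to bypass the L1."""
        self.time += 1
        last = self.monitor.get(block_addr)

        # Record this access, evicting the least-recently-seen entry
        # once the monitor table is full.
        self.monitor[block_addr] = self.time
        self.monitor.move_to_end(block_addr)
        if len(self.monitor) > self.monitor_size:
            self.monitor.popitem(last=False)

        if last is None:
            return False  # no observed reuse yet -> bypass
        return (self.time - last) <= self.max_reuse_distance
```

In this sketch, low-reuse streaming accesses never see a second touch inside the monitor window and therefore bypass, while blocks re-referenced within the threshold are admitted; a hardware realization would approximate the same decision with per-entry counters rather than timestamps.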
Index Terms
- Locality-Driven Dynamic GPU Cache Bypassing