ABSTRACT
This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures such as GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth, low-latency data access. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that for many applications the memory access streams to L1 D-caches contain a significant number of requests with low reuse, which greatly reduces cache efficacy. Existing GPU cache management schemes are either conditional/reactive solutions or hit-rate-based designs developed specifically for CPU last-level caches, which can limit overall performance.
To overcome these challenges, we propose an efficient locality monitoring mechanism that dynamically filters the access stream at cache insertion time, so that only data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering, based on the reuse characteristics of GPU workloads, into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design dramatically reduces cache contention, achieving up to 56.8% and an average of 30.3% performance improvement over the baseline architecture for a range of highly optimized, cache-unfriendly applications, with minor area overhead and better energy efficiency. Our design also significantly outperforms state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra contention at the L2 and DRAM levels.
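The insertion-time filtering idea described above can be sketched in software. The following toy model is an illustrative assumption, not the paper's actual hardware design: it tracks recently observed block addresses in a small, LRU-bounded monitor table and admits a block into the cache only if its observed reuse distance falls below a threshold; blocks with no observed reuse, or with long reuse distances, bypass the cache. The class name, table size, and threshold are all hypothetical.

```python
from collections import OrderedDict

class LocalityFilter:
    """Toy sketch of insertion-time bypass: admit a block into the cache
    only when its observed reuse distance is short. Illustrative only;
    the paper's design uses the L1 D-cache's decoupled tag store."""

    def __init__(self, monitor_size=64, max_reuse_distance=16):
        self.monitor = OrderedDict()      # block address -> last access time
        self.monitor_size = monitor_size  # entries in the monitor table
        self.max_reuse_distance = max_reuse_distance
        self.time = 0                     # global access counter

    def should_insert(self, block_addr):
        """Return True to cache the block, False to bypass the L1."""
        self.time += 1
        last = self.monitor.get(block_addr)

        # Record this access, evicting the least-recently-seen entry
        # once the monitor table is full.
        self.monitor[block_addr] = self.time
        self.monitor.move_to_end(block_addr)
        if len(self.monitor) > self.monitor_size:
            self.monitor.popitem(last=False)

        if last is None:
            return False  # no observed reuse yet -> bypass
        return (self.time - last) <= self.max_reuse_distance
```

In this sketch, low-reuse streaming accesses never see a second touch inside the monitor window and therefore bypass, while blocks re-referenced within the threshold are admitted; a hardware realization would approximate the same decision with per-entry counters rather than timestamps.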
Index Terms
- Locality-Driven Dynamic GPU Cache Bypassing