ABSTRACT
The long latency of memory operations is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache that must be shared across dozens of warps (collections of threads) creates significant cache contention and premature data eviction. Prior work has recognized this problem and proposed warp throttling, which reduces the number of active warps contending for cache space. In this paper we discover that individual load instructions in a warp exhibit four different types of data locality behavior: (1) data brought in by a warp load instruction is used only once, classified as streaming data; (2) data brought in by a warp load is reused multiple times within the same warp, called intra-warp locality; (3) data brought in by a warp is reused multiple times but across different warps, called inter-warp locality; and (4) some data exhibit a mix of both intra- and inter-warp locality. Furthermore, each load instruction consistently exhibits the same locality type across all warps within a GPU kernel. Based on this discovery we argue that cache management must use per-load locality type information rather than warp-wide cache management policies. We propose Access Pattern-aware Cache Management (APCM), which dynamically detects the locality type of each load instruction by monitoring the accesses of one exemplary warp. APCM then uses the detected locality type to selectively apply cache bypassing and cache pinning of data based on the load locality characterization. Using an extensive set of simulations we show that APCM improves the performance of GPUs by 34% for cache-sensitive applications while saving 27% of energy consumption over the baseline GPU.
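The four locality types named in the abstract can be illustrated with a small offline classifier. The sketch below is not the paper's hardware mechanism (APCM detects locality at runtime with dedicated monitoring logic); it is a minimal trace-analysis analogue, assuming a hypothetical trace format of `(warp_id, load_pc, cache_line_addr)` tuples, that assigns each static load (identified by its PC) one of the four types.

```python
from collections import defaultdict

def classify_loads(trace):
    """Classify each static load (by PC) into one of the four locality
    types from the abstract: streaming, intra-warp, inter-warp, or mixed.

    `trace` is an iterable of (warp_id, load_pc, cache_line_addr) tuples;
    this trace format is an assumption for illustration only.
    """
    warps_per_line = defaultdict(set)   # (pc, line) -> warps that touched it
    hits_per_line = defaultdict(int)    # (pc, line) -> total access count
    lines_per_pc = defaultdict(set)     # pc -> cache lines it fetched

    for warp, pc, line in trace:
        warps_per_line[(pc, line)].add(warp)
        hits_per_line[(pc, line)] += 1
        lines_per_pc[pc].add(line)

    result = {}
    for pc, lines in lines_per_pc.items():
        intra = inter = False
        for line in lines:
            key = (pc, line)
            if hits_per_line[key] > 1:
                # Reused by more than one warp -> inter-warp locality.
                if len(warps_per_line[key]) > 1:
                    inter = True
                # More accesses than distinct warps means, by pigeonhole,
                # some warp touched the line twice -> intra-warp locality.
                if hits_per_line[key] > len(warps_per_line[key]):
                    intra = True
        if intra and inter:
            result[pc] = "mixed"
        elif intra:
            result[pc] = "intra-warp"
        elif inter:
            result[pc] = "inter-warp"
        else:
            result[pc] = "streaming"    # every line fetched exactly once
    return result
```

A load classified as streaming would be a natural candidate for cache bypassing, while intra-warp loads would benefit from pinning, matching the per-load policies the abstract describes.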