ABSTRACT
The long latency of memory operations is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache that must be shared across dozens of warps (collections of threads) creates significant cache contention and premature data eviction. Prior work has recognized this problem and proposed warp throttling, which reduces the number of active warps contending for cache space. In this paper we discover that individual load instructions in a warp exhibit four different types of data locality behavior: (1) data brought in by a warp load instruction is used only once, classified as streaming data; (2) data brought in by a warp load is reused multiple times within the same warp, called intra-warp locality; (3) data brought in by a warp is reused multiple times but across different warps, called inter-warp locality; and (4) some data exhibit a mix of both intra- and inter-warp locality. Furthermore, each load instruction consistently exhibits the same locality type across all warps within a GPU kernel. Based on this discovery we argue that cache management must use per-load locality type information rather than warp-wide cache management policies. We propose Access Pattern-aware Cache Management (APCM), which dynamically detects the locality type of each load instruction by monitoring the accesses of one exemplary warp. APCM then uses the detected locality type to selectively apply cache bypassing and cache pinning of data based on the load locality characterization. Using an extensive set of simulations we show that APCM improves the performance of GPUs by 34% for cache-sensitive applications while saving 27% of energy consumption over the baseline GPU.
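The four locality types named in the abstract can be illustrated with a small offline classifier. The sketch below is not the paper's hardware mechanism (APCM detects locality at runtime with dedicated monitoring logic); it is a minimal trace-analysis analogue, assuming a hypothetical trace format of `(warp_id, load_pc, cache_line_addr)` tuples, that assigns each static load (identified by its PC) one of the four types.

```python
from collections import defaultdict

def classify_loads(trace):
    """Classify each static load (by PC) into one of the four locality
    types from the abstract: streaming, intra-warp, inter-warp, or mixed.

    `trace` is an iterable of (warp_id, load_pc, cache_line_addr) tuples;
    this trace format is an assumption for illustration only.
    """
    warps_per_line = defaultdict(set)   # (pc, line) -> warps that touched it
    hits_per_line = defaultdict(int)    # (pc, line) -> total access count
    lines_per_pc = defaultdict(set)     # pc -> cache lines it fetched

    for warp, pc, line in trace:
        warps_per_line[(pc, line)].add(warp)
        hits_per_line[(pc, line)] += 1
        lines_per_pc[pc].add(line)

    result = {}
    for pc, lines in lines_per_pc.items():
        intra = inter = False
        for line in lines:
            key = (pc, line)
            if hits_per_line[key] > 1:
                # Reused by more than one warp -> inter-warp locality.
                if len(warps_per_line[key]) > 1:
                    inter = True
                # More accesses than distinct warps means, by pigeonhole,
                # some warp touched the line twice -> intra-warp locality.
                if hits_per_line[key] > len(warps_per_line[key]):
                    intra = True
        if intra and inter:
            result[pc] = "mixed"
        elif intra:
            result[pc] = "intra-warp"
        elif inter:
            result[pc] = "inter-warp"
        else:
            result[pc] = "streaming"    # every line fetched exactly once
    return result
```

A load classified as streaming would be a natural candidate for cache bypassing, while intra-warp loads would benefit from pinning, matching the per-load policies the abstract describes.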