DOI: 10.1145/3079856.3080239 · Research Article · Public Access

Access Pattern-Aware Cache Management for Improving Data Utilization in GPU

Published: 24 June 2017

ABSTRACT

The long latency of memory operations is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache, which must be shared across dozens of warps (collections of threads), suffers significant contention and premature data eviction. Prior works have recognized this problem and proposed warp throttling, which reduces the number of active warps contending for cache space. In this paper we discover that individual load instructions in a warp exhibit four different types of data locality behavior: (1) data brought in by a warp load instruction is used only once, which we classify as streaming data; (2) data brought in by a warp load is reused multiple times within the same warp, called intra-warp locality; (3) data brought in by a warp is reused multiple times but across different warps, called inter-warp locality; and (4) some data exhibits a mix of both intra- and inter-warp locality. Furthermore, each load instruction consistently exhibits the same locality type across all warps within a GPU kernel. Based on this discovery we argue that cache management must use per-load locality type information rather than warp-wide cache management policies. We propose Access Pattern-aware Cache Management (APCM), which dynamically detects the locality type of each load instruction by monitoring the accesses from one exemplary warp. APCM then uses the detected locality type to selectively apply cache bypassing and cache pinning of data based on each load's locality characterization. Using an extensive set of simulations we show that APCM improves the performance of GPUs by 34% for cache-sensitive applications while saving 27% of energy consumption over a baseline GPU.
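To make the four locality categories concrete, the sketch below classifies each load instruction (identified by its PC) from a trace of monitored cache accesses. This is an illustrative software model only, not the paper's hardware mechanism: the trace format `(pc, warp_id, cache_line)`, the function name, and the classification rules are assumptions chosen to mirror the definitions in the abstract.

```python
# Illustrative sketch (assumption, not APCM's actual hardware): classify
# each load PC as streaming, intra-warp, inter-warp, or mixed, based on
# how the cache lines it fetches are reused across warps.
from collections import defaultdict

def classify_loads(accesses):
    """accesses: iterable of (pc, warp_id, cache_line) tuples observed
    while monitoring. Returns {pc: locality_type_string}."""
    warps_per_line = defaultdict(lambda: defaultdict(set))  # pc -> line -> {warps}
    hits_per_line = defaultdict(lambda: defaultdict(int))   # pc -> line -> count
    for pc, warp, line in accesses:
        warps_per_line[pc][line].add(warp)
        hits_per_line[pc][line] += 1

    result = {}
    for pc, lines in warps_per_line.items():
        # A line reused within one warp: total accesses exceed distinct warps.
        intra = any(hits_per_line[pc][l] > len(lines[l]) for l in lines)
        # A line touched by more than one warp: inter-warp reuse.
        inter = any(len(lines[l]) > 1 for l in lines)
        if intra and inter:
            result[pc] = "mixed"
        elif intra:
            result[pc] = "intra-warp"
        elif inter:
            result[pc] = "inter-warp"
        else:
            result[pc] = "streaming"   # every line used exactly once
    return result
```

Under APCM's per-load policy described in the abstract, a "streaming" load would be a candidate for cache bypassing, while "intra-warp" or "inter-warp" loads would be candidates for pinning their data in the cache.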


Published in

ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture
June 2017, 736 pages
ISBN: 9781450348928
DOI: 10.1145/3079856
Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


Acceptance Rates

ISCA '17 paper acceptance rate: 54 of 322 submissions (17%). Overall acceptance rate: 543 of 3,203 submissions (17%).
