Exploiting Dynamic Reuse Probability to Manage Shared Last-level Caches in CPU-GPU Heterogeneous Processors

Research article · DOI: 10.1145/2925426.2926266 · Published: 01 June 2016

Abstract

Recent commercial chip-multiprocessors (CMPs) have integrated CPU as well as GPU cores on the same die. In today's designs, these cores typically share parts of the memory system resources. However, since the CPU and the GPU cores have vastly different resource requirements, challenging resource partitioning problems arise in such heterogeneous CMPs. In one class of designs, the CPU and the GPU cores share the large on-die last-level SRAM cache. In this paper, we explore mechanisms to dynamically allocate the shared last-level cache space to the CPU and GPU applications in such designs. A CPU core executes an instruction progressively in a pipeline, generating memory accesses (for instructions and data) in only a few pipeline stages. On the other hand, a GPU can access different data streams having different semantic meanings and disparate access patterns throughout the rendering pipeline. Such data streams include input vertex, pixel depth, pixel color, texture map, shader instructions, shader data (including shader register spills and fills), etc. Without carefully designed last-level cache management policies, the CPU and the GPU data streams can interfere with each other, leading to a significant loss in CPU and GPU performance accompanied by degradation in GPU-rendered 3D animation quality. Our proposal dynamically estimates the reuse probabilities of the GPU streams as well as the CPU data by sampling portions of the CPU and GPU working sets and storing the sampled tags in a small working set sample cache. Since the GPU application working sets are typically very large, for this working set sample cache to be effective, it is custom-designed to have large coverage while requiring only a few tens of kilobytes of storage. We use the estimated reuse probabilities to design shared last-level cache policies for handling hits and misses to reads and writes from both types of cores.
Studies on a detailed heterogeneous CMP simulator show that compared to a state-of-the-art baseline with a 16 MB shared last-level cache, our proposal can improve the performance (frame rate or execution cycles, as applicable) of eighteen GPU workloads spanning DirectX and OpenGL game titles as well as CUDA applications by 12% on average and up to 51% while improving the performance of the co-running quad-core CPU workload mixes by 7% on average and up to 19%.
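
The abstract does not spell out how the working set sample cache is organized, so the following is only a minimal Python sketch of the general idea of estimating per-stream reuse probability from sampled tags. Every name and parameter here (`ReuseSampler`, `insertion_hint`, the 64-byte block size, the one-in-eight sampling rule, the LRU sample store, and the 0.5 threshold) is an assumption made for illustration and is not taken from the paper.

```python
from collections import OrderedDict, defaultdict

BLOCK_BITS = 6        # assumption: 64-byte cache blocks
SAMPLE_PERIOD = 8     # assumption: track one of every 8 block addresses
SAMPLE_CAPACITY = 4   # assumption: tiny capacity so the demo is easy to follow


class ReuseSampler:
    """Hypothetical sample cache that tracks a subset of block tags per stream."""

    def __init__(self, capacity=SAMPLE_CAPACITY, period=SAMPLE_PERIOD):
        self.capacity = capacity
        self.period = period
        # sampled block number -> [stream that filled it, already reused?]
        self.tags = OrderedDict()
        self.fills = defaultdict(int)   # sampled fills per stream
        self.reuses = defaultdict(int)  # sampled fills that saw at least one reuse

    def access(self, stream, address):
        """Observe one access; only sampled blocks update the counters."""
        block = address >> BLOCK_BITS
        if block % self.period != 0:
            return  # block is outside the sampled portion of the working set
        entry = self.tags.get(block)
        if entry is not None:
            owner, reused = entry
            if not reused:
                # Count a fill as reused at most once so the ratio stays in [0, 1].
                self.reuses[owner] += 1
                entry[1] = True
            self.tags.move_to_end(block)  # refresh LRU position
            return
        if len(self.tags) >= self.capacity:
            self.tags.popitem(last=False)  # evict the least recently used tag
        self.tags[block] = [stream, False]
        self.fills[stream] += 1

    def reuse_probability(self, stream):
        """Fraction of this stream's sampled fills that were reused."""
        fills = self.fills.get(stream, 0)
        return self.reuses.get(stream, 0) / fills if fills else 0.0


def insertion_hint(sampler, stream, threshold=0.5):
    """Hypothetical policy hook: favor streams whose blocks tend to be reused."""
    if sampler.reuse_probability(stream) >= threshold:
        return "retain"
    return "evict-early"


if __name__ == "__main__":
    sampler = ReuseSampler()
    for i in range(4):                     # streaming texture reads, no reuse
        sampler.access("texture", i * SAMPLE_PERIOD << BLOCK_BITS)
    sampler.access("cpu", 32 << BLOCK_BITS)
    sampler.access("cpu", 32 << BLOCK_BITS)  # same block again: a reuse
    for name in ("texture", "cpu"):
        print(name, sampler.reuse_probability(name), insertion_hint(sampler, name))
```

A last-level cache controller could consult a hint like this on fills and hits to decide how long a block from a given stream should be retained; the specific policies evaluated in the paper are not reproduced here.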

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CPU-GPU integration
  2. shared last-level cache
  3. temporal reuse

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '16
Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
