Research Article · Public Access
DOI: 10.1145/2925426.2926253

Tag-Split Cache for Efficient GPGPU Cache Utilization

Published: 01 June 2016

Abstract

Modern GPUs employ caches to improve memory system efficiency. However, a large amount of cache space is underutilized due to the irregular memory accesses and poor spatial locality commonly exhibited by GPU applications. Our experiments show that using smaller cache lines can improve cache space utilization, but doing so frequently incurs significant performance loss by introducing a large number of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enables fine-grained cache storage to address cache space underutilization while keeping the number of memory requests unchanged. TSC divides the tag into two parts to reduce storage overhead, and it supports multiple cache line replacements in one cycle. TSC can also automatically adjust the cache storage granularity to avoid performance loss for applications with good spatial locality. Our evaluation shows that TSC improves the baseline cache performance by 17.2% on average across a wide range of applications. It also outperforms previous techniques significantly.
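The abstract's core mechanism — storing data at a finer granularity than the baseline line while letting sectors of the same aligned block share one stored tag — can be illustrated with a deliberately simplified model. The sketch below is not the paper's design: the class names, parameters (`LINE`, `SECTOR`, `SETS`, `WAYS`), the trivial replacement policy, and the presence-bit encoding are all illustrative assumptions reflecting one plausible reading of "dividing the tag into two parts" (a common high tag per frame plus a small per-sector part), and the granularity-adaptation and single-cycle multi-line replacement mechanisms are omitted.

```python
# Hypothetical sketch of a tag-split lookup, based only on the abstract:
# several 32 B sectors of one 128 B-aligned block share a single stored
# "common" tag, so fine-grained storage does not quadruple tag overhead.

LINE = 128          # baseline cache-line size in bytes (illustrative)
SECTOR = 32         # fine-grained storage unit (illustrative)
SECTORS = LINE // SECTOR
SETS, WAYS = 64, 4  # illustrative geometry

class Frame:
    def __init__(self):
        self.tag = None                   # common (high) part of the tag
        self.present = [False] * SECTORS  # split part: one bit per sector

class TagSplitCache:
    def __init__(self):
        self.sets = [[Frame() for _ in range(WAYS)] for _ in range(SETS)]

    def _split(self, addr):
        block = addr // LINE
        return block // SETS, block % SETS, (addr % LINE) // SECTOR

    def access(self, addr):
        """Return True on a hit; on a miss, fill only the needed sector."""
        tag, index, sector = self._split(addr)
        frames = self.sets[index]
        for f in frames:
            if f.tag == tag:
                if f.present[sector]:
                    return True
                f.present[sector] = True  # block tracked, sector absent: fetch one sector
                return False
        victim = frames[0]                # trivial replacement, enough for the sketch
        victim.tag = tag
        victim.present = [False] * SECTORS
        victim.present[sector] = True
        return False
```

A second access to a different sector of the same block misses only on the 32 B sector, not the whole 128 B line, while reusing the already-stored common tag — which is the space-saving effect the abstract describes.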


Cited By

  • Graphfire: Synergizing Fetch, Insertion, and Replacement Policies for Graph Analytics. IEEE Transactions on Computers, 72(1):291–304, Jan 2023. DOI: 10.1109/TC.2022.3157525
  • The Implications of Page Size Management on Graph Analytics. IEEE IISWC 2022, pp. 199–214, Nov 2022. DOI: 10.1109/IISWC55918.2022.00026
  • Dynamically Linked MSHRs for Adaptive Miss Handling in GPUs. ACM ICS 2019, pp. 510–521, Jun 2019. DOI: 10.1145/3330345.3330390
  • Linebacker. ISCA 2019, pp. 183–196, Jun 2019. DOI: 10.1145/3307650.3322222
  • CUDAAdvisor: LLVM-based Runtime Profiling for Modern GPUs. CGO 2018, pp. 214–227, Feb 2018. DOI: 10.1145/3168831
  • FineReg. MICRO 2018, pp. 364–376, Oct 2018. DOI: 10.1109/MICRO.2018.00037
  • Locality-Aware CTA Clustering for Modern GPUs. ACM SIGARCH Computer Architecture News, 45(1):297–311, Apr 2017. DOI: 10.1145/3093337.3037709
    Published In

    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016, 547 pages
    ISBN: 9781450343619
    DOI: 10.1145/2925426
    © 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Cache Organization
    2. GPGPU
    3. Spatial Locality

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

