Research Article · Public Access
DOI: 10.1145/2925426.2926253

Tag-Split Cache for Efficient GPGPU Cache Utilization

Published: 01 June 2016

Abstract

Modern GPUs employ caches to improve memory system efficiency. However, a large amount of cache space is underutilized due to the irregular memory accesses and poor spatial locality commonly exhibited by GPU applications. Our experiments show that using smaller cache lines can improve cache space utilization, but doing so frequently incurs significant performance loss by introducing a large number of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enables fine-grained cache storage to address cache space underutilization while keeping the number of memory requests unchanged. TSC divides the tag into two parts to reduce storage overhead, and it supports multiple cache line replacements in one cycle. TSC can also automatically adjust the cache storage granularity to avoid performance loss for applications with good spatial locality. Our evaluation shows that TSC improves the baseline cache performance by 17.2% on average across a wide range of applications. It also outperforms previous techniques significantly.
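The abstract's core mechanism — storing data at a finer granularity than the baseline line while letting sectors of the same aligned block share one stored tag — can be illustrated with a deliberately simplified model. The sketch below is not the paper's design: the class names, parameters (`LINE`, `SECTOR`, `SETS`, `WAYS`), the trivial replacement policy, and the presence-bit encoding are all illustrative assumptions reflecting one plausible reading of "dividing the tag into two parts" (a common high tag per frame plus a small per-sector part), and the granularity-adaptation and single-cycle multi-line replacement mechanisms are omitted.

```python
# Hypothetical sketch of a tag-split lookup, based only on the abstract:
# several 32 B sectors of one 128 B-aligned block share a single stored
# "common" tag, so fine-grained storage does not quadruple tag overhead.

LINE = 128          # baseline cache-line size in bytes (illustrative)
SECTOR = 32         # fine-grained storage unit (illustrative)
SECTORS = LINE // SECTOR
SETS, WAYS = 64, 4  # illustrative geometry

class Frame:
    def __init__(self):
        self.tag = None                   # common (high) part of the tag
        self.present = [False] * SECTORS  # split part: one bit per sector

class TagSplitCache:
    def __init__(self):
        self.sets = [[Frame() for _ in range(WAYS)] for _ in range(SETS)]

    def _split(self, addr):
        block = addr // LINE
        return block // SETS, block % SETS, (addr % LINE) // SECTOR

    def access(self, addr):
        """Return True on a hit; on a miss, fill only the needed sector."""
        tag, index, sector = self._split(addr)
        frames = self.sets[index]
        for f in frames:
            if f.tag == tag:
                if f.present[sector]:
                    return True
                f.present[sector] = True  # block tracked, sector absent: fetch one sector
                return False
        victim = frames[0]                # trivial replacement, enough for the sketch
        victim.tag = tag
        victim.present = [False] * SECTORS
        victim.present[sector] = True
        return False
```

A second access to a different sector of the same block misses only on the 32 B sector, not the whole 128 B line, while reusing the already-stored common tag — which is the space-saving effect the abstract describes.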


Cited By

  • Graphfire: Synergizing Fetch, Insertion, and Replacement Policies for Graph Analytics. IEEE Transactions on Computers, 72(1):291–304, Jan 2023. DOI: 10.1109/TC.2022.3157525
  • The Implications of Page Size Management on Graph Analytics. IEEE IISWC 2022, pp. 199–214, Nov 2022. DOI: 10.1109/IISWC55918.2022.00026
  • Dynamically Linked MSHRs for Adaptive Miss Handling in GPUs. ACM ICS 2019, pp. 510–521, Jun 2019. DOI: 10.1145/3330345.3330390
  • Linebacker. ISCA 2019, pp. 183–196, Jun 2019. DOI: 10.1145/3307650.3322222
  • CUDAAdvisor: LLVM-based Runtime Profiling for Modern GPUs. CGO 2018, pp. 214–227, Feb 2018. DOI: 10.1145/3168831
  • FineReg. MICRO 2018, pp. 364–376, Oct 2018. DOI: 10.1109/MICRO.2018.00037
  • Locality-Aware CTA Clustering for Modern GPUs. ACM SIGARCH Computer Architecture News, 45(1):297–311, Apr 2017. DOI: 10.1145/3093337.3037709
    Published In

    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016, 547 pages
    ISBN: 9781450343619
    DOI: 10.1145/2925426
    © 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Cache Organization
    2. GPGPU
    3. Spatial Locality

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

