ABSTRACT
Recent GPUs are equipped with general-purpose L1 and L2 caches to reduce memory bandwidth demand and improve the performance of irregular GPGPU applications. However, due to massive multithreading, GPGPU caches suffer from severe resource contention and low data sharing, which may instead degrade performance. In this work, we propose three techniques to utilize GPGPU caches more efficiently and improve their performance. The first technique dynamically detects and bypasses memory accesses that exhibit streaming behavior. The second technique, dynamic warp throttling via cores sampling (DWT-CS), alleviates cache thrashing by throttling the number of active warps per core. DWT-CS monitors misses per kilo-instruction (MPKI) at the L1 cache; when MPKI exceeds a threshold, GPU cores are sampled with different numbers of active warps to find the warp count that mitigates thrashing and achieves the highest performance. The third technique addresses GPU cache associativity, since many GPGPU applications suffer from severe associativity stalls and conflict misses. Whereas prior work proposed bypassing the cache on associativity stalls, we instead employ a better cache indexing function, Pseudo Random Interleaving Cache (PRIC), based on polynomial modulus mapping, to distribute memory accesses fairly and evenly across cache sets. The proposed techniques improve the average performance of streaming and contention applications by 1.2X and 2.3X respectively, and achieve 1.7X and 1.5X performance improvements over the Cache-Conscious Wavefront Scheduler and the Memory Request Prioritization Buffer respectively.
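To make the first technique concrete, the sketch below shows one plausible way a per-load-PC profiler could flag streaming accesses for L1 bypass. The table organization, the 64-fill warm-up, and the 1-in-16 reuse threshold are our illustrative assumptions, not parameters from the paper.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical per-PC reuse profiler: a load whose cache blocks are
// rarely re-referenced before eviction is treated as streaming, and its
// future requests bypass the L1 cache.
class StreamDetector {
    struct Entry { uint32_t fills = 0, reuses = 0; };
    std::unordered_map<uint64_t, Entry> table_;  // keyed by load PC
public:
    void on_fill(uint64_t pc)  { table_[pc].fills++;  }  // block allocated
    void on_reuse(uint64_t pc) { table_[pc].reuses++; }  // block hit again
    bool should_bypass(uint64_t pc) const {
        auto it = table_.find(pc);
        if (it == table_.end() || it->second.fills < 64)
            return false;  // not enough samples yet (warm-up is our choice)
        // Bypass when fewer than 1 in 16 filled blocks saw any reuse.
        return it->second.reuses * 16 < it->second.fills;
    }
};
```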
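The DWT-CS decision step can be sketched as follows. This is our reconstruction of the mechanism described above: sampling is triggered only when L1 MPKI crosses the threshold, and the warp limit of the best-performing core is adopted globally. The statistics structure, field names, and IPC-based selection are assumptions for illustration.

```cpp
#include <algorithm>
#include <vector>

// Per-core statistics gathered during one sampling interval, in which
// each core runs with a different limit on active warps (hypothetical
// structure; field names are ours).
struct CoreSample {
    int    active_warps;   // warp limit tried on this core
    double instructions;   // instructions retired in the interval
    double cycles;         // cycles elapsed in the interval
    double ipc() const { return instructions / cycles; }
};

// If L1 MPKI is below the threshold, keep the current warp limit;
// otherwise adopt the limit of the core with the highest sampled IPC.
int pick_warp_limit(const std::vector<CoreSample>& samples,
                    double l1_mpki, double mpki_threshold,
                    int current_limit) {
    if (l1_mpki <= mpki_threshold || samples.empty())
        return current_limit;   // no thrashing detected: keep current TLP
    auto best = std::max_element(samples.begin(), samples.end(),
        [](const CoreSample& a, const CoreSample& b) {
            return a.ipc() < b.ipc();
        });
    return best->active_warps;  // throttle every core to the winning count
}
```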
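Polynomial modulus mapping, following Rau's pseudo-random interleaving, treats the block address as a polynomial over GF(2) and uses its remainder modulo an irreducible polynomial as the set index. A minimal sketch, assuming a 32-set cache and the primitive polynomial x^5 + x^2 + 1 (the polynomial choice is illustrative, not the paper's):

```cpp
#include <cstdint>

// Remainder of a(x) modulo p(x) over GF(2), computed by carry-less long
// division. 'deg' is the degree of p(x); the remainder fits in 'deg' bits.
uint32_t gf2_mod(uint64_t a, uint32_t p, int deg) {
    for (int i = 63; i >= deg; --i)
        if (a & (1ULL << i))
            a ^= static_cast<uint64_t>(p) << (i - deg);  // cancel bit i
    return static_cast<uint32_t>(a);
}

// Example set-index function for a 32-set cache: reduce the block address
// modulo x^5 + x^2 + 1 (binary 100101), irreducible over GF(2).
uint32_t pric_set_index(uint64_t block_addr) {
    return gf2_mod(block_addr, 0x25, 5);
}
```

Because the divisor is fixed, this reduction collapses in hardware to XORs of fixed groups of address bits (a GF(2) matrix multiply), so it adds negligible latency to set-index computation.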
REFERENCES
- AMD. AMD’s Graphics Core Next Architecture whitepaper.
- A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, 2009.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, 2009.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 68(10):1370–1380, 2008.
- S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. Accelerating compute-intensive applications with GPUs and FPGAs. In Application Specific Processors, 2008. SASP 2008. Symposium on, pages 101–107. IEEE, 2008.
- X. Chen, L.-W. Chang, C. I. Rodrigues, L. Ji, Z. Wang, and W.-m. Hwu. Adaptive cache management for energy-efficient GPU computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014.
- X. Chen, S. Wu, L.-W. Chang, W.-S. Huang, C. Pearson, Z. Wang, and W.-M. W. Hwu. Adaptive cache bypass and insertion for many-core accelerators. In Proceedings of the International Workshop on Manycore Embedded Systems, MES ’14, 2014.
- E. S. Chung, P. A. Milder, J. C. Hoe, and K. Mai. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pages 225–236. IEEE Computer Society, 2010.
- A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
- N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum. Improving cache management policies using dynamic reuse distances. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 389–400. IEEE Computer Society, 2012.
- W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 407–420. IEEE Computer Society, 2007.
- A. González, M. Valero, N. Topham, and J. M. Parcerisa. Eliminating cache conflict misses through XOR-based placement functions. In Proceedings of the 11th International Conference on Supercomputing, pages 76–83. ACM, 1997.
- S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a high-level language targeted to GPU codes. In Innovative Parallel Computing (InPar), 2012.
- L. Gwennap. Sandy Bridge spans generations. Microprocessor Report, 9(27):10–01, 2010.
- D. T. Harper and J. R. Jump. Vector access performance in parallel memories using a skewed storage scheme. IEEE Transactions on Computers, 1987.
- M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 1989.
- W. Jia, K. A. Shaw, and M. Martonosi. MRPB: Memory request prioritization for massively parallel processors. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, 2014.
- A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 395–406. ACM, 2013.
- O. Kayıran, A. Jog, M. T. Kandemir, and C. R. Das. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pages 157–166. IEEE Press, 2013.
- S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31(5):7–17, 2011.
- M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee. Using prime numbers for cache indexing to eliminate conflict misses. In High Performance Computer Architecture (HPCA), 2004 IEEE 10th International Symposium on, 2004.
- D. Kirk and W.-m. W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
- D. H. Lawrie and C. R. Vora. The prime memory system for array access. IEEE Transactions on Computers, 1982.
- J. Lee and H. Kim. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. In High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pages 1–12, 2012.
- M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu. Improving GPGPU resource utilization through alternative thread block scheduling. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 260–271, Feb 2014.
- D. Li. Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processors. PhD thesis, The University of Texas at Austin, May 2014.
- MathWorld. mathworld.wolfram.com/IrreduciblePolynomial.html.
- R. Meltzer, C. Zeng, and C. Cecka. Micro-benchmarking the C2070. In GPU Technology Conference, 2013.
- J. Nickolls and W. J. Dally. The GPU computing era. IEEE Micro, 30(2):56–69, 2010.
- NVIDIA. CUDA C Programming Guide v5.5.
- NVIDIA. CUDA C/C++ SDK Code Samples. http://developer.nvidia.com/cuda-cc-sdk-codesamples.
- NVIDIA. NVIDIA Next Generation CUDA Compute Architecture: Kepler GK110.
- OpenCL. The OpenCL Specification version 2.0. http://www.khronos.org.
- J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.
- P. Micikevicius. GPU performance analysis and optimization, 2012.
- M. K. Qureshi, D. Thompson, and Y. N. Patt. The V-Way cache: Demand-based associativity via global replacement. In Computer Architecture, 2005. ISCA ’05. Proceedings. 32nd International Symposium on, 2005.
- B. R. Rau. Pseudo-randomly interleaved memory. In Proceedings of the 18th Annual International Symposium on Computer Architecture, ISCA ’91, 1991.
- T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 72–83. IEEE Computer Society, 2012.
- T. G. Rogers, M. O’Connor, and T. M. Aamodt. Divergence-aware warp scheduling. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.
- A. Seznec. A case for two-way skewed-associative caches. In ACM SIGARCH Computer Architecture News, pages 169–178. ACM, 1993.
- G. S. Sohi. Logical data skewing schemes for interleaved memories in vector processors. 1988.
- N. Topham, A. González, and J. González. The design and performance of a conflict-avoiding cache. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 1997.
- C. M. Wittenbrink, E. Kilgariff, and A. Prabhu. Fermi GF100 GPU architecture. IEEE Micro, 2011.
- Z. Zheng, Z. Wang, and M. Lipasti. Adaptive cache and concurrency allocation on GPGPUs. Computer Architecture Letters, 2014.