ABSTRACT
Recent GPUs are equipped with general-purpose L1 and L2 caches to reduce memory bandwidth demand and improve the performance of irregular GPGPU applications. However, due to massive multithreading, GPGPU caches suffer from severe resource contention and low data sharing, which may instead degrade performance. In this work, we propose three techniques to utilize GPGPU caches more efficiently and improve their performance. The first technique dynamically detects and bypasses memory accesses that exhibit streaming behavior. The second technique, dynamic warp throttling via cores sampling (DWT-CS), alleviates cache thrashing by throttling the number of active warps per core. DWT-CS monitors misses per kilo-instruction (MPKI) at the L1 cache; when MPKI exceeds a threshold, GPU cores are sampled with different numbers of active warps to find the warp count that mitigates thrashing and achieves the highest performance. The third technique addresses GPU cache associativity, since many GPGPU applications suffer from severe associativity stalls and conflict misses. Whereas prior work proposed bypassing the cache on associativity stalls, we instead employ a better cache indexing function, Pseudo Random Interleaving Cache (PRIC), based on polynomial modulus mapping, to distribute memory accesses fairly and evenly across cache sets. The proposed techniques improve the average performance of streaming and contention applications by 1.2X and 2.3X respectively, and achieve 1.7X and 1.5X performance improvements over the Cache-Conscious Wavefront Scheduler and the Memory Request Prioritization Buffer respectively.
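To make the first technique concrete, the sketch below shows one plausible way a per-load-PC profiler could flag streaming accesses for L1 bypass. The table organization, the 64-fill warm-up, and the 1-in-16 reuse threshold are our illustrative assumptions, not parameters from the paper.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical per-PC reuse profiler: a load whose cache blocks are
// rarely re-referenced before eviction is treated as streaming, and its
// future requests bypass the L1 cache.
class StreamDetector {
    struct Entry { uint32_t fills = 0, reuses = 0; };
    std::unordered_map<uint64_t, Entry> table_;  // keyed by load PC
public:
    void on_fill(uint64_t pc)  { table_[pc].fills++;  }  // block allocated
    void on_reuse(uint64_t pc) { table_[pc].reuses++; }  // block hit again
    bool should_bypass(uint64_t pc) const {
        auto it = table_.find(pc);
        if (it == table_.end() || it->second.fills < 64)
            return false;  // not enough samples yet (warm-up is our choice)
        // Bypass when fewer than 1 in 16 filled blocks saw any reuse.
        return it->second.reuses * 16 < it->second.fills;
    }
};
```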
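The DWT-CS decision step can be sketched as follows. This is our reconstruction of the mechanism described above: sampling is triggered only when L1 MPKI crosses the threshold, and the warp limit of the best-performing core is adopted globally. The statistics structure, field names, and IPC-based selection are assumptions for illustration.

```cpp
#include <algorithm>
#include <vector>

// Per-core statistics gathered during one sampling interval, in which
// each core runs with a different limit on active warps (hypothetical
// structure; field names are ours).
struct CoreSample {
    int    active_warps;   // warp limit tried on this core
    double instructions;   // instructions retired in the interval
    double cycles;         // cycles elapsed in the interval
    double ipc() const { return instructions / cycles; }
};

// If L1 MPKI is below the threshold, keep the current warp limit;
// otherwise adopt the limit of the core with the highest sampled IPC.
int pick_warp_limit(const std::vector<CoreSample>& samples,
                    double l1_mpki, double mpki_threshold,
                    int current_limit) {
    if (l1_mpki <= mpki_threshold || samples.empty())
        return current_limit;   // no thrashing detected: keep current TLP
    auto best = std::max_element(samples.begin(), samples.end(),
        [](const CoreSample& a, const CoreSample& b) {
            return a.ipc() < b.ipc();
        });
    return best->active_warps;  // throttle every core to the winning count
}
```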
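Polynomial modulus mapping, following Rau's pseudo-random interleaving, treats the block address as a polynomial over GF(2) and uses its remainder modulo an irreducible polynomial as the set index. A minimal sketch, assuming a 32-set cache and the primitive polynomial x^5 + x^2 + 1 (the polynomial choice is illustrative, not the paper's):

```cpp
#include <cstdint>

// Remainder of a(x) modulo p(x) over GF(2), computed by carry-less long
// division. 'deg' is the degree of p(x); the remainder fits in 'deg' bits.
uint32_t gf2_mod(uint64_t a, uint32_t p, int deg) {
    for (int i = 63; i >= deg; --i)
        if (a & (1ULL << i))
            a ^= static_cast<uint64_t>(p) << (i - deg);  // cancel bit i
    return static_cast<uint32_t>(a);
}

// Example set-index function for a 32-set cache: reduce the block address
// modulo x^5 + x^2 + 1 (binary 100101), irreducible over GF(2).
uint32_t pric_set_index(uint64_t block_addr) {
    return gf2_mod(block_addr, 0x25, 5);
}
```

Because the divisor is fixed, this reduction collapses in hardware to XORs of fixed groups of address bits (a GF(2) matrix multiply), so it adds negligible latency to set-index computation.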
REFERENCES
- AMD. AMD’s Graphics Core Next Architecture whitepaper.
- A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, 2009.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, 2009.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 68(10):1370–1380, 2008.
- S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. Accelerating compute-intensive applications with GPUs and FPGAs. In Application Specific Processors, 2008. SASP 2008. Symposium on, pages 101–107. IEEE, 2008.
- X. Chen, L.-W. Chang, C. I. Rodrigues, L. Ji, Z. Wang, and W.-m. Hwu. Adaptive cache management for energy-efficient GPU computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014.
- X. Chen, S. Wu, L.-W. Chang, W.-S. Huang, C. Pearson, Z. Wang, and W.-M. W. Hwu. Adaptive cache bypass and insertion for many-core accelerators. In Proceedings of the International Workshop on Manycore Embedded Systems, MES ’14, 2014.
- E. S. Chung, P. A. Milder, J. C. Hoe, and K. Mai. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pages 225–236. IEEE Computer Society, 2010.
- A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
- N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum. Improving cache management policies using dynamic reuse distances. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 389–400. IEEE Computer Society, 2012.
- W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 407–420. IEEE Computer Society, 2007.
- A. González, M. Valero, N. Topham, and J. M. Parcerisa. Eliminating cache conflict misses through XOR-based placement functions. In Proceedings of the 11th International Conference on Supercomputing, pages 76–83. ACM, 1997.
- S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a high-level language targeted to GPU codes. In Innovative Parallel Computing (InPar), 2012.
- L. Gwennap. Sandy Bridge spans generations. Microprocessor Report, 9(27):10–01, 2010.
- D. T. Harper and J. R. Jump. Vector access performance in parallel memories using a skewed storage scheme. IEEE Transactions on Computers, 1987.
- M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 1989.
- W. Jia, K. A. Shaw, and M. Martonosi. MRPB: Memory request prioritization for massively parallel processors. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, 2014.
- A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 395–406. ACM, 2013.
- O. Kayıran, A. Jog, M. T. Kandemir, and C. R. Das. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pages 157–166. IEEE Press, 2013.
- S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31(5):7–17, 2011.
- M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee. Using prime numbers for cache indexing to eliminate conflict misses. In High Performance Computer Architecture (HPCA), 2004 IEEE 10th International Symposium on, 2004.
- D. Kirk and W.-m. W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
- D. H. Lawrie and C. R. Vora. The prime memory system for array access. IEEE Transactions on Computers, 1982.
- J. Lee and H. Kim. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. In High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pages 1–12, 2012.
- M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu. Improving GPGPU resource utilization through alternative thread block scheduling. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 260–271, Feb 2014.
- D. Li. Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processors. PhD thesis, The University of Texas at Austin, May 2014.
- MathWorld. mathworld.wolfram.com/IrreduciblePolynomial.html.
- R. Meltzer, C. Zeng, and C. Cecka. Micro-benchmarking the C2070. In GPU Technology Conference, 2013.
- J. Nickolls and W. J. Dally. The GPU computing era. IEEE Micro, 30(2):56–69, 2010.
- NVIDIA. CUDA C Programming Guide v5.5.
- NVIDIA. CUDA C/C++ SDK Code Samples. http://developer.nvidia.com/cuda-cc-sdk-codesamples.
- NVIDIA. NVIDIA Next Generation CUDA Compute Architecture: Kepler GK110.
- OpenCL. The OpenCL Specification version 2.0. http://www.khronos.org.
- J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.
- P. Micikevicius. GPU performance analysis and optimization, 2012.
- M. K. Qureshi, D. Thompson, and Y. N. Patt. The V-Way cache: Demand-based associativity via global replacement. In Computer Architecture, 2005. ISCA ’05. Proceedings. 32nd International Symposium on, 2005.
- B. R. Rau. Pseudo-randomly interleaved memory. In Proceedings of the 18th Annual International Symposium on Computer Architecture, ISCA ’91, 1991.
- T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 72–83. IEEE Computer Society, 2012.
- T. G. Rogers, M. O’Connor, and T. M. Aamodt. Divergence-aware warp scheduling. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.
- A. Seznec. A case for two-way skewed-associative caches. In ACM SIGARCH Computer Architecture News, pages 169–178. ACM, 1993.
- G. S. Sohi. Logical data skewing schemes for interleaved memories in vector processors. 1988.
- N. Topham, A. González, and J. González. The design and performance of a conflict-avoiding cache. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 1997.
- C. M. Wittenbrink, E. Kilgariff, and A. Prabhu. Fermi GF100 GPU architecture. IEEE Micro, 2011.
- Z. Zheng, Z. Wang, and M. Lipasti. Adaptive cache and concurrency allocation on GPGPUs. Computer Architecture Letters, 2014.