research-article · DOI: 10.1145/2716282.2716291

Efficient utilization of GPGPU cache hierarchy

Published: 07 February 2015

ABSTRACT

Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of irregular GPGPU applications. However, due to massive multithreading, GPGPU caches suffer from severe resource contention and low data sharing, which may instead degrade performance. In this work, we propose three techniques to utilize GPGPU caches efficiently and improve their performance. The first technique dynamically detects and bypasses memory accesses that show streaming behavior. The second technique, dynamic warp throttling via cores sampling (DWT-CS), alleviates cache thrashing by throttling the number of active warps per core. DWT-CS monitors misses per kilo-instruction (MPKI) at the L1; when MPKI exceeds a specific threshold, all GPU cores are sampled with different numbers of active warps to find the number of warps that mitigates thrashing and achieves the highest performance. The third technique addresses GPU cache associativity, since many GPGPU applications suffer from severe associativity stalls and conflict misses. Prior work proposed cache bypassing on associativity stalls; instead of bypassing, we employ a better cache indexing function, Pseudo Random Interleaving Cache (PRIC), based on polynomial modulus mapping, to distribute memory accesses fairly and evenly over cache sets. The proposed techniques improve the average performance of streaming and contention applications by 1.2X and 2.3X respectively, and achieve 1.7X and 1.5X performance improvements over the Cache-Conscious Wavefront Scheduler and the Memory Request Prioritization Buffer respectively.
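The polynomial-modulus indexing idea behind PRIC can be illustrated with a small sketch. This is not the paper's implementation; it is a minimal, hypothetical model assuming a 32-set cache and the irreducible GF(2) polynomial x^5 + x^2 + 1. The block address, viewed as a polynomial with bits as coefficients, is reduced modulo the irreducible polynomial, and the remainder selects the cache set.

```python
def gf2_mod(value: int, poly: int) -> int:
    """Reduce `value` modulo `poly`, both viewed as GF(2) polynomials
    (carry-less arithmetic: subtraction is XOR)."""
    deg = poly.bit_length() - 1
    # Cancel the leading term of `value` until its degree drops below deg.
    while value and value.bit_length() - 1 >= deg:
        shift = (value.bit_length() - 1) - deg
        value ^= poly << shift
    return value

def pric_index(block_addr: int, num_sets: int = 32, poly: int = 0b100101) -> int:
    """Map a block address to a cache set index.
    `poly` = x^5 + x^2 + 1, an irreducible degree-5 polynomial, so the
    remainder is a 5-bit value addressing one of 32 sets (assumed sizes)."""
    assert num_sets == 1 << (poly.bit_length() - 1)
    return gf2_mod(block_addr, poly)

# Why this helps: a power-of-two stride that maps every access to set 0
# under conventional modulo indexing is spread over all sets by PRIC.
strided_blocks = [k << 5 for k in range(32)]          # stride of 32 blocks
conventional = {addr % 32 for addr in strided_blocks}  # one hot set
pric = {pric_index(addr) for addr in strided_blocks}   # all 32 sets
```

Because reduction modulo an irreducible polynomial is a linear, invertible map on the low-order bits, distinct strided addresses land in distinct sets, which is the fair, even distribution the abstract describes.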

References

  1. AMD. AMD's Graphics Core Next Architecture whitepaper.
  2. A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
  3. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IEEE International Symposium on Workload Characterization (IISWC), 2009.
  4. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 68(10):1370–1380, 2008.
  5. S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. Accelerating compute-intensive applications with GPUs and FPGAs. In Symposium on Application Specific Processors (SASP), pages 101–107. IEEE, 2008.
  6. X. Chen, L.-W. Chang, C. I. Rodrigues, L. Ji, Z. Wang, and W.-m. Hwu. Adaptive cache management for energy-efficient GPU computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
  7. X. Chen, S. Wu, L.-W. Chang, W.-S. Huang, C. Pearson, Z. Wang, and W.-M. W. Hwu. Adaptive cache bypass and insertion for many-core accelerators. In Proceedings of the International Workshop on Manycore Embedded Systems (MES), 2014.
  8. E. S. Chung, P. A. Milder, J. C. Hoe, and K. Mai. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 225–236. IEEE Computer Society, 2010.
  9. A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
  10. N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum. Improving cache management policies using dynamic reuse distances. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 389–400. IEEE Computer Society, 2012.
  11. W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 407–420. IEEE Computer Society, 2007.
  12. A. González, M. Valero, N. Topham, and J. M. Parcerisa. Eliminating cache conflict misses through XOR-based placement functions. In Proceedings of the 11th International Conference on Supercomputing, pages 76–83. ACM, 1997.
  13. S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a high-level language targeted to GPU codes. In Innovative Parallel Computing (InPar), 2012.
  14. L. Gwennap. Sandy Bridge spans generations. Microprocessor Report, 9(27), 2010.
  15. D. T. Harper and J. R. Jump. Vector access performance in parallel memories using a skewed storage scheme. IEEE Transactions on Computers, 1987.
  16. M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 1989.
  17. W. Jia, K. A. Shaw, and M. Martonosi. MRPB: Memory request prioritization for massively parallel processors. In IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014.
  18. A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 395–406. ACM, 2013.
  19. O. Kayıran, A. Jog, M. T. Kandemir, and C. R. Das. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 157–166. IEEE Press, 2013.
  20. S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31(5):7–17, 2011.
  21. M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee. Using prime numbers for cache indexing to eliminate conflict misses. 2004.
  22. D. Kirk and W. Wen-mei Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
  23. D. H. Lawrie and C. R. Vora. The prime memory system for array access. IEEE Transactions on Computers, 1982.
  24. J. Lee and H. Kim. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. In IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), pages 1–12, 2012.
  25. M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu. Improving GPGPU resource utilization through alternative thread block scheduling. In IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 260–271, February 2014.
  26. D. Li. Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processors. PhD thesis, The University of Texas at Austin, May 2014.
  27. MathWorld. mathworld.wolfram.com/IrreduciblePolynomial.html.
  28. R. Meltzer, C. Zeng, and C. Cecka. Micro-benchmarking the C2070. In GPU Technology Conference, 2013.
  29. J. Nickolls and W. J. Dally. The GPU computing era. IEEE Micro, 30(2):56–69, 2010.
  30. NVIDIA. CUDA C Programming Guide v5.5.
  31. NVIDIA. CUDA C/C++ SDK Code Samples. http://developer.nvidia.com/cuda-cc-sdk-codesamples.
  32. NVIDIA. NVIDIA Next Generation CUDA Compute Architecture: Kepler GK110.
  33. OpenCL. The OpenCL Specification version 2.0. http://www.khronos.org.
  34. J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.
  35. P. Micikevicius. GPU Performance Analysis and Optimization, 2012.
  36. M. K. Qureshi, D. Thompson, and Y. N. Patt. The V-Way cache: Demand-based associativity via global replacement. In Proceedings of the 32nd International Symposium on Computer Architecture (ISCA), 2005.
  37. B. R. Rau. Pseudo-randomly interleaved memory. In Proceedings of the 18th Annual International Symposium on Computer Architecture (ISCA), 1991.
  38. T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 72–83. IEEE Computer Society, 2012.
  39. T. G. Rogers, M. O'Connor, and T. M. Aamodt. Divergence-aware warp scheduling. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013.
  40. A. Seznec. A case for two-way skewed-associative caches. ACM SIGARCH Computer Architecture News, pages 169–178. ACM, 1993.
  41. G. S. Sohi. Logical data skewing schemes for interleaved memories in vector processors. 1988.
  42. N. Topham, A. González, and J. González. The design and performance of a conflict-avoiding cache. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), 1997.
  43. C. M. Wittenbrink, E. Kilgariff, and A. Prabhu. Fermi GF100 GPU architecture. IEEE Micro, 2011.
  44. Z. Zheng, Z. Wang, and M. Lipasti. Adaptive cache and concurrency allocation on GPGPUs. IEEE Computer Architecture Letters, 2014.

Published in

GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs
February 2015, 120 pages
ISBN: 9781450334075
DOI: 10.1145/2716282

        Copyright © 2015 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 57 of 129 submissions, 44%
