ABSTRACT
General-Purpose Graphics Processing Units (GPGPUs) have been widely used in high-performance computing as application accelerators due to their massive parallelism and high throughput. A GPGPU generally contains two layers of schedulers, a cooperative-thread-array (CTA) scheduler and a warp scheduler, which together manage thread-level parallelism (TLP). Previous research shows that maximizing TLP does not always deliver optimal performance. Unfortunately, existing warp scheduling schemes do not optimize TLP at runtime, so no single static setting can fit the varied access patterns of diverse applications. Dynamic TLP optimization in the warp scheduler thus remains a challenge in exploiting the highly parallel compute power of GPGPUs.
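For readers unfamiliar with the two scheduler layers, the following C++ skeleton sketches how they interact. It is a minimal sketch under assumed names (CTAScheduler, WarpScheduler, kMaxWarpsPerCore, and the round-robin issue policy are all illustrative), not the organization of any real hardware or of the simulator used in this paper.

```cpp
#include <cstddef>
#include <vector>

// Illustrative skeleton of the two GPGPU scheduler layers described above.
// All names and limits are hypothetical assumptions for exposition.
struct Warp { bool ready = false; /* per-warp pipeline state elided */ };
struct CTA  { std::vector<Warp> warps; };

// Layer 1: the CTA scheduler dispatches thread blocks to a core as long as
// the core still has room (registers, shared memory, CTA slots).
class CTAScheduler {
public:
    bool dispatch(const CTA& cta, std::vector<Warp>& corePool) {
        if (corePool.size() + cta.warps.size() > kMaxWarpsPerCore) return false;
        corePool.insert(corePool.end(), cta.warps.begin(), cta.warps.end());
        return true;
    }
private:
    static constexpr std::size_t kMaxWarpsPerCore = 48; // illustrative limit
};

// Layer 2: each cycle, the warp scheduler selects a ready warp from the
// core's pool to issue; the selection policy (round-robin here) is where
// warp scheduling schemes differ.
class WarpScheduler {
public:
    Warp* pickWarp(std::vector<Warp>& pool) {
        if (pool.empty()) return nullptr;
        for (std::size_t i = 0; i < pool.size(); ++i) {
            Warp& w = pool[(next_ + i) % pool.size()];
            if (w.ready) {
                next_ = (next_ + i + 1) % pool.size();
                return &w;
            }
        }
        return nullptr; // no ready warp: the issue stage stalls this cycle
    }
private:
    std::size_t next_ = 0;
};
```

Together, the two layers determine TLP: the CTA scheduler fixes how many warps reside on a core, and the warp scheduler decides which of them compete for issue each cycle.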
In this paper, we comprehensively investigate the performance impact of TLP in the warp scheduler. Based on our analysis of pipeline efficiency, we propose Stall-Aware Warp Scheduling (SAWS), which optimizes TLP according to pipeline stalls. SAWS adds two modules to the original scheduler to adjust TLP dynamically at runtime, and it employs a trigger-based method for a fast tuning response. We simulated SAWS on GPGPU-Sim and conducted extensive experiments with 21 paradigmatic benchmarks. Our numerical results show that SAWS effectively improves pipeline efficiency by reducing structural hazards without introducing extra data hazards. SAWS achieves a geometric-mean speedup of 14.7%, higher even than the existing Two-Level scheduling scheme configured with its optimal fetch group sizes, over a wide range of benchmarks. More importantly, compared with dynamic TLP optimization in the CTA scheduler, SAWS still delivers a 9.3% performance improvement across the benchmarks, showing that moving dynamic TLP optimization from the CTA scheduler to the warp scheduler is a competitive choice.
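As a minimal illustration of the trigger-based idea, consider the C++ sketch below of a controller that throttles or raises the warp scheduler's active-warp limit based on observed pipeline stalls. The class name, the sampling window, and the watermark thresholds are illustrative assumptions, not the paper's actual implementation.

```cpp
#include <cstdint>

// Hypothetical sketch in the spirit of stall-aware TLP tuning: count
// structural-stall cycles over a sampling window and nudge the number of
// warps eligible for issue up or down. Thresholds are made-up examples.
class StallAwareTLPController {
public:
    StallAwareTLPController(unsigned minWarps, unsigned maxWarps)
        : activeWarps_(maxWarps), minWarps_(minWarps), maxWarps_(maxWarps) {}

    // Called once per cycle by the warp scheduler model.
    void onCycle(bool structuralStall) {
        if (structuralStall) ++stallCycles_;
        if (++windowCycles_ == kWindow) adjust();
    }

    // Upper bound on warps the scheduler may consider this cycle.
    unsigned activeWarpLimit() const { return activeWarps_; }

private:
    void adjust() {
        // Trigger-based tuning: if structural stalls dominate the window,
        // throttle TLP; if the pipeline is mostly stall-free, raise it.
        if (stallCycles_ > kHighWatermark && activeWarps_ > minWarps_)
            --activeWarps_;
        else if (stallCycles_ < kLowWatermark && activeWarps_ < maxWarps_)
            ++activeWarps_;
        stallCycles_ = 0;
        windowCycles_ = 0;
    }

    static constexpr std::uint32_t kWindow = 1024;       // cycles per window
    static constexpr std::uint32_t kHighWatermark = 512; // throttle threshold
    static constexpr std::uint32_t kLowWatermark = 128;  // boost threshold

    unsigned activeWarps_;
    const unsigned minWarps_;
    const unsigned maxWarps_;
    std::uint32_t stallCycles_ = 0;
    std::uint32_t windowCycles_ = 0;
};
```

The design point this sketch highlights is the trade-off the abstract describes: reacting per window with watermark triggers gives a fast tuning response while avoiding oscillation from adjusting on every individual stall.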