DOI: 10.1145/2751205.2751234

A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs

Published: 08 June 2015

ABSTRACT

General-Purpose Graphics Processing Units (GPGPUs) have been widely used in high-performance computing as application accelerators due to their massive parallelism and high throughput. A GPGPU generally contains two layers of schedulers, a cooperative-thread-array (CTA) scheduler and a warp scheduler, which together manage thread-level parallelism (TLP). Previous research shows that maximized TLP does not always deliver optimal performance. Unfortunately, existing warp scheduling schemes do not optimize TLP at runtime, so a fixed policy cannot fit the varied access patterns of diverse applications. Dynamic TLP optimization in the warp scheduler thus remains a challenge in exploiting the highly parallel compute power of GPGPUs.

In this paper, we comprehensively investigate the performance impact of TLP in the warp scheduler. Based on our analysis of pipeline efficiency, we propose Stall-Aware Warp Scheduling (SAWS), which optimizes TLP according to pipeline stalls. SAWS adds two modules to the original scheduler to adjust TLP dynamically at runtime, employing a trigger-based method for a fast tuning response. We implemented SAWS in GPGPU-Sim and conducted extensive experiments using 21 paradigmatic benchmarks. Our numerical results show that SAWS effectively improves pipeline efficiency by reducing structural hazards without introducing extra data hazards. SAWS achieves a geometric-mean speedup of 14.7%, higher even than the existing Two-Level scheduling scheme with optimal fetch group sizes over a wide range of benchmarks. More importantly, compared with dynamic TLP optimization in the CTA scheduler, SAWS still yields a 9.3% performance improvement across the benchmarks, showing that moving dynamic TLP optimization from the CTA scheduler to the warp scheduler is a competitive choice.
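The abstract does not describe SAWS's internals, but the mechanism it outlines, a trigger-based monitor that throttles or releases warps in response to pipeline stalls, can be sketched in a few lines. The sketch below is a hypothetical illustration under assumed parameters, not the paper's actual design: the `StallAwareThrottle` class, the sampling window, and the thresholds are all assumptions made for this example.

```cpp
// Hypothetical sketch of a stall-triggered TLP throttle; the names,
// thresholds, and sampling window are assumptions, not SAWS itself.
#include <algorithm>

class StallAwareThrottle {
public:
    explicit StallAwareThrottle(int max_warps)
        : max_warps_(max_warps), active_limit_(max_warps) {}

    // Called once per scheduler cycle with the hazard signals observed
    // in the issue stage during that cycle.
    void observe(bool structural_stall, bool data_stall) {
        ++cycles_;
        if (structural_stall) ++structural_stalls_;
        if (data_stall)       ++data_stalls_;
        if (cycles_ == kSampleWindow) adjust();
    }

    // The warp scheduler treats only warps [0, limit) as schedulable,
    // so lowering the limit reduces TLP and raising it restores TLP.
    int active_warp_limit() const { return active_limit_; }

private:
    void adjust() {
        // Trigger-based tuning: frequent structural stalls suggest too
        // many warps contending for pipeline resources, so throttle;
        // frequent data stalls suggest too little latency hiding, so
        // release another warp.
        if (structural_stalls_ * 100 > kStallPercent * cycles_)
            active_limit_ = std::max(kMinWarps, active_limit_ - 1);
        else if (data_stalls_ * 100 > kStallPercent * cycles_)
            active_limit_ = std::min(max_warps_, active_limit_ + 1);
        cycles_ = structural_stalls_ = data_stalls_ = 0;  // new window
    }

    static constexpr int kSampleWindow = 128;  // cycles per decision
    static constexpr int kStallPercent = 30;   // trigger threshold (%)
    static constexpr int kMinWarps     = 2;    // never starve the core

    const int max_warps_;
    int active_limit_;
    int cycles_ = 0, structural_stalls_ = 0, data_stalls_ = 0;
};
```

In a real GPU pipeline these counters would be driven by the scoreboard and issue-stage hazard logic, and the window size and thresholds would need tuning per architecture and workload.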



• Published in

      ICS '15: Proceedings of the 29th ACM International Conference on Supercomputing
      June 2015, 446 pages
      ISBN: 9781450335591
      DOI: 10.1145/2751205

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article

      Acceptance Rates

ICS '15 paper acceptance rate: 40 of 160 submissions (25%)
      Overall acceptance rate: 584 of 2,055 submissions (28%)
