ABSTRACT
General-Purpose Graphics Processing Units (GPGPUs) have been widely used in high-performance computing as application accelerators due to their massive parallelism and high throughput. A GPGPU generally contains two layers of schedulers, a cooperative-thread-array (CTA) scheduler and a warp scheduler, which together manage thread-level parallelism (TLP). Previous research shows that maximizing TLP does not always deliver optimal performance. Unfortunately, existing warp scheduling schemes do not optimize TLP at runtime, so no single static setting can fit the varied access patterns of diverse applications. Dynamic TLP optimization in the warp scheduler thus remains a challenge in exploiting the highly parallel compute power of GPGPUs.
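For readers unfamiliar with the two scheduler layers, the following C++ skeleton sketches how they interact. It is a minimal sketch under assumed names (CTAScheduler, WarpScheduler, kMaxWarpsPerCore, and the round-robin issue policy are all illustrative), not the organization of any real hardware or of the simulator used in this paper.

```cpp
#include <cstddef>
#include <vector>

// Illustrative skeleton of the two GPGPU scheduler layers described above.
// All names and limits are hypothetical assumptions for exposition.
struct Warp { bool ready = false; /* per-warp pipeline state elided */ };
struct CTA  { std::vector<Warp> warps; };

// Layer 1: the CTA scheduler dispatches thread blocks to a core as long as
// the core still has room (registers, shared memory, CTA slots).
class CTAScheduler {
public:
    bool dispatch(const CTA& cta, std::vector<Warp>& corePool) {
        if (corePool.size() + cta.warps.size() > kMaxWarpsPerCore) return false;
        corePool.insert(corePool.end(), cta.warps.begin(), cta.warps.end());
        return true;
    }
private:
    static constexpr std::size_t kMaxWarpsPerCore = 48; // illustrative limit
};

// Layer 2: each cycle, the warp scheduler selects a ready warp from the
// core's pool to issue; the selection policy (round-robin here) is where
// warp scheduling schemes differ.
class WarpScheduler {
public:
    Warp* pickWarp(std::vector<Warp>& pool) {
        if (pool.empty()) return nullptr;
        for (std::size_t i = 0; i < pool.size(); ++i) {
            Warp& w = pool[(next_ + i) % pool.size()];
            if (w.ready) {
                next_ = (next_ + i + 1) % pool.size();
                return &w;
            }
        }
        return nullptr; // no ready warp: the issue stage stalls this cycle
    }
private:
    std::size_t next_ = 0;
};
```

Together, the two layers determine TLP: the CTA scheduler fixes how many warps reside on a core, and the warp scheduler decides which of them compete for issue each cycle.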
In this paper, we comprehensively investigate the performance impact of TLP in the warp scheduler. Based on our analysis of pipeline efficiency, we propose Stall-Aware Warp Scheduling (SAWS), which optimizes TLP according to pipeline stalls. SAWS adds two modules to the original scheduler to adjust TLP dynamically at runtime, and it employs a trigger-based method for a fast tuning response. We simulated SAWS on GPGPU-Sim and conducted extensive experiments with 21 paradigmatic benchmarks. Our numerical results show that SAWS effectively improves pipeline efficiency by reducing structural hazards without introducing extra data hazards. SAWS achieves a geometric-mean speedup of 14.7%, higher even than the existing Two-Level scheduling scheme configured with its optimal fetch group sizes, over a wide range of benchmarks. More importantly, compared with dynamic TLP optimization in the CTA scheduler, SAWS still delivers a 9.3% performance improvement across the benchmarks, showing that moving dynamic TLP optimization from the CTA scheduler to the warp scheduler is a competitive choice.
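As a minimal illustration of the trigger-based idea, consider the C++ sketch below of a controller that throttles or raises the warp scheduler's active-warp limit based on observed pipeline stalls. The class name, the sampling window, and the watermark thresholds are illustrative assumptions, not the paper's actual implementation.

```cpp
#include <cstdint>

// Hypothetical sketch in the spirit of stall-aware TLP tuning: count
// structural-stall cycles over a sampling window and nudge the number of
// warps eligible for issue up or down. Thresholds are made-up examples.
class StallAwareTLPController {
public:
    StallAwareTLPController(unsigned minWarps, unsigned maxWarps)
        : activeWarps_(maxWarps), minWarps_(minWarps), maxWarps_(maxWarps) {}

    // Called once per cycle by the warp scheduler model.
    void onCycle(bool structuralStall) {
        if (structuralStall) ++stallCycles_;
        if (++windowCycles_ == kWindow) adjust();
    }

    // Upper bound on warps the scheduler may consider this cycle.
    unsigned activeWarpLimit() const { return activeWarps_; }

private:
    void adjust() {
        // Trigger-based tuning: if structural stalls dominate the window,
        // throttle TLP; if the pipeline is mostly stall-free, raise it.
        if (stallCycles_ > kHighWatermark && activeWarps_ > minWarps_)
            --activeWarps_;
        else if (stallCycles_ < kLowWatermark && activeWarps_ < maxWarps_)
            ++activeWarps_;
        stallCycles_ = 0;
        windowCycles_ = 0;
    }

    static constexpr std::uint32_t kWindow = 1024;       // cycles per window
    static constexpr std::uint32_t kHighWatermark = 512; // throttle threshold
    static constexpr std::uint32_t kLowWatermark = 128;  // boost threshold

    unsigned activeWarps_;
    const unsigned minWarps_;
    const unsigned maxWarps_;
    std::uint32_t stallCycles_ = 0;
    std::uint32_t windowCycles_ = 0;
};
```

The design point this sketch highlights is the trade-off the abstract describes: reacting per window with watermark triggers gives a fast tuning response while avoiding oscillation from adjusting on every individual stall.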