DOI: 10.1145/2925426.2926281
research-article
Public Access

Origami: Folding Warps for Energy Efficient GPUs

Published: 01 June 2016

Abstract

Graphics processing units (GPUs) are increasingly used to run a wide range of general-purpose applications. Because application parallelism varies widely and applications carry inherent inefficiencies, GPUs experience significant idle periods. In this work, we first show that significant fine-grain pipeline bubbles exist regardless of warp scheduling policy or workload. We propose to convert these bubbles into energy-saving opportunities using Origami. Origami consists of two components: Warp Folding and the Origami scheduler. With Warp Folding, a warp is split into two half-warps that are issued in succession. Warp Folding leaves half of the execution lanes idle, and that idleness is exploited to improve energy efficiency through power gating. The Origami scheduler is a new warp scheduler that is cognizant of the Warp Folding process and tries to further extend the sleep times of idle execution lanes. By combining the two techniques, Origami can save 49% and 46% of the leakage energy in the integer and floating-point pipelines, respectively. These savings are better than, or at least on par with, those of Warped-Gates, a prior technique that power gates the entire cluster of execution lanes. Unlike Warped-Gates, Origami does not force idleness on execution lanes, an approach that leads to performance losses. Hence, Origami achieves these energy savings with virtually no performance overhead.
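
The Warp Folding idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation. It assumes a 32-thread warp with one execution lane per thread, and it assumes the two half-warps are formed from the lower and upper thread IDs and both run on the lower half of the lanes. None of these details are stated in the abstract, and the function names are hypothetical.

```python
# Illustrative sketch only. Assumptions (not from the abstract): 32 threads per
# warp, one lane per thread, half-warps split by thread ID, and both half-warps
# mapped onto the lower half of the lanes.
WARP_SIZE = 32


def issue_unfolded(warp):
    """Issue the whole warp in one slot: every thread gets its own lane."""
    return [list(warp)]


def issue_folded(warp):
    """Issue the warp as two half-warps in two successive slots."""
    half = len(warp) // 2
    return [list(warp[:half]), list(warp[half:])]


def idle_lanes(slots, num_lanes=WARP_SIZE):
    """Return the lanes that receive no thread in any of the given issue slots."""
    busy = set()
    for slot in slots:
        # Thread at position i within a slot runs on lane i.
        busy.update(range(len(slot)))
    return sorted(set(range(num_lanes)) - busy)


warp = list(range(WARP_SIZE))
unfolded = issue_unfolded(warp)
folded = issue_folded(warp)

print("slots without folding:", len(unfolded), "idle lanes:", idle_lanes(unfolded))
print("slots with folding:   ", len(folded), "idle lanes:", idle_lanes(folded))
```

The sketch models only lane occupancy, not timing or power. It shows that folding trades one extra issue slot for a set of lanes that stay unused across both slots, which is the idleness the abstract says is exploited through power gating.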

    Published In

    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016
    547 pages
    ISBN:9781450343619
    DOI:10.1145/2925426

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPUs
    2. Leakage power
    3. Power gating
    4. SIMT lanes

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICS '16

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 183
    • Downloads (last 6 weeks): 26
    Reflects downloads up to 09 Mar 2025

    Cited By

    • (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 978-990. DOI: 10.1109/ISCA59077.2024.00075. Online publication date: 29-Jun-2024
    • (2021) MAPA. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3458817.3480853. Online publication date: 14-Nov-2021
    • (2021) BlockMaestro. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 333-346. DOI: 10.1109/ISCA52012.2021.00034. Online publication date: 14-Jun-2021
    • (2020) BOW: Breathing Operand Windows to Exploit Bypassing in GPUs. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 996-1008. DOI: 10.1109/MICRO50266.2020.00084. Online publication date: Oct-2020
    • (2020) Compiler-Directed Parallelism Scaling Framework for Performance Constrained Energy Optimization. IEEE Access, vol. 8, pp. 1733-1754. DOI: 10.1109/ACCESS.2019.2961568. Online publication date: 2020
    • (2019) CORF. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 701-714. DOI: 10.1145/3297858.3304026. Online publication date: 4-Apr-2019
    • (2018) Efficiently Managing the Impact of Hardware Variability on GPUs' Streaming Processors. ACM Transactions on Design Automation of Electronic Systems, 24:1, pp. 1-15. DOI: 10.1145/3287308. Online publication date: 21-Dec-2018
    • (2017) Pilot Register File: Energy Efficient Partitioned Register File for GPUs. 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 589-600. DOI: 10.1109/HPCA.2017.47. Online publication date: Feb-2017
