Origami: Folding Warps for Energy Efficient GPUs

Authors:

Mohammad Abdel-Majeed,

Murali Annavaram

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 41, Pages 1 - 12

https://doi.org/10.1145/2925426.2926281

Published: 01 June 2016

Abstract

Graphics processing units (GPUs) are increasingly used to run a wide range of general purpose applications. Due to wide variation in application parallelism and inherent application-level inefficiencies, GPUs experience significant idle periods. In this work, we first show that significant fine-grain pipeline bubbles exist regardless of warp scheduling policies or workloads. We propose to convert these bubbles into energy saving opportunities using Origami. Origami consists of two components: Warp Folding and the Origami scheduler. With Warp Folding, warps are split into two half-warps which are issued in succession. Warp Folding leaves half of the execution lanes idle, and this idleness is exploited to improve energy efficiency through power gating. The Origami scheduler is a new warp scheduler that is cognizant of the Warp Folding process and tries to further extend the sleep times of idle execution lanes. By combining the two techniques, Origami can save 49% and 46% of the leakage energy in the integer and floating point pipelines, respectively. These savings are better than, or at least on par with, those of Warped-Gates, a prior power gating technique that power gates the entire cluster of execution lanes. Unlike Warped-Gates, Origami achieves these savings without forcing idleness on execution lanes, an approach that leads to performance losses. Hence, Origami is able to achieve these energy savings with virtually no performance overhead.
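
To make the Warp Folding idea concrete, the following is a minimal toy model, not the authors' simulator or hardware design. It assumes a 32-thread warp and 32 execution lanes (values not stated in the abstract), and the names `fold_warp` and `issue` are hypothetical. It only counts how often each lane is used when warps are issued whole versus folded into two half-warps on the lower 16 lanes.

```python
WARP_SIZE = 32          # assumed threads per warp; not stated in the abstract
HALF = WARP_SIZE // 2   # lanes kept busy when a warp is folded (assumption)


def fold_warp(active_mask):
    """Split a per-thread active mask into two half-warp masks."""
    if len(active_mask) != WARP_SIZE:
        raise ValueError("expected one flag per thread in the warp")
    return active_mask[:HALF], active_mask[HALF:]


def issue(warps, fold):
    """Return (per-lane busy counts, total issue cycles) for a list of warps.

    Unfolded: each warp occupies all lanes for one issue cycle.
    Folded:   each warp occupies lanes 0..HALF-1 for two back-to-back
              cycles, so the upper lanes are never used and could sleep.
    """
    busy = [0] * WARP_SIZE
    cycles = 0
    for mask in warps:
        parts = fold_warp(mask) if fold else (mask,)
        for part in parts:
            for lane, active in enumerate(part):
                busy[lane] += int(active)
            cycles += 1
    return busy, cycles


# Four fully active warps, issued folded.
warps = [[True] * WARP_SIZE for _ in range(4)]
busy_folded, cycles_folded = issue(warps, fold=True)
never_used = [lane for lane, count in enumerate(busy_folded) if count == 0]
print(cycles_folded, len(never_used))  # prints "8 16"
```

In this toy model folding doubles the issue cycles, while half of the lanes stay idle throughout and become candidates for power gating. The model does not capture the pipeline bubbles that, according to the abstract, let Origami absorb the extra issue slot with virtually no performance overhead.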

References

[1] The FreePDK process design kit. http://www.eda.ncsu.edu/wiki/FreePDK.

[2] Parboil benchmark suite. http://impact.crhc.illinois.edu/parboil.php.

[3] Nvidia's next generation CUDA compute architecture: Fermi. Technical report, Nvidia, 2009.

[4] Nvidia's next generation CUDA compute architecture: Kepler TM GK110. Technical report, Nvidia, 2012.

[5] M. Abdel-Majeed and M. Annavaram. Warped register file: A power efficient register file for GPGPUs. In Proceedings of the International Symposium on High Performance Computer Architecture, 2013.

[6] M. Abdel-Majeed, D. Wong, and M. Annavaram. Warped gates: Gating aware scheduling and power gating for GPGPUs. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.

[7] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2009.

[8] D. Bautista, J. Sahuquillo, H. Hassan, S. Petit, and J. Duato. A simple power-aware scheduling for multicore systems when running real-time applications. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008.

[9] N. Brunie, S. Collange, and G. Diamos. Simultaneous branch and warp interweaving for sustained GPU performance. In Proceedings of the 39th Annual International Symposium on Computer Architecture, 2012.

[10] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009.

[11] L. Chen and T. Pinkston. NoRD: Node-router decoupling for effective power-gating of on-chip routers. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.

[12] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: Simple techniques for reducing leakage power. In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002.

[13] W. Fung and T. Aamodt. Thread block compaction for efficient SIMT control flow. In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture, 2011.

[14] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007.

[15] M. Gebhart, D. Johnson, D. Tarjan, S. Keckler, W. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. In Proceedings of the 38th Annual International Symposium on Computer Architecture, 2011.

[16] S. Gilani, N. S. Kim, and M. Schulte. Power-efficient computing for compute-intensive GPGPU applications. In Proceedings of the IEEE 19th International Symposium on High Performance Computer Architecture, 2013.

[17] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose. Microarchitectural techniques for power gating of execution units. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, 2004.

[18] H. Jeon. Resource underutilization exploitation for power efficient and reliable throughput processor. PhD thesis, University of Southern California, 2015.

[19] H. Jeon and M. Annavaram. Warped-DMR: Light-weight error detection for GPGPU. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.

[20] A. Jog, O. Kayiran, A. Mishra, M. Kandemir, O. Mutlu, R. Iyer, and C. Das. Orchestrated scheduling and prefetching for GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013.

[21] J. Kao and A. Chandrakasan. Dual-threshold voltage techniques for low-power digital circuits. IEEE Journal of Solid-State Circuits, 2000.

[22] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi. GPUWattch: Enabling energy optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013.

[23] A. Lungu, P. Bose, A. Buyuktosunoglu, and D. J. Sorin. Dynamic power gating with quality guarantees. In Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design, 2009.

[24] N. Madan, A. Buyuktosunoglu, P. Bose, and M. Annavaram. A case for guarded power gating for multi-core processors. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, 2011.

[25] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: Eliminating server idle power. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009.

[26] D. Meisner and T. F. Wenisch. DreamWeaver: Architectural support for deep sleep. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2012.

[27] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011.

[28] I. Paul, W. Huang, M. Arora, and S. Yalamanchili. Harmonia: Balancing compute and memory power in high-performance GPUs. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015.

[29] M. Rhu and M. Erez. The dual-path execution model for efficient GPU control flow. In Proceedings of the IEEE 19th International Symposium on High Performance Computer Architecture, 2013.

[30] T. G. Rogers, D. R. Johnson, M. O'Connor, and S. W. Keckler. A variable warp size architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015.

[31] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.

[32] C. Scordino and G. Lipari. Using resource reservation techniques for power-aware scheduling. In Proceedings of the 4th ACM International Conference on Embedded Software, 2004.

[33] Y. Wang, S. Roy, and N. Ranganathan. Run-time power-gating in caches of GPUs for leakage energy savings. In Proceedings of the Design, Automation Test in Europe Conference Exhibition, 2012.

[34] Q. Xu and M. Annavaram. PATS: Pattern aware scheduling and power gating for GPGPUs. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, 2014.

[35] S. Yue, L. Chen, D. Zhu, T. M. Pinkston, and M. Pedram. Smart butterfly: Reducing static power dissipation of network-on-chip with core-state-awareness. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design, 2014.

Origami: Folding Warps for Energy Efficient GPUs
1. Computer systems organization
  1. Architectures

Information & Contributors

Information

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 7

Total Downloads: 631

Downloads (Last 12 months): 183

Downloads (Last 6 weeks): 26

Reflects downloads up to 09 Mar 2025

Citations

Cited By

Shoushtary M, Arnau J, Murgadas J, Gonzalez A (2024). Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 978-990. Online publication date: 29-Jun-2024.
https://doi.org/10.1109/ISCA59077.2024.00075
Ranganath K, Suetterlein J, Manzano J, Song S, Wong D, de Supinski B, Hall M, Gamblin T (2021). MAPA. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. Online publication date: 14-Nov-2021.
https://dl.acm.org/doi/10.1145/3458817.3480853
Abdolrashidi A, Esfeden H, Jahanshahi A, Singh K, Abu-Ghazaleh N, Wong D, Martínez J, Duato J, John L (2021). BlockMaestro. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 333-346. Online publication date: 14-Jun-2021.
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00034
Esfeden H, Abdolrashidi A, Rahman S, Wong D, Abu-Ghazaleh N (2020). BOW: Breathing Operand Windows to Exploit Bypassing in GPUs. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 996-1008. Online publication date: Oct-2020.
https://doi.org/10.1109/MICRO50266.2020.00084
Ma Y (2020). Compiler-Directed Parallelism Scaling Framework for Performance Constrained Energy Optimization. IEEE Access, 8, pp. 1733-1754. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2019.2961568
Asghari Esfeden H, Khorasani F, Jeon H, Wong D, Abu-Ghazaleh N, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). CORF. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 701-714. Online publication date: 4-Apr-2019.
https://dl.acm.org/doi/10.1145/3297858.3304026
Tan J, Yan K (2018). Efficiently Managing the Impact of Hardware Variability on GPUs' Streaming Processors. ACM Transactions on Design Automation of Electronic Systems, 24:1, pp. 1-15. Online publication date: 21-Dec-2018.
https://dl.acm.org/doi/10.1145/3287308
Abdel-Majeed M, Shafaei A, Jeon H, Pedram M, Annavaram M (2017). Pilot Register File: Energy Efficient Partitioned Register File for GPUs. 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 589-600. Online publication date: Feb-2017.
https://doi.org/10.1109/HPCA.2017.47
