research-article

Harmonia: balancing compute and memory power in high-performance GPUs

Authors:

Sudhakar Yalamanchili

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S

Pages 54 - 65

https://doi.org/10.1145/2872887.2750404

Published: 13 June 2015

Abstract

In this paper, we address the problem of efficiently managing the relative power demands of a high-performance GPU and its memory subsystem. We develop a management approach that dynamically tunes the hardware operating configurations to maintain balance between the power dissipated in compute versus memory access across GPGPU application phases. Our goal is to reduce power with minimal performance degradation.

Accordingly, we construct predictors that assess the online sensitivity of applications to three hardware tunables: compute frequency, number of active compute units, and memory bandwidth. Using these sensitivity predictors, we propose a two-level coordinated power management scheme, Harmonia, which coordinates the hardware power states of the GPU and the memory system. Through hardware measurements on a commodity GPU, we evaluate Harmonia against a state-of-the-practice commodity GPU power management scheme, as well as an oracle scheme. Results show that Harmonia improves measured energy-delay squared (ED²) by up to 36% (12% on average) with negligible performance loss across representative GPGPU workloads, and on average is within 3% of the oracle scheme.
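
The ED² figure of merit quoted above is the product of energy and the square of execution time, so it penalizes slowdowns more heavily than plain energy does. The following is a minimal sketch of how such a relative improvement could be computed. The function names and the sample measurements are hypothetical and are not taken from the paper.

```python
def ed2(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay-squared product, E * D^2 (lower is better)."""
    return energy_joules * delay_seconds ** 2


def ed2_improvement(baseline: float, managed: float) -> float:
    """Fractional ED^2 reduction of the managed run relative to the baseline."""
    return (baseline - managed) / baseline


# Hypothetical measurements for one workload (assumed values, not from the paper).
baseline = ed2(energy_joules=100.0, delay_seconds=2.0)  # default power management
managed = ed2(energy_joules=80.0, delay_seconds=2.0)    # coordinated scheme, same runtime
improvement = ed2_improvement(baseline, managed)

print(f"ED^2 improvement: {improvement:.1%}")
```

Because delay enters squared, a scheme that saves energy but lengthens runtime can easily end up with a worse ED², which is why the abstract pairs the ED² gain with the claim of negligible performance loss.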

Index Terms

Harmonia: balancing compute and memory power in high-performance GPUs
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Hardware
  1. Electronic design automation
    1. Physical design (EDA)
  2. Hardware validation

Published In

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S

ISCA'15

June 2015

745 pages

ISSN:0163-5964

DOI:10.1145/2872887

Editor:
Doug DeGroot
acm dot org

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN:9781450334020
DOI:10.1145/2749469
General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States
