research-article

Harmonia: balancing compute and memory power in high-performance GPUs

Published: 13 June 2015

Abstract

In this paper, we address the problem of efficiently managing the relative power demands of a high-performance GPU and its memory subsystem. We develop a management approach that dynamically tunes the hardware operating configurations to maintain balance between the power dissipated in compute versus memory access across GPGPU application phases. Our goal is to reduce power with minimal performance degradation.
Accordingly, we construct predictors that assess the online sensitivity of applications to three hardware tunables: compute frequency, number of active compute units, and memory bandwidth. Using these sensitivity predictors, we propose a two-level coordinated power management scheme, Harmonia, which coordinates the hardware power states of the GPU and the memory system. Through hardware measurements on a commodity GPU, we evaluate Harmonia against a state-of-the-practice commodity GPU power management scheme, as well as an oracle scheme. Results show that Harmonia improves measured energy-delay squared (ED²) by up to 36% (12% on average) with negligible performance loss across representative GPGPU workloads, and on average is within 3% of the oracle scheme.
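
As a rough illustration of the energy-delay squared metric used in the evaluation, the sketch below computes E × D² for two runs and the fractional improvement of one over the other. The `Run` type, the function names, and the power and timing numbers are all hypothetical and chosen only for illustration; they are not measurements or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One measured run of a kernel under a given hardware configuration."""
    avg_power_w: float  # average power in watts (hypothetical value)
    delay_s: float      # execution time in seconds (hypothetical value)


def energy_j(run: Run) -> float:
    """Energy in joules, approximated as average power times execution time."""
    return run.avg_power_w * run.delay_s


def ed2(run: Run) -> float:
    """Energy-delay-squared product, E * D^2 (lower is better)."""
    return energy_j(run) * run.delay_s ** 2


def ed2_improvement(baseline: Run, tuned: Run) -> float:
    """Fractional ED^2 reduction of `tuned` relative to `baseline`."""
    return 1.0 - ed2(tuned) / ed2(baseline)


# Hypothetical example: the tuned configuration draws 10% less power
# but runs 2% longer than the baseline.
baseline = Run(avg_power_w=200.0, delay_s=10.0)
tuned = Run(avg_power_w=180.0, delay_s=10.2)
print(f"ED2 improvement: {ed2_improvement(baseline, tuned):.1%}")
```

Because delay enters cubed (once through energy and twice through D²), ED² penalizes slowdowns heavily, so a configuration only scores well when its power savings come with very little performance loss.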

Published In

ACM SIGARCH Computer Architecture News  Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
  • ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
    June 2015
    768 pages
    ISBN:9781450334020
    DOI:10.1145/2749469
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Cited By

  • (2024) PC-oriented Prediction-based Runtime Power Management for GPGPU using Knowledge Transfer. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 359-370. DOI: 10.1145/3626183.3659981. Online publication date: 17-Jun-2024
  • (2023) Aggressive SRAM Voltage Scaling and Error Mitigation for Approximate DNN Inference. Proceedings of the 2nd Workshop on Smart Wearable Systems and Applications, pp. 28-34. DOI: 10.1145/3615592.3616852. Online publication date: 6-Oct-2023
  • (2023) Unity ECC: Unified Memory Protection Against Bit and Chip Errors. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. DOI: 10.1145/3581784.3607081. Online publication date: 12-Nov-2023
  • (2021) Comparing LLC-Memory Traffic between CPU and GPU Architectures. 2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA), pp. 8-16. DOI: 10.1109/RSDHA54838.2021.00007. Online publication date: Nov-2021
  • (2020) Footprint-Aware Power Capping for Hybrid Memory Based Systems. High Performance Computing, pp. 347-369. DOI: 10.1007/978-3-030-50743-5_18. Online publication date: 15-Jun-2020
  • (2016) Measuring and modeling on-chip interconnect power on real hardware. 2016 IEEE International Symposium on Workload Characterization (IISWC), pp. 1-11. DOI: 10.1109/IISWC.2016.7581263. Online publication date: Sep-2016
  • (2024) Improving GPU Energy Efficiency through an Application-transparent Frequency Scaling Policy with Performance Assurance. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 769-785. DOI: 10.1145/3627703.3629584. Online publication date: 22-Apr-2024
  • (2024) DRLCAP: Runtime GPU Frequency Capping With Deep Reinforcement Learning. IEEE Transactions on Sustainable Computing 9:5, pp. 712-726. DOI: 10.1109/TSUSC.2024.3362697. Online publication date: Sep-2024
  • (2024) Analysis of Energy-Efficient LCRM Optimization Algorithm in Computer Vision-based CNNs. 2024 IEEE 8th Energy Conference (ENERGYCON), pp. 1-6. DOI: 10.1109/ENERGYCON58629.2024.10488814. Online publication date: 4-Mar-2024
  • (2024) A Highly Parallel DRAM Architecture to Mitigate Large Access Latency and Improve Energy Efficiency of Modern DRAM Systems. IEEE Access 12, pp. 182998-183023. DOI: 10.1109/ACCESS.2024.3512176. Online publication date: 2024
