Abstract
General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings achievable with dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. Finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.
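For intuition about why DVFS and lane-level clock gating save energy, the following minimal sketch uses the textbook CMOS relation P = a * C * V^2 * f. It is not the GPUWattch model: the function names, the 32-lane default, and all numeric values are hypothetical, and static power is set to zero to keep the arithmetic simple.

```python
# Illustrative only: textbook first-order power relations, not GPUWattch itself.
# All names and numbers below are assumptions chosen for the example.

def dynamic_power(activity, capacitance, voltage, frequency):
    """Classic CMOS dynamic power in watts: P = a * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


def kernel_energy(cycles, activity, capacitance, voltage, frequency,
                  static_power=0.0):
    """Energy in joules of a kernel that needs a fixed number of cycles."""
    runtime = cycles / frequency  # a lower clock stretches the runtime
    power = dynamic_power(activity, capacitance, voltage, frequency) + static_power
    return power * runtime


def gated_dynamic_power(full_power, active_lanes, total_lanes=32):
    """Crude assumption: gated lanes draw no dynamic power at all."""
    return full_power * active_lanes / total_lanes


# Hypothetical operating points: nominal versus a scaled-down voltage and clock.
baseline = kernel_energy(cycles=1e9, activity=0.5, capacitance=1e-9,
                         voltage=1.0, frequency=700e6)
scaled = kernel_energy(cycles=1e9, activity=0.5, capacitance=1e-9,
                       voltage=0.9, frequency=560e6)
savings = 1.0 - scaled / baseline

print(f"dynamic energy saved by voltage scaling: {savings:.1%}")
print(f"power with 16 of 32 lanes active: {gated_dynamic_power(10.0, 16)} W")
```

With a fixed cycle count and no static power, the frequency cancels out of the energy, so the saving in this sketch comes entirely from the V^2 term. A cycle-level model is needed to capture static power, runtime stretch, and per-component activity, which this sketch ignores.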
GPUWattch: enabling energy optimizations in GPGPUs
ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture