research-article

GPUWattch: enabling energy optimizations in GPGPUs

Published: 23 June 2013

Abstract

General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings achievable with dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. Finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.
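To make the DVFS claim concrete, the following is a minimal sketch of the textbook CMOS dynamic-power relation, P = a · C · V² · f, and of why lowering voltage along with frequency can save energy for a fixed amount of work. It is not the GPUWattch model, which is far more detailed; the function names, activity factor, capacitance, and operating points are all hypothetical values chosen for illustration, and static (leakage) power and memory-bound behavior are ignored.

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """Textbook CMOS dynamic power in watts: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def dynamic_energy(activity, capacitance_f, voltage_v, frequency_hz, cycles):
    """Dynamic energy in joules for a fixed cycle count: E = P * (cycles / f)."""
    runtime_s = cycles / frequency_hz
    power_w = dynamic_power(activity, capacitance_f, voltage_v, frequency_hz)
    return power_w * runtime_s


def dvfs_energy_saving(v_base, f_base, v_scaled, f_scaled,
                       cycles=1_000_000, activity=0.5, capacitance_f=1e-9):
    """Fractional dynamic-energy saving of a scaled operating point.

    All default parameter values are hypothetical placeholders.
    """
    base = dynamic_energy(activity, capacitance_f, v_base, f_base, cycles)
    scaled = dynamic_energy(activity, capacitance_f, v_scaled, f_scaled, cycles)
    return 1.0 - scaled / base


if __name__ == "__main__":
    # Hypothetical operating points: 1.0 V at 700 MHz versus 0.9 V at 600 MHz.
    saving = dvfs_energy_saving(1.0, 700e6, 0.9, 600e6)
    print(f"dynamic energy saving: {saving:.1%}")
```

Under these assumptions the frequency cancels out of the energy expression, so the saving depends only on the squared voltage ratio. Real savings, such as those reported in the abstract, also depend on leakage, runtime changes, and workload behavior that this sketch does not capture.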



      • Published in

        ACM SIGARCH Computer Architecture News, Volume 41, Issue 3
        ISCA '13
        June 2013
        666 pages
        ISSN: 0163-5964
        DOI: 10.1145/2508148

        ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
        June 2013
        686 pages
        ISBN: 9781450320795
        DOI: 10.1145/2485922

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States
