ABSTRACT
Massively parallel computation in GPUs significantly boosts performance of compute-intensive applications but creates power and thermal issues that limit further performance scaling. This paper demonstrates significant GPGPU power savings by relaxing application accuracy requirements and enabling the use of low power imprecise hardware (IHW). A synthesized set of novel imprecise floating point arithmetic units is presented. GPGPU-Sim and GPUWattch are used to estimate impacts of IHW units on output quality and system-level power consumption, providing a quality-power tradeoff model for application-specific optimization. Experimental results for a 45 nm process show up to 32% power savings with negligible impacts on output quality.
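The quality side of the quality-power tradeoff can be illustrated in software. The sketch below is not one of the paper's IHW units: it assumes a simple stand-in in which imprecision is emulated by zeroing the low mantissa bits of a single-precision value, and the names `truncate_mantissa`, `imprecise_mul`, and `relative_error` are hypothetical. It models output error only and gives no estimate of power.

```python
import struct


def truncate_mantissa(x: float, kept_bits: int) -> float:
    """Round x to float32, then keep only the top `kept_bits` of its 23 mantissa bits.

    Assumption: truncation toward zero stands in for an imprecise unit;
    the paper's actual hardware designs are not reproduced here.
    """
    if not 0 <= kept_bits <= 23:
        raise ValueError("kept_bits must be in the range 0..23")
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Keep sign, exponent, and the top kept_bits of the mantissa.
    mask = (0xFFFFFFFF << (23 - kept_bits)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


def imprecise_mul(a: float, b: float, kept_bits: int) -> float:
    """Multiply with truncated operands and a truncated result (illustrative only)."""
    product = truncate_mantissa(a, kept_bits) * truncate_mantissa(b, kept_bits)
    return truncate_mantissa(product, kept_bits)


def relative_error(exact: float, approx: float) -> float:
    """Relative error of approx with respect to a nonzero exact value."""
    return abs(approx - exact) / abs(exact)


if __name__ == "__main__":
    a, b = 3.14159, 2.71828
    for kept_bits in (23, 16, 12, 8, 4):
        err = relative_error(a * b, imprecise_mul(a, b, kept_bits))
        print(f"kept_bits={kept_bits:2d}  relative error={err:.3e}")
```

Sweeping `kept_bits` in this way gives one axis of a tradeoff curve. The other axis, power per configuration, would have to come from synthesis results or a power model such as GPUWattch, which this sketch does not attempt.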
Index Terms
Low Power GPGPU Computation with Imprecise Hardware