Abstract
Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to a particular computation, such as customized storage and functional units.
The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (hundreds of fJ in 90 nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing hundreds of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.
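The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the stage gains compose multiplicatively and that the "additional 25x" applies to energy as well as performance, as the abstract's wording suggests. The constant and function names are hypothetical and do not come from the paper.

```python
# Figures quoted in the abstract, all relative to the baseline four-processor CMP.
ASIC_ENERGY_GAIN = 500.0      # ASIC energy efficiency vs. baseline CMP
BROAD_ENERGY_GAIN = 7.0       # broadly applicable optimizations (e.g. SIMD)
BROAD_PERF_GAIN = 10.0
SPECIFIC_GAIN = 25.0          # algorithm-specific units, performance and energy


def combined_gain(*stage_gains: float) -> float:
    """Multiply per-stage improvement factors (assumes the stages compose)."""
    total = 1.0
    for gain in stage_gains:
        total *= gain
    return total


def remaining_gap(target_gain: float, achieved_gain: float) -> float:
    """Factor by which the customized CMP still trails the target."""
    return target_gain / achieved_gain


energy_gain = combined_gain(BROAD_ENERGY_GAIN, SPECIFIC_GAIN)
perf_gain = combined_gain(BROAD_PERF_GAIN, SPECIFIC_GAIN)
energy_gap = remaining_gap(ASIC_ENERGY_GAIN, energy_gain)

print(f"combined energy gain: {energy_gain:.0f}x")
print(f"combined performance gain: {perf_gain:.0f}x")
print(f"remaining energy gap to ASIC: {energy_gap:.2f}x")
```

Under these assumptions the combined energy gain is 7 x 25 = 175x, which leaves a gap of 500 / 175, a little under 3x, in line with the "within 3x of its energy" claim.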
Understanding sources of inefficiency in general-purpose chips
ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture