
Understanding sources of inefficiency in general-purpose chips

Published: 19 June 2010

Abstract

Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost-effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to a particular computation, such as customized storage and functional units.

The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (hundreds of fJ in 90 nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing hundreds of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.
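
The headline figures above can be checked against each other with a little arithmetic. The sketch below is only a sanity check on the quoted numbers, not the paper's methodology: it assumes that the broadly applicable gains and the algorithm-specific gains compose multiplicatively, and the constant and function names are made up for illustration.

```python
# Figures quoted in the abstract, all relative to the original four-processor CMP.
ASIC_ENERGY_ADVANTAGE = 500.0  # ASIC is 500x more energy efficient than the CMP
BROAD_PERF_GAIN = 10.0         # broadly applicable optimizations: performance
BROAD_ENERGY_GAIN = 7.0        # broadly applicable optimizations: energy
SPECIFIC_GAIN = 25.0           # additional gain from algorithm-specific units


def combined_gain(*factors: float) -> float:
    """Multiply improvement factors (assumption: gains compose multiplicatively)."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result


# Total improvement of the customized CMP over the original CMP.
total_perf_gain = combined_gain(BROAD_PERF_GAIN, SPECIFIC_GAIN)
total_energy_gain = combined_gain(BROAD_ENERGY_GAIN, SPECIFIC_GAIN)

# Energy gap that remains between the customized CMP and the ASIC.
remaining_energy_gap = ASIC_ENERGY_ADVANTAGE / total_energy_gain

if __name__ == "__main__":
    print(f"performance gain: {total_perf_gain:.0f}x")
    print(f"energy gain: {total_energy_gain:.0f}x")
    print(f"remaining energy gap to ASIC: {remaining_energy_gap:.2f}x")
```

Under that assumption the combined energy gain is 7 x 25 = 175x, which leaves a gap of 500 / 175, or roughly 2.9x, to the ASIC. That is consistent with the "within 3x of its energy" claim.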


Published in

ACM SIGARCH Computer Architecture News, Volume 38, Issue 3 (ISCA '10)
June 2010, 508 pages
ISSN: 0163-5964
DOI: 10.1145/1816038

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
June 2010, 520 pages
ISBN: 9781450300537
DOI: 10.1145/1815961

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States
