
A performance counter architecture for computing accurate CPI components

Published: 20 October 2006
Abstract

A common way of representing processor performance is to use Cycles per Instruction (CPI) "stacks", which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because of various overlaps among execution and miss events (cache misses, TLB misses, and branch mispredictions).

This paper shows that meaningful and accurate CPI stacks can be computed for superscalar out-of-order processors. Using interval analysis, a novel method for analyzing out-of-order processor performance, we gain insight into the performance impact of the various miss events. Based on this understanding, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware needed to implement these counters is limited and comparable to that of existing hardware performance counter architectures, while the resulting CPI stacks are significantly more accurate than those of previous approaches.
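To make the notion of a CPI stack concrete, the following is a minimal sketch of the bookkeeping only. It assumes the cycles lost to each miss event have already been attributed without double counting, and that attribution is precisely the hard part on an out-of-order processor that the paper addresses. The function name `cpi_stack`, the event names, and all counter values are hypothetical and do not come from the paper.

```python
def cpi_stack(instructions, total_cycles, penalty_cycles):
    """Build a CPI stack: a baseline CPI plus one component per miss event.

    penalty_cycles maps a (hypothetical) miss-event name to the number of
    cycles attributed to that event. The attribution itself is assumed to
    be given; this function only normalizes by the instruction count.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    attributed = sum(penalty_cycles.values())
    if attributed > total_cycles:
        raise ValueError("attributed penalty cycles exceed total cycles")

    # Whatever is not attributed to a miss event is the baseline component.
    stack = {"base": (total_cycles - attributed) / instructions}
    for event, cycles in penalty_cycles.items():
        stack[event] = cycles / instructions
    return stack


# Hypothetical counter readings, chosen only for illustration.
example = cpi_stack(
    instructions=1_000_000,
    total_cycles=1_500_000,
    penalty_cycles={
        "l2_dcache_miss": 400_000,
        "branch_mispredict": 250_000,
        "dtlb_miss": 50_000,
    },
)

for component, cpi in example.items():
    print(f"{component:>18}: {cpi:.2f}")
print(f"{'total CPI':>18}: {sum(example.values()):.2f}")
```

By construction the components sum to the overall CPI (total cycles divided by instructions), which is what makes the stack readable as a breakdown of where time goes.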


          Published in

          ACM SIGPLAN Notices, Volume 41, Issue 11
          Proceedings of the 2006 ASPLOS Conference
          November 2006
          425 pages
          ISSN: 0362-1340
          EISSN: 1558-1160
          DOI: 10.1145/1168918

          ASPLOS XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems
          October 2006
          440 pages
          ISBN: 1595934510
          DOI: 10.1145/1168857

          Copyright © 2006 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher: Association for Computing Machinery, New York, NY, United States
