
A performance counter architecture for computing accurate CPI components

Authors:
Stijn Eyerman (Ghent University),
Lieven Eeckhout (Ghent University),
Tejas Karkhanis (University of Wisconsin-Madison),
James E. Smith (University of Wisconsin-Madison)

ACM SIGPLAN Notices, Volume 41, Issue 11, November 2006, pp. 175–184. https://doi.org/10.1145/1168918.1168880

Published: 20 October 2006


Abstract

A common way of representing processor performance is to use Cycles per Instruction (CPI) 'stacks', which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because of various overlaps among execution and miss events (cache misses, TLB misses, and branch mispredictions).

This paper shows that meaningful and accurate CPI stacks can be computed for superscalar out-of-order processors. Using interval analysis, a novel method for analyzing out-of-order processor performance, we gain insight into the performance impact of the various miss events. Based on this understanding, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware for implementing these counters is limited and comparable to existing hardware performance counter architectures, while the resulting CPI stacks are significantly more accurate than those of previous approaches.
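As a rough illustration of the CPI stack idea described in the abstract, the sketch below divides per-component cycle counts by the number of retired instructions, so that the components sum to the overall CPI. The function name `cpi_stack`, the component names, and all numbers are hypothetical and chosen only for illustration. The sketch assumes each cycle has already been attributed to exactly one component, which is precisely the hard part on an out-of-order processor and the problem the proposed counter architecture addresses; it does not implement that attribution.

```python
from typing import Dict


def cpi_stack(cycles_by_component: Dict[str, int], instructions: int) -> Dict[str, float]:
    """Turn per-component cycle counts into per-component CPI values.

    Assumes every cycle was attributed to exactly one component, so the
    returned values sum to the overall CPI.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return {name: cycles / instructions for name, cycles in cycles_by_component.items()}


# Hypothetical counter readings (illustrative values, not measured data).
counters = {
    "base": 500_000,               # cycles with no miss event outstanding
    "l2_dcache_miss": 300_000,     # cycles attributed to long-latency loads
    "branch_mispredict": 150_000,  # cycles attributed to mispredicted branches
    "icache_miss": 50_000,         # cycles attributed to instruction cache misses
}

stack = cpi_stack(counters, instructions=1_000_000)
total_cpi = sum(stack.values())

for name, cpi in stack.items():
    print(f"{name:>18}: {cpi:.3f}")
print(f"{'total':>18}: {total_cpi:.3f}")
```

Because the components are additive by construction, removing one miss event source entirely would, under this simple model, lower the total CPI by that component's value.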


Index Terms

1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies
2. General and reference
  1. Cross-computing tools and techniques
    1. Measurement
    2. Metrics


Published in
ACM SIGPLAN Notices Volume 41, Issue 11
Proceedings of the 2006 ASPLOS Conference
November 2006
425 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1168918
ASPLOS XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
October 2006
440 pages
ISBN: 1595934510
DOI: 10.1145/1168857
General Chair: John Paul Shen, Intel Corp.
Program Chair: Margaret R. Martonosi, Princeton University
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 October 2006
Author Tags
hardware performance counter architecture
superscalar processor performance modeling
Qualifiers
- article

Article Metrics
- Total Citations: 174
- Total Downloads: 2,233
- Downloads (Last 12 months): 155
- Downloads (Last 6 weeks): 14