Abstract
A common way of representing processor performance is to use Cycles per Instruction (CPI) "stacks", which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because execution overlaps with miss events (cache misses, TLB misses, and branch mispredictions) and miss events overlap with one another. This paper shows that meaningful and accurate CPI stacks can be computed for superscalar out-of-order processors. Using interval analysis, a novel method for analyzing out-of-order processor performance, we gain insight into the performance impact of the various miss events. Based on this insight, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware needed to implement these counters is limited and comparable to that of existing hardware performance counter architectures, while the resulting CPI stacks are significantly more accurate than those of previous approaches.
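To make the notion of a CPI stack concrete, the following is a minimal sketch of the arithmetic only, not the counter architecture proposed in the paper. It assumes each cycle has already been attributed to exactly one category (base or a single miss event), which is precisely the part that is hard on an out-of-order processor. The function names, event names, and counter values are hypothetical and chosen for illustration.

```python
from typing import Dict


def cpi_stack(instructions: int, base_cycles: int,
              miss_cycles: Dict[str, int]) -> Dict[str, float]:
    """Build a CPI stack from cycle counts that are assumed to be disjoint.

    Each component is the number of cycles attributed to that category
    divided by the number of retired instructions.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    stack = {"base": base_cycles / instructions}
    for event, cycles in miss_cycles.items():
        stack[event] = cycles / instructions
    return stack


def total_cpi(stack: Dict[str, float]) -> float:
    """The overall CPI is the sum of all components in the stack."""
    return sum(stack.values())


# Hypothetical counter readings (illustrative values, not measurements).
example = cpi_stack(
    instructions=1_000_000,
    base_cycles=500_000,
    miss_cycles={
        "l2_dcache_miss": 250_000,
        "branch_mispredict": 125_000,
        "dtlb_miss": 125_000,
    },
)

for component, cpi in example.items():
    print(f"{component:>18}: {cpi:.3f}")
print(f"{'total':>18}: {total_cpi(example):.3f}")
```

The components sum to the overall CPI only because the input cycle counts are disjoint. If overlapping miss events are each charged their full latency, the components no longer add up to the measured CPI, which is the accuracy problem the abstract describes.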
A performance counter architecture for computing accurate CPI components
ASPLOS XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems