Abstract
A mechanistic model for out-of-order superscalar processors is developed and then applied to the study of microarchitecture resource scaling. The model divides execution time into intervals separated by disruptive miss events such as branch mispredictions and cache misses. Each type of miss event results in characterizable performance behavior for the execution time interval. By considering an interval's type and length (measured in instructions), execution time can be predicted for the interval. Overall execution time is then determined by aggregating the execution time over all intervals. The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7%.
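The interval aggregation described above can be illustrated with a minimal sketch. The abstract states only that an interval's type and length determine its execution time. The specific formula used here (instructions divided by dispatch width, plus a fixed penalty per miss type), the penalty values, and the names `PENALTY_CYCLES`, `interval_cycles`, and `total_cycles` are all assumptions for illustration, not the paper's actual equations.

```python
from math import ceil

# Hypothetical penalties in cycles per miss-event type.
# These are placeholder values, not numbers taken from the paper.
PENALTY_CYCLES = {
    "branch_mispredict": 15,
    "icache_miss": 10,
    "long_dcache_miss": 200,
}


def interval_cycles(length, miss_type, dispatch_width=4):
    """Estimate cycles for one interval.

    Assumed form: steady-state dispatch of `length` instructions at
    `dispatch_width` per cycle, plus a fixed penalty for the miss event
    that terminates the interval.
    """
    base = ceil(length / dispatch_width)
    return base + PENALTY_CYCLES[miss_type]


def total_cycles(intervals, dispatch_width=4):
    """Aggregate the per-interval estimates over the whole execution."""
    return sum(interval_cycles(n, t, dispatch_width) for n, t in intervals)


# Each interval is (length in instructions, terminating miss-event type).
intervals = [
    (40, "branch_mispredict"),
    (120, "long_dcache_miss"),
    (8, "icache_miss"),
]
print(total_cycles(intervals))
```

The point of the sketch is the structure: execution time is a sum over intervals, and each term depends only on the interval's type and length.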
The mechanistic model is applied to the general problem of resource scaling in out-of-order superscalar processors. First, we use the model to determine size relationships among microarchitecture structures in a balanced processor design. Second, we use the mechanistic model to study scaling of both pipeline depth and width in balanced processor designs. We corroborate previous results in this area and provide new results. For example, we show that at optimal design points, the pipeline depth times the square root of the processor width is nearly constant. Finally, we consider the behavior of unbalanced, overprovisioned processor designs based on insight gained from the mechanistic model. We show that in certain situations an overprovisioned processor may lead to improved overall performance. Designs where a processor's dispatch width is wider than its issue width are of particular interest.
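The depth-width relationship stated above (pipeline depth times the square root of processor width is nearly constant at optimal design points) can be turned into a small worked example. The constant `k` below is a hypothetical value chosen only to make the arithmetic visible; the abstract does not give its magnitude, and `optimal_depth` is an illustrative helper rather than anything defined in the paper.

```python
from math import sqrt


def optimal_depth(width, k):
    """Depth implied by depth * sqrt(width) ~= k.

    `k` is an assumed constant; the relation is only approximate.
    """
    return k / sqrt(width)


K = 40.0  # hypothetical constant, for illustration only

for width in (1, 4, 16):
    print(width, optimal_depth(width, K))
```

Under this relation, quadrupling the width halves the implied optimal depth, whatever the actual value of the constant.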