
A mechanistic performance model for superscalar out-of-order processors

Published: 29 May 2009

Abstract

A mechanistic model for out-of-order superscalar processors is developed and then applied to the study of microarchitecture resource scaling. The model divides execution time into intervals separated by disruptive miss events such as branch mispredictions and cache misses. Each type of miss event results in characterizable performance behavior for the execution time interval. By considering an interval's type and length (measured in instructions), execution time can be predicted for the interval. Overall execution time is then determined by aggregating the execution time over all intervals. The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7%.
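The aggregation described above can be sketched in a few lines. This is only an illustration of the interval idea, not the paper's actual equations: the `Interval` type, the event names, the fixed penalties in `PENALTY_CYCLES`, and the "length divided by dispatch width" base cost are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass

# Hypothetical per-event penalties in cycles. These numbers are illustrative
# placeholders, not values taken from the paper.
PENALTY_CYCLES = {
    "branch_mispredict": 15,
    "icache_miss": 10,
    "long_dcache_miss": 200,
    "none": 0,  # final interval that ends without a miss event
}


@dataclass(frozen=True)
class Interval:
    length: int      # number of instructions in the interval
    miss_event: str  # type of the miss event that ends the interval


def interval_cycles(interval: Interval, dispatch_width: int) -> int:
    """Estimate cycles for one interval from its type and length.

    Assumed simplification: instructions flow at the full dispatch width
    between miss events, and each miss event adds a fixed penalty.
    """
    base = -(-interval.length // dispatch_width)  # ceiling division
    return base + PENALTY_CYCLES[interval.miss_event]


def total_cycles(intervals, dispatch_width: int = 4) -> int:
    """Aggregate the per-interval estimates into an overall execution time."""
    return sum(interval_cycles(i, dispatch_width) for i in intervals)


intervals = [
    Interval(40, "branch_mispredict"),
    Interval(100, "long_dcache_miss"),
    Interval(8, "none"),
]
cycles = total_cycles(intervals)
instructions = sum(i.length for i in intervals)
print(f"cycles={cycles}, CPI={cycles / instructions:.2f}")
```

In a real model the penalty for each event type would itself depend on the interval and on the microarchitecture rather than being a constant, which is the part this sketch deliberately leaves out.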

The mechanistic model is applied to the general problem of resource scaling in out-of-order superscalar processors. First, we use the model to determine size relationships among microarchitecture structures in a balanced processor design. Second, we use the mechanistic model to study scaling of both pipeline depth and width in balanced processor designs. We corroborate previous results in this area and provide new results. For example, we show that at optimal design points, the pipeline depth times the square root of the processor width is nearly constant. Finally, we consider the behavior of unbalanced, overprovisioned processor designs based on insight gained from the mechanistic model. We show that in certain situations an overprovisioned processor may lead to improved overall performance. Designs where a processor's dispatch width is wider than its issue width are of particular interest.
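To make the depth-width relationship concrete, the sketch below treats the product of pipeline depth and the square root of processor width as exactly constant and solves for depth. The constant `k` and the function name `implied_optimal_depth` are hypothetical, and the abstract only says the product is nearly constant, so the output shows the shape of the trend rather than predicted design points.

```python
import math


def implied_optimal_depth(width: int, k: float) -> float:
    """Depth implied by depth * sqrt(width) == k.

    k is a hypothetical constant for illustration; the abstract gives no value.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return k / math.sqrt(width)


K = 40.0  # assumed value, chosen only so the numbers are easy to read
for width in (2, 4, 8):
    print(f"width={width}: depth ~ {implied_optimal_depth(width, K):.1f} stages")
```

Under this assumption, quadrupling the width halves the implied optimal depth.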

