skip to main content
article
Open access

Toward kilo-instruction processors

Published: 01 December 2004 Publication History

Abstract

The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may require supporting many hundreds, or even thousands, of in-flight instructions. Unfortunately, the traditional approach of scaling up critical processor structures to provide such support is impractical at these levels, due to area, power, and cycle time constraints.In this paper we show that, in order to overcome this resource-scalability problem, the way in which critical processor resources are managed must be changed. Instead of simply upsizing the processor structures, we propose a smarter use of the available resources, supported by a selective checkpointing mechanism. This mechanism allows instructions to commit out of order, and makes a reorder buffer unnecessary. We present a set of techniques such as multilevel instruction queues, late allocation and early release of registers, and early release of load/store queue entries. All together, these techniques constitute what we call a kilo-instruction processor, an architecture that can support thousands of in-flight instructions, and thus may achieve high performance even in the presence of large memory access latencies.

References

[1]
Akkary, H., Rajwar, R., and Srinivasan, S. T. 2003a. Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proceedings of the 36th International Symposium on Microarchitecture (San Diego, CA). 423--434.
[2]
Akkary, H., Rajwar, R., and Srinivasan, S. T. 2003b. Checkpoint processing and recovery: An efficient, scalable alternative to reorder buffers. IEEE Micro 23, 6, 11--19.
[3]
Baer, J.-L. and Chen, T.-F. 1991. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing'91 (Albuquerque, NM). 176--186.
[4]
Brekelbaum, E., Rupley, J., Wilkerson, C., and Black, B. 2002. Hierarchical scheduling windows. In Proceedings of the 35th International Symposium on Microarchitecture (Istanbul, Turkey). 27--36.
[5]
Bloom, B. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7, 422--426.
[6]
Cain, H. W. and Lipasti, M. H. 2004. Memory ordering: A value-based approach. In Proceedings of the 31st International Symposium on Computer Architecture (Munich, Germany). 90--101.
[7]
Collins, J. D., Tullsen, D. M., Wang, H., and Shen, J. P. 2001. Dynamic speculative precomputation. In Proceedings of the 34th International Symposium on Microarchitecture (Austin, TX). 306--317.
[8]
Chapell, R., Stark, J., Kim, S., Reinhardt, S., and Patt, Y. N. 1999. Simultaneous subordinate microthreading (SSMT). In Proceedings of the 26th International Symposium on Computer Architecture (Atlanta, GA). 186--195.
[9]
Chrysos, G. Z. and Emer, J. S. 1998. Memory dependence prediction using store sets. In Proceedings of the 25th International Symposium on Computer Architecture (Barcelona, Spain). 142--153.
[10]
Cristal, A., Valero, M., Gonzalez, A., and Llosa, J. 2002a. White paper: Grant proposal to Intel-MRL in January 2002. Universitat Politècnica de Catalunya, Barcelona, Spain.
[11]
Cristal, A., Valero, M., Gonzalez, A., and Llosa, J. 2002b. Large Virtual ROBs by Processor Checkpointing. Technical Report UPC-DAC-2002-39, Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain. Also submitted to the 35th International Symposium on Microarchitecture, in 2002.
[12]
Cristal, A., Martínez, J. F., Llosa, J., and Valero, M. 2003a. A case for resource-conscious out-of-order processors. IEEE TCCA Computer Architecture Letters 2.
[13]
Cristal, A., Ortega, D., Llosa, J., and Valero, M. 2003b. Kilo-instruction processors. In Proceedings of the 5th International Symposium on High-Performance Computing (Tokyo, Japan). 10--25. Keynote paper.
[14]
Cristal, A., Martínez, J. F., Llosa, J., and Valero, M. 2003c. Ephemeral Registers with Multicheckpointing. Technical Report UPC-DAC-2003-51, Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain.
[15]
Cristal, A., Ortega, D., Llosa, J., and Valero, M. 2004a. Out-of-order commit processors. In Proceedings of the 10th International Symposium on High-Performance Computer Architecture (Madrid, Spain). 48--59.
[16]
Cristal, A., Santana, O. J., and Valero, M. 2004b. Maintaining thousands of in-flight instructions. In Proceedings of the Euro-Par Conference (Pisa, Italy). Keynote paper.
[17]
Dubois, M. and Song, Y. 1998. Assisted Execution. Technical Report CENG 98--25, Department of EE-Systems, University of Southern California, Los Angeles, CA.
[18]
Dundas, J. and Mudge, T. 1997. Improving data cache performance by pre-executing instructions under a cache miss. In Proceedings of the 9th International Conference on Supercomputing (Vienna, Austria). 68--75.
[19]
Galluzzi, M., Puente, V., Cristal, A., Beivide, R., Gregorio, J. A., and Valero, M. 2004. A first glance at kilo-instruction based multiprocessors. In Proceedings of the 1st Conference on Computer Frontiers (Ischia, Italy).
[20]
Hwu, W. M. and Patt, Y. N. 1987. Checkpoint repair for out-of-order execution machines. In Proceedings of the 14th International Symposium on Computer Architecture (Pittsburgh, PA). 18--26.
[21]
Joseph, D. and Grunwald, D. 1997. Prefetching using Markov predictors. In Proceedings of the 24th International Symposium on Computer Architecture (Denver, CO). 252--263.
[22]
Klaiber, A. and Levi, H. 1991. An architecture for software-controlled data prefetching. In Proceedings of the 18th International Symposium on Computer Architecture (Toronto, Canada). 53--53.
[23]
Lebeck, A., Koppanalil, T., Li, T., Patwardhan, J., and Rotenberg, E. 2002. A large, fast instruction window for tolerating cache misses. In Proceedings of the 29th International Symposium on Computer Architecture (Anchorage, AK). 59--70.
[24]
Martínez, J. F., Renau, J., Huang, M., Prvulovic, M., and Torrellas, J. 2002. Checkpointed early resource recycling in out-of-order microprocessors. In Proceedings of the 35th International Symposium on Microarchitecture (Istanbul, Turkey). 3--14.
[25]
Martínez, J. F., Cristal, A., Valero, M., and Llosa, J. 2003. Ephemeral Registers. Technical Report CSL-TR-2003-1035, Cornell Computer Systems Lab, Ithaca, NY.
[26]
Monreal, T., Gonzalez, A., Valero, M., Gonzalez, J., and Vi NALS, V. 1999. Delaying physical register allocation through virtual-physical registers. In Proceedings of the 32nd International Symposium on Microarchitecture (Haifa, Israel). 186--192.
[27]
Mowry, T., Lam, M., and Gupta, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, MA). 62--73.
[28]
Moudgill, M., Pingali, K., and Vassiliadis, S. 1993. Register renaming and dynamic speculation: An alternative approach. In Proceedings of the 26th International Symposium on Microarchitecture (Austin, TX). 202--213.
[29]
Mutlu, O., Stark, J., Wilkerson, C., and Patt, Y. N. 2003a. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (Anaheim, CA). 129--140.
[30]
Mutlu, O., Stark, J., Wilkerson, C., and Patt, Y. N. 2003b. Runahead execution: An effective alternative to large instruction windows. IEEE Micro 23, 6, 20--25.
[31]
Palacharla, S., Jouppi, N., and Smith, J. 1997. Complexity-effective superscalar processors. In Proceedings of the 24th International Symposium on Computer Architecture (Denver, CO). 206--218.
[32]
Park, I., Ooi, C., and Vijaykumar, T. 2003. Reducing design complexity of the load/store queue. In Proceedings of the 36th International Symposium on Microarchitecture (San Diego, CA). 411--422.
[33]
Roth, A. and Sohi, G. S. 2001. Speculative data-driven multithreading. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (Nuevo Leone, Mexico). 37--48.
[34]
Sethumadhavan, S., Desikan, R., Burger, D., Moore, C., and Keckler, S. 2003. Scalable hardware memory disambiguation for high ILP processors. In Proceedings of the 36th International Symposium on Microarchitecture (San Diego, CA). 399--410.
[35]
Sherwood, T., Perelman, E., and Calder, B. 2001. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Barcelona, Spain). 3--14.
[36]
Smith, A. 1982. Cache memories. Computing Surveys 14, 3, 473--530.
[37]
Smith, J. E. and Pleszkun, A. R. 1985. Implementation of precise interrupts in pipelined processors. In Proceedings of the 12th International Symposium on Computer Architecture (Boston, MA). 36--44.
[38]
Sohilin, Y., Lee, J., and Torrellas, J. 2002. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th International Symposium on Computer Architecture (Anchorage, AK). 171--182.
[39]
Srinivasan, S. T., Rajwar, R., Akkary, H., Gandhi, A., and Upton, M. 2004. Continual flow pipelines. In Proceedings of the 11st International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, MA). 107--119.
[40]
Tendler, J. M., Dodson, S., Fields, S., Le, H., and Sinharoy, B. 2001. IBM @server POWER4 System Microarchitecture. IBM Technical White Paper. IBM Server Group.
[41]
Zilles, C. and Sohi, G. S. 2001. Execution-based prediction using speculative slices. In Proceedings of the 28th International Symposium on Computer Architecture (Göteborg, Sweden). 2--13.

Cited By

View all
  • (2023)Speculative Register Reclamation2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071122(1182-1194)Online publication date: Feb-2023
  • (2019)Efficient Data Supply for Parallel Heterogeneous ArchitecturesACM Transactions on Architecture and Code Optimization10.1145/331033216:2(1-23)Online publication date: 26-Apr-2019
  • (2019)OverCome: Coarse-Grained Instruction Commit with Handover Register RenamingIEEE Transactions on Computers10.1109/TC.2019.293655768:12(1802-1816)Online publication date: 1-Dec-2019
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization
ACM Transactions on Architecture and Code Optimization  Volume 1, Issue 4
December 2004
107 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/1044823
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2004
Published in TACO Volume 1, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Memory wall
  2. instruction-level parallelism
  3. kilo-instruction processors
  4. multicheckpointing

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)120
  • Downloads (Last 6 weeks)22
Reflects downloads up to 01 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Speculative Register Reclamation2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071122(1182-1194)Online publication date: Feb-2023
  • (2019)Efficient Data Supply for Parallel Heterogeneous ArchitecturesACM Transactions on Architecture and Code Optimization10.1145/331033216:2(1-23)Online publication date: 26-Apr-2019
  • (2019)OverCome: Coarse-Grained Instruction Commit with Handover Register RenamingIEEE Transactions on Computers10.1109/TC.2019.293655768:12(1802-1816)Online publication date: 1-Dec-2019
  • (2019)Maximizing Limited ResourcesJournal of Signal Processing Systems10.1007/s11265-018-1369-491:3-4(379-397)Online publication date: 1-Mar-2019
  • (2018)A two-phase recovery mechanismProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205300(107-117)Online publication date: 12-Jun-2018
  • (2017)Non-Speculative Load-Load Reordering in TSOACM SIGARCH Computer Architecture News10.1145/3140659.308022045:2(187-200)Online publication date: 24-Jun-2017
  • (2017)Non-Speculative Load-Load Reordering in TSOProceedings of the 44th Annual International Symposium on Computer Architecture10.1145/3079856.3080220(187-200)Online publication date: 24-Jun-2017
  • (2017)Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous ArchitecturesACM Transactions on Architecture and Code Optimization10.1145/307562014:2(1-27)Online publication date: 28-Jun-2017
  • (2017)Exploring the Performance Limits of Out-of-order CommitProceedings of the Computing Frontiers Conference10.1145/3075564.3075581(211-220)Online publication date: 15-May-2017
  • (2017)Dirty-Block Tracking in a Direct-Mapped DRAM Cache with Self-Balancing DispatchACM Transactions on Architecture and Code Optimization10.1145/306846014:2(1-25)Online publication date: 10-May-2017
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media