article

Open access

Toward kilo-instruction processors

Authors:

Adrián Cristal,

Oliverio J. Santana,

José F. MartínezAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 1, Issue 4

Pages 389 - 417

https://doi.org/10.1145/1044823.1044825

Published: 01 December 2004 Publication History

Abstract

The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may require supporting many hundreds, or even thousands, of in-flight instructions. Unfortunately, the traditional approach of scaling up critical processor structures to provide such support is impractical at these levels, due to area, power, and cycle time constraints.In this paper we show that, in order to overcome this resource-scalability problem, the way in which critical processor resources are managed must be changed. Instead of simply upsizing the processor structures, we propose a smarter use of the available resources, supported by a selective checkpointing mechanism. This mechanism allows instructions to commit out of order, and makes a reorder buffer unnecessary. We present a set of techniques such as multilevel instruction queues, late allocation and early release of registers, and early release of load/store queue entries. All together, these techniques constitute what we call a kilo-instruction processor, an architecture that can support thousands of in-flight instructions, and thus may achieve high performance even in the presence of large memory access latencies.

References

[1]

Akkary, H., Rajwar, R., and Srinivasan, S. T. 2003a. Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proceedings of the 36th International Symposium on Microarchitecture (San Diego, CA). 423--434.

[2]

Akkary, H., Rajwar, R., and Srinivasan, S. T. 2003b. Checkpoint processing and recovery: An efficient, scalable alternative to reorder buffers. IEEE Micro 23, 6, 11--19.

Digital Library

[3]

Baer, J.-L. and Chen, T.-F. 1991. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing'91 (Albuquerque, NM). 176--186.

[4]

Brekelbaum, E., Rupley, J., Wilkerson, C., and Black, B. 2002. Hierarchical scheduling windows. In Proceedings of the 35th International Symposium on Microarchitecture (Istanbul, Turkey). 27--36.

[5]

Bloom, B. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7, 422--426.

Digital Library

[6]

Cain, H. W. and Lipasti, M. H. 2004. Memory ordering: A value-based approach. In Proceedings of the 31st International Symposium on Computer Architecture (Munich, Germany). 90--101.

[7]

Collins, J. D., Tullsen, D. M., Wang, H., and Shen, J. P. 2001. Dynamic speculative precomputation. In Proceedings of the 34th International Symposium on Microarchitecture (Austin, TX). 306--317.

[8]

Chapell, R., Stark, J., Kim, S., Reinhardt, S., and Patt, Y. N. 1999. Simultaneous subordinate microthreading (SSMT). In Proceedings of the 26th International Symposium on Computer Architecture (Atlanta, GA). 186--195.

[9]

Chrysos, G. Z. and Emer, J. S. 1998. Memory dependence prediction using store sets. In Proceedings of the 25th International Symposium on Computer Architecture (Barcelona, Spain). 142--153.

[10]

Cristal, A., Valero, M., Gonzalez, A., and Llosa, J. 2002a. White paper: Grant proposal to Intel-MRL in January 2002. Universitat Politècnica de Catalunya, Barcelona, Spain.

[11]

Cristal, A., Valero, M., Gonzalez, A., and Llosa, J. 2002b. Large Virtual ROBs by Processor Checkpointing. Technical Report UPC-DAC-2002-39, Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain. Also submitted to the 35th International Symposium on Microarchitecture, in 2002.

[12]

Cristal, A., Martínez, J. F., Llosa, J., and Valero, M. 2003a. A case for resource-conscious out-of-order processors. IEEE TCCA Computer Architecture Letters 2.

[13]

Cristal, A., Ortega, D., Llosa, J., and Valero, M. 2003b. Kilo-instruction processors. In Proceedings of the 5th International Symposium on High-Performance Computing (Tokyo, Japan). 10--25. Keynote paper.

[14]

Cristal, A., Martínez, J. F., Llosa, J., and Valero, M. 2003c. Ephemeral Registers with Multicheckpointing. Technical Report UPC-DAC-2003-51, Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain.

[15]

Cristal, A., Ortega, D., Llosa, J., and Valero, M. 2004a. Out-of-order commit processors. In Proceedings of the 10th International Symposium on High-Performance Computer Architecture (Madrid, Spain). 48--59.

Digital Library

[16]

Cristal, A., Santana, O. J., and Valero, M. 2004b. Maintaining thousands of in-flight instructions. In Proceedings of the Euro-Par Conference (Pisa, Italy). Keynote paper.

[17]

Dubois, M. and Song, Y. 1998. Assisted Execution. Technical Report CENG 98--25, Department of EE-Systems, University of Southern California, Los Angeles, CA.

[18]

Dundas, J. and Mudge, T. 1997. Improving data cache performance by pre-executing instructions under a cache miss. In Proceedings of the 9th International Conference on Supercomputing (Vienna, Austria). 68--75.

[19]

Galluzzi, M., Puente, V., Cristal, A., Beivide, R., Gregorio, J. A., and Valero, M. 2004. A first glance at kilo-instruction based multiprocessors. In Proceedings of the 1st Conference on Computer Frontiers (Ischia, Italy).

[20]

Hwu, W. M. and Patt, Y. N. 1987. Checkpoint repair for out-of-order execution machines. In Proceedings of the 14th International Symposium on Computer Architecture (Pittsburgh, PA). 18--26.

[21]

Joseph, D. and Grunwald, D. 1997. Prefetching using Markov predictors. In Proceedings of the 24th International Symposium on Computer Architecture (Denver, CO). 252--263.

[22]

Klaiber, A. and Levi, H. 1991. An architecture for software-controlled data prefetching. In Proceedings of the 18th International Symposium on Computer Architecture (Toronto, Canada). 53--53.

[23]

Lebeck, A., Koppanalil, T., Li, T., Patwardhan, J., and Rotenberg, E. 2002. A large, fast instruction window for tolerating cache misses. In Proceedings of the 29th International Symposium on Computer Architecture (Anchorage, AK). 59--70.

[24]

Martínez, J. F., Renau, J., Huang, M., Prvulovic, M., and Torrellas, J. 2002. Checkpointed early resource recycling in out-of-order microprocessors. In Proceedings of the 35th International Symposium on Microarchitecture (Istanbul, Turkey). 3--14.

[25]

Martínez, J. F., Cristal, A., Valero, M., and Llosa, J. 2003. Ephemeral Registers. Technical Report CSL-TR-2003-1035, Cornell Computer Systems Lab, Ithaca, NY.

[26]

Monreal, T., Gonzalez, A., Valero, M., Gonzalez, J., and Vi NALS, V. 1999. Delaying physical register allocation through virtual-physical registers. In Proceedings of the 32nd International Symposium on Microarchitecture (Haifa, Israel). 186--192.

[27]

Mowry, T., Lam, M., and Gupta, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, MA). 62--73.

[28]

Moudgill, M., Pingali, K., and Vassiliadis, S. 1993. Register renaming and dynamic speculation: An alternative approach. In Proceedings of the 26th International Symposium on Microarchitecture (Austin, TX). 202--213.

[29]

Mutlu, O., Stark, J., Wilkerson, C., and Patt, Y. N. 2003a. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (Anaheim, CA). 129--140.

[30]

Mutlu, O., Stark, J., Wilkerson, C., and Patt, Y. N. 2003b. Runahead execution: An effective alternative to large instruction windows. IEEE Micro 23, 6, 20--25.

Digital Library

[31]

Palacharla, S., Jouppi, N., and Smith, J. 1997. Complexity-effective superscalar processors. In Proceedings of the 24th International Symposium on Computer Architecture (Denver, CO). 206--218.

[32]

Park, I., Ooi, C., and Vijaykumar, T. 2003. Reducing design complexity of the load/store queue. In Proceedings of the 36th International Symposium on Microarchitecture (San Diego, CA). 411--422.

[33]

Roth, A. and Sohi, G. S. 2001. Speculative data-driven multithreading. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (Nuevo Leone, Mexico). 37--48.

[34]

Sethumadhavan, S., Desikan, R., Burger, D., Moore, C., and Keckler, S. 2003. Scalable hardware memory disambiguation for high ILP processors. In Proceedings of the 36th International Symposium on Microarchitecture (San Diego, CA). 399--410.

[35]

Sherwood, T., Perelman, E., and Calder, B. 2001. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Barcelona, Spain). 3--14.

[36]

Smith, A. 1982. Cache memories. Computing Surveys 14, 3, 473--530.

Digital Library

[37]

Smith, J. E. and Pleszkun, A. R. 1985. Implementation of precise interrupts in pipelined processors. In Proceedings of the 12th International Symposium on Computer Architecture (Boston, MA). 36--44.

[38]

Sohilin, Y., Lee, J., and Torrellas, J. 2002. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th International Symposium on Computer Architecture (Anchorage, AK). 171--182.

[39]

Srinivasan, S. T., Rajwar, R., Akkary, H., Gandhi, A., and Upton, M. 2004. Continual flow pipelines. In Proceedings of the 11st International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, MA). 107--119.

[40]

Tendler, J. M., Dodson, S., Fields, S., Le, H., and Sinharoy, B. 2001. IBM &commat;server POWER4 System Microarchitecture. IBM Technical White Paper. IBM Server Group.

[41]

Zilles, C. and Sohi, G. S. 2001. Execution-based prediction using speculative slices. In Proceedings of the 28th International Symposium on Computer Architecture (Göteborg, Sweden). 2--13.

Cited By

Mehta S(2023)Speculative Register Reclamation2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071122(1182-1194)Online publication date: Feb-2023
https://doi.org/10.1109/HPCA56546.2023.10071122
Ham TAragón JMartonosi M(2019)Efficient Data Supply for Parallel Heterogeneous ArchitecturesACM Transactions on Architecture and Code Optimization10.1145/331033216:2(1-23)Online publication date: 26-Apr-2019
https://dl.acm.org/doi/10.1145/3310332
Jeong ILee CKim KRo W(2019)OverCome: Coarse-Grained Instruction Commit with Handover Register RenamingIEEE Transactions on Computers10.1109/TC.2019.293655768:12(1802-1816)Online publication date: 1-Dec-2019
https://doi.org/10.1109/TC.2019.2936557
Show More Cited By

Index Terms

Toward kilo-instruction processors

Recommendations

Kilo-instruction processors, runahead and prefetching
CF '06: Proceedings of the 3rd conference on Computing frontiers

There is a continuous research effort devoted to overcome the memory wall problem. Prefetching is one of the most frequently used techniques. A prefetch mechanism anticipates the processor requests by moving data into the lower levels of the memory ...
Kilo-Instruction Processors: Overcoming the Memory Wall

Kilo-instruction processors are a new type of out-of-order superscalar processor that overlaps long memory access delays by maintaining thousands of in-flight instructions, in a scalable, efficient manner.
An evaluation of speculative instruction execution on simultaneous multithreaded processors

Modern superscalar processors rely heavily on speculative execution for performance. For example, our measurements show that on a 6-issue superscalar, 93% of committed instructions for SPECINT95 are speculative. Without speculation, processor resources ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 1, Issue 4

December 2004

107 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/1044823

Issue’s Table of Contents

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2004

Published in TACO Volume 1, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

93
Total Citations
View Citations
1,663
Total Downloads

Downloads (Last 12 months)120
Downloads (Last 6 weeks)22

Reflects downloads up to 01 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Mehta S(2023)Speculative Register Reclamation2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071122(1182-1194)Online publication date: Feb-2023
https://doi.org/10.1109/HPCA56546.2023.10071122
Ham TAragón JMartonosi M(2019)Efficient Data Supply for Parallel Heterogeneous ArchitecturesACM Transactions on Architecture and Code Optimization10.1145/331033216:2(1-23)Online publication date: 26-Apr-2019
https://dl.acm.org/doi/10.1145/3310332
Jeong ILee CKim KRo W(2019)OverCome: Coarse-Grained Instruction Commit with Handover Register RenamingIEEE Transactions on Computers10.1109/TC.2019.293655768:12(1802-1816)Online publication date: 1-Dec-2019
https://doi.org/10.1109/TC.2019.2936557
Alipour MCarlson TBlack-Schaffer DKaxiras S(2019)Maximizing Limited ResourcesJournal of Signal Processing Systems10.1007/s11265-018-1369-491:3-4(379-397)Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s11265-018-1369-4
Jin ZÖnder S(2018)A two-phase recovery mechanismProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205300(107-117)Online publication date: 12-Jun-2018
https://dl.acm.org/doi/10.1145/3205289.3205300
Ros ACarlson TAlipour MKaxiras S(2017)Non-Speculative Load-Load Reordering in TSOACM SIGARCH Computer Architecture News10.1145/3140659.308022045:2(187-200)Online publication date: 24-Jun-2017
https://dl.acm.org/doi/10.1145/3140659.3080220
Ros ACarlson TAlipour MKaxiras S(2017)Non-Speculative Load-Load Reordering in TSOProceedings of the 44th Annual International Symposium on Computer Architecture10.1145/3079856.3080220(187-200)Online publication date: 24-Jun-2017
https://dl.acm.org/doi/10.1145/3079856.3080220
Ham TAragón JMartonosi M(2017)Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous ArchitecturesACM Transactions on Architecture and Code Optimization10.1145/307562014:2(1-27)Online publication date: 28-Jun-2017
https://dl.acm.org/doi/10.1145/3075620
Alipour MCarlson TKaxiras SGiorgi RBecchi MPalumbo F(2017)Exploring the Performance Limits of Out-of-order CommitProceedings of the Computing Frontiers Conference10.1145/3075564.3075581(211-220)Online publication date: 15-May-2017
https://dl.acm.org/doi/10.1145/3075564.3075581
Lee DLee SRyu SChoi K(2017)Dirty-Block Tracking in a Direct-Mapped DRAM Cache with Self-Balancing DispatchACM Transactions on Architecture and Code Optimization10.1145/306846014:2(1-25)Online publication date: 10-May-2017
https://dl.acm.org/doi/10.1145/3068460
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Figures

Tables

Media

View Issue’s Table of Contents