DOI: 10.1145/1362622.1362650
research-article

Variable latency caches for nanoscale processor

Published: 10 November 2007

Abstract

Variability is one of the important issues in nanoscale processors. Due to the increasing importance of interconnect structures in submicron technologies, physical location and phenomena such as coupling have an increasing impact on the latency of operations. Therefore, the traditional view of rigid access latencies to components will result in suboptimal architectures. In this paper, we devise a cache architecture with variable access latency. In particular, we a) develop a non-uniform access level 1 data cache, b) study the impact of coupling and physical location on level 1 data cache access latencies, and c) develop and study an architecture where the variable latency cache can be accessed while the rest of the pipeline remains synchronous. To find the access latency under different input address transitions and environmental conditions, we first build a SPICE model at a 45nm technology for a cache similar to the level 1 data cache of the Intel Prescott architecture. Motivated by the large difference between the worst-case and best-case latencies and the shape of the distribution curve, we change the cache architecture to allow variable latency accesses. Since the latency of the cache is not known at the time of instruction scheduling, we also modify the functional units by adding special queues that temporarily store the dependent instructions and allow the data to be forwarded from the cache to the functional units correctly. Simulations based on SPEC2000 benchmarks show that our variable access latency cache structure can reduce the execution time by as much as 19.4%, and by 10.7% on average, compared to a conventional cache architecture.
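The motivation in the abstract, a large gap between worst-case and best-case access latency, can be illustrated with a minimal sketch. It assumes a hypothetical histogram of access latencies in cycles; the numbers and the function names `mean_access_cycles` and `worst_case_cycles` are invented for illustration and are not taken from the paper or its SPICE model.

```python
# Illustrative only: the histogram below is hypothetical, not data from the paper.


def mean_access_cycles(latency_histogram):
    """Average latency in cycles when each access completes as soon as its data is ready."""
    total_accesses = sum(latency_histogram.values())
    if total_accesses == 0:
        raise ValueError("histogram contains no accesses")
    weighted = sum(cycles * count for cycles, count in latency_histogram.items())
    return weighted / total_accesses


def worst_case_cycles(latency_histogram):
    """Latency in cycles that a fixed-latency design must assume for every access."""
    observed = [cycles for cycles, count in latency_histogram.items() if count > 0]
    if not observed:
        raise ValueError("histogram contains no accesses")
    return max(observed)


# Hypothetical distribution: access latency in cycles -> number of accesses.
example_histogram = {2: 60, 3: 30, 4: 10}

if __name__ == "__main__":
    fixed = worst_case_cycles(example_histogram)
    variable = mean_access_cycles(example_histogram)
    print(f"fixed-latency design: {fixed} cycles per access")
    print(f"variable-latency design: {variable} cycles per access on average")
```

The sketch covers cache access latency only. How much of that gap turns into shorter execution time depends on the rest of the pipeline, which is why the paper evaluates the design through full simulation.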

    Published In

    SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing
    November 2007
    723 pages
    ISBN:9781595937643
    DOI:10.1145/1362622

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Conference

    SC '07

    Acceptance Rates

    SC '07 Paper Acceptance Rate 54 of 268 submissions, 20%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 5
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 07 Mar 2025

    Citations

    Cited By

    • (2022) 3-Stage Pipelined Hierarchical SRAMs with Burst Mode Read in 65nm LSTP CMOS. 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1546-1550. DOI: 10.1109/ISCAS48785.2022.9937582. Online publication date: 28-May-2022
    • (2020) Near-Threshold L1 Data Cache for Yield Management Under Process Variations. IEEE Access, 8, pp. 18558-18570. DOI: 10.1109/ACCESS.2020.2968603. Online publication date: 2020
    • (2013) Asymmetric-access aware optimization for STT-RAM caches with process variations. Proceedings of the 23rd ACM international conference on Great lakes symposium on VLSI, pp. 143-148. DOI: 10.1145/2483028.2483079. Online publication date: 2-May-2013
    • (2010) Simulating a LAGS processor to consider variable latency on L1 D-Cache. Proceedings of the 2010 Summer Computer Simulation Conference, pp. 56-63. DOI: 10.5555/1999416.1999421. Online publication date: 11-Jul-2010