research-article

Variable latency caches for nanoscale processor

Authors:

Serkan Ozdemir,

Arindam Mallik,

Yehea Ismail

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Article No.: 20, Pages 1 - 10

https://doi.org/10.1145/1362622.1362650

Published: 10 November 2007

Abstract

Variability is one of the important issues in nanoscale processors. Due to the increasing importance of interconnect structures in submicron technologies, physical location and phenomena such as coupling have a growing impact on the latency of operations. Therefore, the traditional view of rigid access latencies to components will result in suboptimal architectures. In this paper, we devise a cache architecture with variable access latency. In particular, we a) develop a non-uniform access level 1 data cache, b) study the impact of coupling and physical location on level 1 data cache access latencies, and c) develop and study an architecture in which the variable latency cache can be accessed while the rest of the pipeline remains synchronous. To find the access latency under different input address transitions and environmental conditions, we first build a SPICE model at a 45nm technology for a cache similar to the level 1 data cache of the Intel Prescott architecture. Motivated by the large difference between the worst-case and best-case latencies and by the shape of the distribution curve, we change the cache architecture to allow variable latency accesses. Since the latency of the cache is not known at the time of instruction scheduling, we also modify the functional units by adding special queues that temporarily store the dependent instructions and allow the data to be forwarded correctly from the cache to the functional units. Simulations based on SPEC2000 benchmarks show that our variable access latency cache structure can reduce execution time by as much as 19.4%, and by 10.7% on average, compared to a conventional cache architecture.
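The motivation stated in the abstract (a wide gap between worst-case and best-case access latencies) can be illustrated with a toy calculation. The sketch below is not the authors' model. The latency samples, the clock period, and all function names are hypothetical, chosen only to show why charging every access the worst-case cycle count can cost more than charging each access the cycles it actually needs.

```python
def cycles_needed(latency_ps, clock_period_ps):
    """Whole clock cycles needed to cover an access of latency_ps (integer ceiling)."""
    return -(-latency_ps // clock_period_ps)


def fixed_latency_cycles(latencies_ps, clock_period_ps):
    """Conventional cache: every access is charged the worst-case latency."""
    return cycles_needed(max(latencies_ps), clock_period_ps)


def mean_variable_latency_cycles(latencies_ps, clock_period_ps):
    """Variable-latency cache: each access is charged only the cycles it needs."""
    total = sum(cycles_needed(t, clock_period_ps) for t in latencies_ps)
    return total / len(latencies_ps)


# Hypothetical access latencies in picoseconds (illustrative, not from the paper).
sample_latencies_ps = [310, 340, 360, 390, 420, 480, 610, 790]
clock_period_ps = 250  # hypothetical clock period

fixed = fixed_latency_cycles(sample_latencies_ps, clock_period_ps)
variable = mean_variable_latency_cycles(sample_latencies_ps, clock_period_ps)

print(f"fixed worst-case latency: {fixed} cycles per access")
print(f"mean variable latency:    {variable} cycles per access")
```

With these made-up numbers, most accesses finish in two cycles while the worst case needs four, so the average charged latency falls well below the fixed worst-case figure. The size of any real benefit depends on the measured latency distribution and on the pipeline changes described in the abstract, neither of which this sketch models.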

Published In

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing

November 2007

723 pages

ISBN:9781595937643

DOI:10.1145/1362622

General Chair:
Becky Verastegui
Oak Ridge National Laboratory

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC '07

Sponsor:

SIGARCH
IEEE-CS

SC '07: International Conference for High Performance Computing, Networking, Storage and Analysis

November 10 - 16, 2007

Reno, Nevada

Acceptance Rates

SC '07 Paper Acceptance Rate 54 of 268 submissions, 20%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 4
Total Downloads: 191
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 0

Reflects downloads up to 07 Mar 2025

Cited By

Srivastav M, Rimjhim, Mishra R, Grover A, Dhori K, Rawat H (2022). 3-Stage Pipelined Hierarchical SRAMs with Burst Mode Read in 65nm LSTP CMOS. 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1546-1550. Online publication date: 28-May-2022. https://doi.org/10.1109/ISCAS48785.2022.9937582

Kong J, Hur J (2020). Near-Threshold L1 Data Cache for Yield Management Under Process Variations. IEEE Access, 8, pp. 18558-18570. Online publication date: 2020. https://doi.org/10.1109/ACCESS.2020.2968603

Zhou Y, Zhang C, Sun G, Wang K, Zhang Y, Ayala J, Jones A, Madden P, Coskun A (2013). Asymmetric-access aware optimization for STT-RAM caches with process variations. Proceedings of the 23rd ACM international conference on Great lakes symposium on VLSI, pp. 143-148. Online publication date: 2-May-2013. https://dl.acm.org/doi/10.1145/2483028.2483079

Colmenar J, Garnica O, Lanchares J, Hidalgo J (2010). Simulating a LAGS processor to consider variable latency on L1 D-Cache. Proceedings of the 2010 Summer Computer Simulation Conference, pp. 56-63. Online publication date: 11-Jul-2010. https://dl.acm.org/doi/10.5555/1999416.1999421
