DOI: 10.1145/1152779.1147354
Article

Data trace cache: an application specific cache architecture

Published: 17 September 2005

Abstract

Benefits of advances in processor technology have long been held hostage to the widening processor-memory gap. Off-chip memory access latency is one of the most critical parameters limiting system performance. Caches have been used to alleviate this problem by reducing the average memory access latency. The memory bottleneck assumes greater significance for high-performance computer architectures with high data throughput requirements, such as network processors.

This paper addresses the memory bottleneck with the goal of minimizing off-chip memory demand and average memory access latency by proposing the use of small, application-specific, compiler-visible data trace caches. We focus on tree data structures, which are responsible for a significant component of the memory traffic in several applications. We have observed that tree accesses create a simple-to-characterize trace of memory references, and we propose a data trace cache design to exploit the locality of reference in these data traces.

Our study reveals that, for small cache sizes (256-1024 bytes), data trace caches reduce the total number of misses for accesses to rooted tree data structures by 7% to 53% compared to a conventional cache across a variety of applications. Such caches are in keeping with the philosophy of victim caches, stream buffers, and pre-fetch buffers, in that a relatively small investment in silicon can realize a substantive reduction in off-chip memory bandwidth demand.
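To make the locality argument concrete, the following is a minimal sketch that generates the memory-reference trace of repeated lookups in a binary search tree and counts misses in a small direct-mapped cache of the sizes the abstract mentions (256-1024 bytes). The node size, line size, workload, and cache organization are illustrative assumptions, not the paper's data trace cache design; because every lookup re-touches the nodes near the root, the resulting trace exhibits exactly the kind of reuse a small trace-oriented cache is meant to capture.

```python
# Sketch: memory-reference trace from binary-tree lookups, fed to a small
# direct-mapped cache. Node layout, sizes, and the key mix are assumptions
# made for illustration only.
import random

NODE_SIZE = 32           # assumed bytes per tree node
LINE_SIZE = 32           # assumed cache line size in bytes
CACHE_SIZE = 512         # within the 256-1024 byte range studied
NUM_LINES = CACHE_SIZE // LINE_SIZE

class Node:
    __slots__ = ("key", "addr", "left", "right")
    def __init__(self, key, addr):
        self.key, self.addr = key, addr
        self.left = self.right = None

def build_tree(keys):
    """Build an unbalanced BST; each node gets a distinct synthetic address."""
    root, next_addr = None, 0x1000
    for k in keys:
        node = Node(k, next_addr)
        next_addr += NODE_SIZE
        if root is None:
            root = node
            continue
        cur = root
        while True:
            if k < cur.key:
                if cur.left is None:
                    cur.left = node; break
                cur = cur.left
            else:
                if cur.right is None:
                    cur.right = node; break
                cur = cur.right
    return root

def lookup_trace(root, key):
    """Return the sequence of node addresses touched by one lookup."""
    trace, cur = [], root
    while cur is not None:
        trace.append(cur.addr)
        if key == cur.key:
            break
        cur = cur.left if key < cur.key else cur.right
    return trace

def simulate_direct_mapped(trace):
    """Count misses for a direct-mapped cache over an address trace."""
    tags, misses = [None] * NUM_LINES, 0
    for addr in trace:
        block = addr // LINE_SIZE
        idx, tag = block % NUM_LINES, block // NUM_LINES
        if tags[idx] != tag:
            misses += 1
            tags[idx] = tag
    return misses

if __name__ == "__main__":
    random.seed(0)
    keys = random.sample(range(10_000), 1_000)
    root = build_tree(keys)
    # Repeated lookups: references near the root recur in every traversal,
    # which is the locality a dedicated data trace cache could exploit.
    full_trace = []
    for _ in range(2_000):
        full_trace.extend(lookup_trace(root, random.choice(keys)))
    misses = simulate_direct_mapped(full_trace)
    print(f"references={len(full_trace)} misses={misses} "
          f"miss rate={misses / len(full_trace):.2%}")
```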

References

[1]
Intel Itanium 2 Processor Hardware Developer's Manual, July 2002.
[2]
Intel IXP2800 Network Processor Hardware Reference Manual, November 2002.
[3]
C.-K. Luk and T. C. Mowry, "Automatic compiler-inserted prefetching for pointer-based applications." IEEE Transactions on Computers, vol. 48, no. 2, pp. 134--141, 1999.
[4]
T. C. Mowry, M. S. Lam, and A. Gupta, "Design and evaluation of a compiler algorithm for prefetching." in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems(ASPLOS), 1992, pp. 62--73.
[5]
J. Kim, R. M. Rabbah, K. V. Palem, and W.-F. Wong, "Adaptive compiler directed prefetching for epic processors." in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications(PDPTA), 2004, pp. 495--501.
[6]
J. Kim, K. V. Palem, and W.-F. Wong, "A framework for data prefetching using off-line training of markovian predictors." in Proceedings of the 20th International Conference on Computer Design (ICCD), VLSI in Computers and Processors, September 2002, pp. 340--347.
[7]
S. Jiang and X. Zhang, "LIRS: An efficient low inter-reference recency set replacement policy to improve buffer cache performance." in Proceedings of the International Conference on Measurements and Modeling of Computer Systems, SIGMETRICS, June 2002, pp. 31--42.
[8]
K. Hazelwood, M. C. Toburen, and T. M. Conte, "A case for exploiting memory-access persistence," in "Workshop on Memory Performance Issues, June 2001.
[9]
T. M. Chilimbi, M. D. Hill, and J. R. Larus, "Cache-conscious structure layout." in Proceedings of the 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1999, pp. 1--12.
[10]
R. M. Rabbah and K. V. Palem, "Data remapping for design space optimization of embedded memory systems." ACM Transactions in Embedded Computing Systems, vol. 2, no. 2, pp. 186--218, 2003.
[11]
P. C., "Locality and route caches," in Proceedings of the NSF Workshop on Internet Statistics Measurement and Analysis, February 1996.
[12]
T. cker Chiueh and P. Pradhan, "High performance routing table lookup using CPU caching," in Proceedings IEEE INFOCOM, The Conference on Computer Communications, Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 3, March 1999, pp. 1421--1428. {Online}. Available: citeseer.ist.psu.edu/article/chiueh99highperformance.html
[13]
K. Gopalan and T. cker Chiueh, "Improving route lookup performance using network processor cache." in Proceedings of the 2002 ACM/IEEE conference on Supercomputing, November 2002, pp. 1--10.
[14]
J.-L. Baer, D. Low, P. Crowley, and N. Sidhwaney, "Memory hierarchy design for a multiprocessor look-up engine." in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2003, pp. 206--216.
[15]
P. Gupta and N. McKeown, "Packet classification on multiple fields." in Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication(SIGCOMM), 1999, pp. 147--160.
[16]
V. Srinivasan, S. Suri, and G. Varghese, "Packet classification using tuple space search." in Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, 1999, pp. 135--146.
[17]
T. Wolf and M. Franklin, "Commbench --- a telecommunications benchmark for network processors." in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, April 2000, pp. 154--162. {Online}. Available: citeseer.ist.psu.edu/wolf00commbench.html
[18]
"The Stony Brook Algorithm Repository." {Online}. Available: http://www.cs.sunysb.edu/~algorith/
[19]
"Valgrind tool suite version - 2.1.2." {Online}. Available: http://www.valgrind.org
[20]
"Dinero IV Trace-Driven Uniprocessor Cache Simulator." {Online}. Available: http://www.cs.wisc.edu/~markhill/DineroIV
[21]
"CACTI, HP-Compaq Western Research Lab." {Online}. Available: http://research.compaq.com/wrl/people/jouppi/CACTI.html
[22]
I. Stoica, "Stateless core: A scalable approach for quality of service in the internet," 2001, doctoral Dissertation.

Cited By

  • Customized placement for high performance embedded processor caches. Proceedings of the 20th International Conference on Architecture of Computing Systems (ARCS 2007), pp. 69-82. Online publication date: 12-Mar-2007. DOI: 10.5555/1763274.1763280
  • Customized Placement for High Performance Embedded Processor Caches. Architecture of Computing Systems - ARCS 2007, pp. 69-82, 2007. DOI: 10.1007/978-3-540-71270-1_6

Published In

MEDEA '05: Proceedings of the 2005 workshop on MEmory performance: DEaling with Applications, systems and architecture
September 2005, 76 pages

Also published in: ACM SIGARCH Computer Architecture News, Volume 34, Issue 1 (Special issue: MEDEA'05), March 2006, 86 pages. ISSN: 0163-5964. DOI: 10.1145/1147349


Publisher

IEEE Computer Society

United States

Acceptance Rates

Overall Acceptance Rate: 6 of 9 submissions, 67%
