A NUCA substrate for flexible CMP cache sharing

Authors:
Jaehyuk Huh (The University of Texas at Austin)
Changkyu Kim (The University of Texas at Austin)
Hazim Shafi (Austin Research Laboratory, IBM Research)
Lixin Zhang (Austin Research Laboratory, IBM Research)
Doug Burger (The University of Texas at Austin)
Stephen W. Keckler (The University of Texas at Austin)

ICS '05: Proceedings of the 19th annual international conference on Supercomputing, June 2005, Pages 31–40. https://doi.org/10.1145/1088149.1088154

Published: 20 June 2005

ABSTRACT

We propose an organization for the on-chip memory system of a chip multiprocessor (CMP) in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with an embedded switched network for high performance. We show that this organization can support the full spectrum of sharing degrees: unshared, in which each processor has a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with a sharing degree of two or four works best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
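The sharing spectrum described in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the processor count, cache size, and bank count come from the abstract, while the even split of banks among sharing groups and the names `sharing_config` and `all_sharing_configs` are assumptions introduced here, not the paper's mapping policy.

```python
# Illustrative arithmetic only. The three constants are taken from the abstract;
# dividing the banks evenly among sharing groups is an assumption of this sketch.
NUM_PROCESSORS = 16
TOTAL_CACHE_MB = 16
NUM_BANKS = 256


def sharing_config(degree):
    """Return per-group resources when `degree` processors share one L2 partition."""
    if degree <= 0 or NUM_PROCESSORS % degree != 0:
        raise ValueError("sharing degree must evenly divide the processor count")
    groups = NUM_PROCESSORS // degree
    return {
        "sharing_degree": degree,
        "groups": groups,
        "banks_per_group": NUM_BANKS // groups,
        "mb_per_group": TOTAL_CACHE_MB / groups,
    }


def all_sharing_configs():
    """Enumerate every degree that divides the processor count.

    Degree 1 is the unshared case and degree 16 is the completely shared case.
    """
    return [
        sharing_config(d)
        for d in range(1, NUM_PROCESSORS + 1)
        if NUM_PROCESSORS % d == 0
    ]


if __name__ == "__main__":
    for cfg in all_sharing_configs():
        print(cfg)
```

Under these assumptions, a sharing degree of four yields four groups, each with 64 banks and 4MB of L2 capacity. The trade-off the abstract names shows up directly: lower degrees keep each processor's data in fewer, nearer banks, while higher degrees give each processor access to more total capacity.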

Published in
ICS '05: Proceedings of the 19th annual international conference on Supercomputing
June 2005
414 pages
ISBN: 1595931678
DOI: 10.1145/1088149
General Chair: Arvind (MIT)
Program Chair: Larry Rudolph (MIT)
Copyright © 2005 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache sharing
chip-multiprocessor
non-uniform cache architecture
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 584 of 2,055 submissions, 28%

Article Metrics
- Total Citations: 156
- Total Downloads: 998
- Downloads (last 12 months): 5
- Downloads (last 6 weeks): 0