ABSTRACT
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with an embedded switched network for high performance. We show that this organization can support the full spectrum of sharing degrees: unshared, in which each processor has a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with a sharing degree of two or four works best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
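To make the sharing-degree spectrum concrete, the following is a minimal sketch of the arithmetic implied by the figures in the abstract (16 processors, 256 banks, 16MB). It assumes the cache is split into equal partitions and that a sharing degree of d means d processors share one partition. The function names and the contiguous grouping of processor IDs are hypothetical illustrations, not the paper's actual bank mapping policy.

```python
# Figures taken from the abstract.
NUM_PROCESSORS = 16
NUM_BANKS = 256
TOTAL_CAPACITY_KB = 16 * 1024  # 16MB expressed in KB


def partition_layout(sharing_degree):
    """Return the partition geometry for a given sharing degree.

    Assumption: the L2 is divided into equal partitions, each shared by
    `sharing_degree` processors.
    """
    if sharing_degree <= 0 or NUM_PROCESSORS % sharing_degree != 0:
        raise ValueError("sharing degree must evenly divide the processor count")
    num_partitions = NUM_PROCESSORS // sharing_degree
    return {
        "partitions": num_partitions,
        "banks_per_partition": NUM_BANKS // num_partitions,
        "capacity_kb_per_partition": TOTAL_CAPACITY_KB // num_partitions,
    }


def partition_of(processor_id, sharing_degree):
    """Map a processor to its partition (hypothetical contiguous grouping)."""
    if not 0 <= processor_id < NUM_PROCESSORS:
        raise ValueError("processor id out of range")
    return processor_id // sharing_degree


if __name__ == "__main__":
    for degree in (1, 2, 4, 8, 16):
        print(degree, partition_layout(degree))
```

Under these assumptions, a sharing degree of 1 corresponds to the unshared case (each processor has 16 banks, 1MB) and a sharing degree of 16 to the completely shared case (all 256 banks, 16MB); the degrees of two and four that the abstract favors give each group 2MB and 4MB respectively.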