DOI: 10.1145/1088149.1088154

A NUCA substrate for flexible CMP cache sharing

Published: 20 June 2005

ABSTRACT

We propose an organization for the on-chip memory system of a chip multiprocessor (CMP) in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the full spectrum of sharing degrees: unshared, in which each processor has a private portion of the cache, reducing hit latency; completely shared, in which every processor shares the entire cache, minimizing misses; and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with a sharing degree of two or four works best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
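
To make the sharing spectrum concrete, the following Python sketch works through the arithmetic implied by the figures in the abstract (16 processors, 16MB of L2, 256 banks). It is an illustration under stated assumptions, not the paper's method: it assumes a sharing degree d splits the processors into equal groups of d, that banks are divided evenly among groups, and that processors are grouped by consecutive ID. The function names `partition` and `partition_of` are hypothetical, and the bank mapping policies evaluated in the paper are not modeled.

```python
# Figures taken from the abstract.
NUM_PROCESSORS = 16
NUM_BANKS = 256
TOTAL_CACHE_KB = 16 * 1024  # 16MB expressed in KB


def partition(sharing_degree):
    """Return partition sizes for a given sharing degree.

    Assumption (not from the source): processors are split into equal
    groups of `sharing_degree`, and banks are divided evenly among groups.
    """
    if sharing_degree <= 0 or NUM_PROCESSORS % sharing_degree != 0:
        raise ValueError("sharing degree must evenly divide the processor count")
    num_partitions = NUM_PROCESSORS // sharing_degree
    banks_per_partition = NUM_BANKS // num_partitions
    bank_kb = TOTAL_CACHE_KB // NUM_BANKS  # 64KB per bank
    return {
        "partitions": num_partitions,
        "banks_per_partition": banks_per_partition,
        "capacity_kb_per_partition": banks_per_partition * bank_kb,
    }


def partition_of(processor_id, sharing_degree):
    """Map a processor to its partition, assuming grouping by consecutive ID."""
    return processor_id // sharing_degree


if __name__ == "__main__":
    for degree in (1, 2, 4, 8, 16):
        print(degree, partition(degree))
```

Under these assumptions, a sharing degree of 1 (unshared) gives each processor 16 banks, or 1MB, while a sharing degree of 16 (completely shared) gives all processors the full 256 banks. The degrees of two and four that the abstract favors correspond to 2MB and 4MB partitions respectively.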

Published in

ICS '05: Proceedings of the 19th annual international conference on Supercomputing
June 2005
414 pages
ISBN: 1595931678
DOI: 10.1145/1088149

Copyright © 2005 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Overall acceptance rate: 584 of 2,055 submissions, 28%
