ABSTRACT
This paper presents Cooperative Cache Partitioning (CCP) to allocate cache resources among threads concurrently running on CMPs. Unlike cache partitioning schemes that use a single spatial partition repeatedly throughout a stable program phase, CCP resolves cache contention with multiple time-sharing partitions. Timesharing cache resources among partitions allows each thrashing thread to speed up dramatically in at least one partition by unfairly shrinking other threads' capacity allocations, while improving fairness by giving different partitions equal chance to execute. Quality-of-Service (QoS) is guaranteed over the long term by orchestrating the shrink and expansion of each thread's capacity across partitions to bound the average slowdown. Time-sharing based cache partitioning is further integrated with CMP cooperative caching [6] to exploit the benefits of LRU-based latency optimizations, which leads to a simplified partitioning algorithm and better performance for workloads that do not benefit from cache partitioning. We evaluate the effectiveness of CCP by simulating a 4-core CMP running all combinations of 7 representative SPEC2000 benchmarks. For workloads that can benefit from cache partitioning, CCP achieves up to 60%, and on average 12%, better performance than the exhaustive search of optimal static partitions. Overall, CCP provides the best results on almost all evaluation metrics for different cache sizes.
- L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA-27), 2000. Google ScholarDigital Library
- B. M. Beckmann, M. R. Marty, and D. A. Wood. ASR: Adaptive Selective Replication for CMP Caches. In Proceedings of the 39th Annual International Symposium on Microarchitecture (MICRO-39), 2006. Google ScholarDigital Library
- D. Bertsekas and R. Gallager. Data Networks (2nd ed.). Prentice-Hall, 1992. Google ScholarDigital Library
- J. Bruno, E. Gabber, B. Özden, and A. Silberschatz. The Eclipse Operating System: Providing Quality of Service via Reservation Domains. In USENIX 1998 Annual Technical Conference, 1998. Google ScholarDigital Library
- R. W. Carr and J. L. Hennessy. WSClock - a Simple and Effective Algorithm for Virtual Memory Management. In Proceedings of the 8th ACM Symposium on Operating Systems Principles (SOSP-8), 1981. Google ScholarDigital Library
- J. Chang and G. S. Sohi. Cooperative Caching for Chip Multiprocessors. In Proceedings of the 33th Annual International Symposium on Computer Architecture (ISCA-33), 2006. Google ScholarDigital Library
- D. T. Chiou. Extending the Reach of Microprocessors: Column and Curious Caching. PhD thesis, MIT, 1999. Google ScholarDigital Library
- Z. Chishti, M. D. Powell, and T. N. Vijaykumar. Optimizing Replication, Communication and Capacity Allocation in CMPs. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA-32), 2005. Google ScholarDigital Library
- P. J. Denning. Thrashing: Its Causes and Prevention. In AFIPS 1968 Fall Joint Computer Conference, volume 33, pages 915--922, 1968. Google ScholarDigital Library
- H. Dybdahl and P. Stenstrom. An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors. In Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA-13), 2007. Google ScholarDigital Library
- H. Dybdahl, P. Stenstrom, and L. Natvig. A Cache-Partition Aware Replacement Policy for Chip Multiprocessors. In ACM 2006 Conference on High Performance Computing (HiPC-13), 2006. Google ScholarDigital Library
- R. Fagin and J. H. Williams. A fair carpool scheduling algorithm. IBM Journal of Research and Development, 27(2):133--139, 1983. Google ScholarDigital Library
- A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Performance Of Multithreaded Chip Multiprocessors And Implications For Operating System Design. In USENIX 2005 Annual Technical Conference, 2005. Google ScholarDigital Library
- L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, 20(2):71--84, 2000. Google ScholarDigital Library
- J. L. Hellerstein. Achieving service rate objectives with decay usage scheduling. IEEE Transaction of Software Engineering, 19(8), 1993. Google ScholarDigital Library
- L. R. Hsu, S. K. Reinhardt, R. Iyer, and S. Makineni. Communist, Utilitarian, and Capitalist Cache Policies on CMPs: Caches as a Shared Resource. In Proceedings of the 15th International Conference on Parallel Architecture and Compilation Techniques (PACT-15), 2006. Google ScholarDigital Library
- J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler. A NUCA Substrate for Flexible CMP Cache Sharing. In Proceedings of the 19th ACM International Conference on Supercomputing (ICS-19), 2005. Google ScholarDigital Library
- R. Iyer. CQoS: a Framework for Enabling QoS in Shared Caches of CMP Platforms. In Proceedings of the 18th ACM International Conference on Supercomputing (ICS-18), 2004. Google ScholarDigital Library
- L. K. John. More on Finding a Single number to Indicate Overall Performance of a Benchmark Suite. SIGARCH Computer Architecture News, 32(1), 2004. Google ScholarDigital Library
- S. Kim, D. Chandra, and Y. Solihin. Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. In Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (PACT-13), 2004. Google ScholarDigital Library
- C. Liu, A. Sivasubramaniam, and M. Kandemir. Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs. In Proceedings of the 10th International Symposium on High-Performance Computer Architecture (HPCA-10), 2004. Google ScholarDigital Library
- K. Luo, J. Gummaraju, and M. Franklin. Balancing Throughput and Fairness in SMT Processors. In Proceedings of the 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2001), 2001.Google Scholar
- P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50--58, Feb 2002. Google ScholarDigital Library
- M. M. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. Computer Architecture News, 2005. Google ScholarDigital Library
- K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. Fair Queuing Memory Systems. In Proceedings of the 39th Annual International Symposium on Microarchitecture (MICRO-39), 2006. Google ScholarDigital Library
- A. K. Parekh and R. G. Gallager. A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: the Single-node Case. IEEE Transaction of Networks, 1(3):344--357, 1993. Google ScholarDigital Library
- P. Petoumenos, G. Keramidas, H. Zeffer, S. Kaxiras, and E. Hagersten. STATSHARE: A Statistical Model for Managing Cache Sharing via Decay. In Second Annual Workshop on Modeling, Benchmarking and Simulation (MoBS 2006), 2006.Google Scholar
- M. M. Planas, F. Cazorla, A. Ramirez, and M. Valero. Explaining Dynamic Cache Partitioning Speed Ups. IEEE Computer Architecture Letters, 6(1), 2007. Google ScholarDigital Library
- M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In Proceedings of the 39th Annual International Symposium on Microarchitecture (MICRO-39), 2006. Google ScholarDigital Library
- M. K. Qureshi, D. Thompson, and Y. N. Patt. Tl e V-way Cache: Demand Based Associativity via Global Replacement. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA-32), 2005. Google ScholarDigital Library
- N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural Support for Operating System-Driven CMP Cache Management. In Proceedings of the 15th International Conference on Parallel Architecture and Compilation Techniques (PACT-15), 2006. Google ScholarDigital Library
- J. E. Smith. Characterizing Computer Performance with a Single Number. Communication of ACM, 31(10), 1988. Google ScholarDigital Library
- A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreaded Processor. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), 2000. Google ScholarDigital Library
- A. Snavely, D. M. Tullsen, and G. Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In Proceedings of 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2002. Google ScholarDigital Library
- E. Speight, H. Shafi, L. Zhang, and R. Rajamony. Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA-32), 2005. Google ScholarDigital Library
- H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cache memory. IEEE Transaction of Computers, 41(9):1054--1068, 1992. Google ScholarDigital Library
- G. E. Suh, S. Devadas, and L. Rudolph. A New Memory Monitoring Scheme for Memory-aware Scheduling and Partitioning. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture (HPCA-8), 2002. Google ScholarDigital Library
- G. E. Suh, L. Rudolph, and S. Devadas. Dynamic Partitioning of Shared Cache Memory. Journal of Supercomputing, 28(1):7--26, 2004. Google ScholarDigital Library
- J. M. Tendler, J. S. Dodson, J. S. F. Jr., H. Le, and B. Sinharoy. IBM Power4 system microarchitecture. IBM Journal of Research and Development, 46(1):5--26, 2002. Google ScholarDigital Library
- B. Verghese, A. Gupta, and M. Rosenblum. Performance Isolation: Sharing and Isolation in Shared-memory Multiprocessors. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), 1998. Google ScholarDigital Library
- C. A. Waldspurger. Lottery and stride scheduling: Flexible proportional-share resource management. Technical Report MIT/LCS/TR-667, Cambridge, MA, USA, 1995. Google ScholarDigital Library
- T. Y. Yeh and G. Reinman. Fast and Fair: Data-stream Quality of Service. In Proceedings of the 2005 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'05), 2005. Google ScholarDigital Library
- M. Zhang and K. Asanovic. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled CMPs. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA-32), 2005. Google ScholarDigital Library
Index Terms
- Cooperative cache partitioning for chip multiprocessors
Recommendations
Cooperative Caching for Chip Multiprocessors
This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through ...
Cooperative cache partitioning for chip multiprocessors
ICS '07: Proceedings of the 21st annual international conference on SupercomputingThis paper presents Cooperative Cache Partitioning (CCP) to allocate cache resources among threads concurrently running on CMPs. Unlike cache partitioning schemes that use a single spatial partition repeatedly throughout a stable program phase, CCP ...
IPC-Based Cache Partitioning: An IPC-Oriented Dynamic Shared Cache Partitioning Mechanism
ICHIT '08: Proceedings of the 2008 International Conference on Convergence and Hybrid Information TechnologyIn a chip-multiprocessor with a shared cache structure, the last level cache is shared by multiple applications executing simultaneously. The competing accesses from different applications degrade the system performance, resulting in non-predicting ...
Comments