ABSTRACT
A simple and low-cost approach to supporting snoopy cache coherence is to logically embed a unidirectional ring in the network of a multiprocessor, and use it to transfer snoop messages. Other messages can use any link in the network. While this scheme works for any network topology, a naive implementation may result in long response times or in many snoop messages and snoop operations. To address this problem, this paper proposes Flexible Snooping algorithms, a family of adaptive forwarding and filtering snooping algorithms. In these algorithms, a node receiving a snoop request may either forward it to another node and then perform the snoop, or snoop and then forward it, or simply forward it without snooping. The resulting design space offers trade-offs in number of snoop operations and messages, response time, and energy consumption. Our analysis using SPLASH-2, SPECjbb, and SPECweb workloads finds several snooping algorithms that are more cost-effective than current ones. Specifically, our choice for a high-performance snooping algorithm is faster than the currently fastest algorithm while consuming 9-17% less energy; our choice for an energy-efficient algorithm is only 3-6% slower than the previous one while consuming 36-42% less energy.
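The three per-node choices described above can be illustrated with a small toy model. The following is a minimal sketch under stated assumptions, not the paper's simulator or protocol: the function `walk_ring`, the `Action` names, the unit hop and snoop latencies, and the rule that a positive snoop result carried in the message suppresses later snoops are all hypothetical simplifications introduced here for illustration.

```python
from enum import Enum


class Action(Enum):
    """The three choices a node has when a snoop request arrives."""
    FORWARD_THEN_SNOOP = 1   # pass the request on first, snoop in parallel
    SNOOP_THEN_FORWARD = 2   # snoop first, forward only when the result is known
    FORWARD_ONLY = 3         # filter: pass the request on without snooping


def walk_ring(num_nodes, requester, supplier, choose_action, hop=1, snoop=1):
    """Send one snoop request once around a unidirectional ring.

    Assumptions (hypothetical, for illustration only):
    - exactly one node, `supplier`, holds the requested line;
    - each hop costs `hop` time units and each snoop costs `snoop`;
    - if the supplier snoops before forwarding, the message carries the
      positive result and downstream nodes skip their snoops.

    Returns (snoop_operations, time_supplier_snoop_completes).
    The second value is None if the supplier was never snooped, which a
    correct filter must not allow for a node that may hold the line.
    """
    snoops = 0
    t = 0                 # time at which the request reaches the current node
    found_at = None       # time at which the supplier's snoop completes
    resolved = False      # True once the message itself carries the result
    node = (requester + 1) % num_nodes
    while node != requester:
        t += hop
        action = Action.FORWARD_ONLY if resolved else choose_action(node)
        if action is not Action.FORWARD_ONLY:
            snoops += 1
            if node == supplier:
                found_at = t + snoop
        if action is Action.SNOOP_THEN_FORWARD:
            t += snoop    # the request waits for the snoop before moving on
            if node == supplier:
                resolved = True
        node = (node + 1) % num_nodes
    return snoops, found_at


SUPPLIER = 3


def always_snoop_first(node):
    return Action.SNOOP_THEN_FORWARD


def always_forward_first(node):
    return Action.FORWARD_THEN_SNOOP


def perfect_predictor(node):
    # Idealized filter that knows exactly which node holds the line.
    if node == SUPPLIER:
        return Action.SNOOP_THEN_FORWARD
    return Action.FORWARD_ONLY


for policy in (always_snoop_first, always_forward_first, perfect_predictor):
    print(policy.__name__, walk_ring(8, 0, SUPPLIER, policy))
```

In this toy model with 8 nodes, always snooping first uses few snoops but delays the request at every node, always forwarding first reaches the supplier quickly but snoops every node, and the idealized predictor gets both the low snoop count and the short latency. This mirrors, in a very simplified way, the trade-off between snoop operations and response time that the abstract describes.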
Index Terms
- Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors