ABSTRACT
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.