Comparative evaluation of memory models for chip multiprocessors

Authors:
Jacob Leverich, Stanford University
Hideho Arakida, Stanford University
Alex Solomatnikov, Stanford University
Amin Firoozshahian, Stanford University
Mark Horowitz, Stanford University
Christos Kozyrakis, Stanford University

ACM Transactions on Architecture and Code Optimization, Volume 5, Issue 3, Article No. 12, pp. 1–30
https://doi.org/10.1145/1455650.1455651

Published: 01 December 2008

Abstract

There are two competing models for the on-chip memory in Chip Multiprocessor (CMP) systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications on systems with up to 16 cores, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and nonallocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
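To make "streaming at the programming model level" more concrete, here is a minimal sketch of the general pattern: data is processed in blocks, with an explicit gather into a small local buffer, a compute step that only touches that buffer, and a bulk write-back. This is an illustration under stated assumptions, not the authors' code. The names `stream_map`, `scale_and_offset`, and `block_size` are hypothetical, and Python is used only to show the structure, not to model performance.

```python
from typing import Callable, List, Sequence


def stream_map(data: Sequence[float],
               kernel: Callable[[List[float]], List[float]],
               block_size: int = 4) -> List[float]:
    """Apply `kernel` to `data` one block at a time (gather, compute, scatter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    out: List[float] = [0.0] * len(data)
    for start in range(0, len(data), block_size):
        end = min(start + block_size, len(data))
        # Gather: copy one block into a small local buffer. With streaming
        # memory this would be an explicit bulk transfer into a local store;
        # with caches the block would simply become cache resident.
        local_in = list(data[start:end])
        # Compute: the kernel only touches the local buffer.
        local_out = kernel(local_in)
        if len(local_out) != len(local_in):
            raise ValueError("kernel must return one output per input")
        # Scatter: write the finished block back in one bulk step.
        out[start:end] = local_out
    return out


def scale_and_offset(block: List[float]) -> List[float]:
    # Hypothetical example kernel, not taken from the paper.
    return [2.0 * x + 1.0 for x in block]
```

Structuring the loop this way keeps the working set of each step small and regular, which is the locality benefit the abstract attributes to stream programming regardless of whether the underlying memory is cache-based or explicitly managed.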



Published in

ACM Transactions on Architecture and Code Optimization Volume 5, Issue 3
November 2008
102 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/1455650

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 December 2008
- Accepted: 1 July 2008
- Revised: 1 June 2008
- Received: 1 March 2007
Author Tags
Chip multiprocessors
cache coherence
locality optimizations
parallel programming
streaming memory
Qualifiers
- research-article
- Research
- Refereed
