ABSTRACT
Advances in semiconductor technology have driven shared-memory servers toward processors with multiple cores per die and multiple threads per core. This paper presents simple hardware primitives enabling flexible and low-complexity multi-chip designs supporting an efficient inter-node coherence protocol implemented in software. We argue that our primitives and the example design presented in this paper have lower hardware overhead, have easier (and later) verification requirements, and provide the opportunity for flexible coherence protocols and simpler protocol bug corrections than traditional designs. Our evaluation is based on detailed full-system simulations of modern chip-multiprocessors and both commercial and HPC workloads. We compare a low-complexity system based on the proposed primitives with aggressive hardware multi-chip shared-memory systems and show that the performance is competitive across a large design space.