Efficient address remapping in distributed shared-memory systems
Source: ACM Transactions on Architecture and Code Optimization (TACO)
Volume 3, Issue 2 (June 2006)
Pages: 209-229
Year of Publication: 2006
ISSN: 1544-3566
Authors
Lixin Zhang, IBM Austin Research Lab, Austin, TX
Mike Parker, Cray Inc.
John Carter, University of Utah, Salt Lake City, UT
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1138035.1138039

ABSTRACT

As processor performance continues to improve at a rate much higher than DRAM and network performance, we are approaching a time when large-scale distributed shared memory systems will have remote memory latencies measured in tens of thousands of processor cycles. The Impulse memory system architecture adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to control how data is accessed and cached and thereby improve cache and bus utilization and reduce the number of memory accesses required. Previous Impulse work focuses on uniprocessor systems and relies on software to flush processor caches when necessary to ensure data coherence. In this paper, we investigate an extension of Impulse to multiprocessor systems that extends the coherence protocol to maintain data coherence without requiring software-directed cache flushing. Specifically, the multiprocessor Impulse controller can gather/scatter data across the network while its coherence protocol guarantees that each gather request gets coherent data and each scatter request updates every coherent replica in the system. Our simulation results demonstrate that the proposed system can significantly outperform conventional systems, achieving an average speedup of 9X on four memory-bound benchmarks on a 32-processor system.
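
The gather/scatter idea in the abstract can be illustrated with a small software analogy. The sketch below is not the Impulse hardware or its programming interface: the class name `RemappedView`, its methods, and the sample data are all hypothetical, and the coherence protocol the paper contributes is not modeled. It only shows how an indirection vector lets scattered elements be read and written as one dense buffer.

```python
from typing import List, Sequence


class RemappedView:
    """Dense alias of sparse elements, resolved through an indirection vector.

    Hypothetical software stand-in for controller-side address remapping.
    """

    def __init__(self, memory: List[float], indices: Sequence[int]) -> None:
        self.memory = memory          # backing "physical" memory
        self.indices = list(indices)  # indirection vector: alias slot -> real index

    def gather(self) -> List[float]:
        """Read the scattered elements into one contiguous buffer."""
        return [self.memory[i] for i in self.indices]

    def scatter(self, values: Sequence[float]) -> None:
        """Write a contiguous buffer back to the scattered locations."""
        if len(values) != len(self.indices):
            raise ValueError("buffer length must match the indirection vector")
        for i, v in zip(self.indices, values):
            self.memory[i] = v


# Illustrative data only (not taken from the paper).
memory = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
view = RemappedView(memory, indices=[6, 1, 4])

dense = view.gather()                  # three scattered elements, now contiguous
view.scatter([x + 1 for x in dense])   # updates land at indices 6, 1, and 4
```

In this analogy the consumer touches only the three elements it needs instead of the cache lines that surround them, which is the cache and bus utilization benefit the abstract describes. In a multiprocessor, each gathered element may also be cached elsewhere, which is why the paper extends the coherence protocol rather than relying on software flushes.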

