Multigrain shared memory

Authors:
Donald Yeung, Univ. of Maryland at College Park, College Park
John Kubiatowicz, Univ. of California at Berkeley, Berkeley
Anant Agarwal, Massachusetts Institute of Technology, Cambridge

ACM Transactions on Computer Systems, Volume 18, Issue 2, pp. 154–196. https://doi.org/10.1145/350853.350871

Published: 01 May 2000

Abstract

Parallel workstations, each comprising tens of processors based on shared memory, promise cost-effective scalable multiprocessing. This article explores the coupling of such small- to medium-scale shared-memory multiprocessors through software over a local area network to synthesize larger shared-memory systems. We call these systems Distributed Shared-memory MultiProcessors (DSMPs). This article introduces the design of a shared-memory system that uses multiple granularities of sharing, called MGS, and presents a prototype implementation of MGS on the MIT Alewife multiprocessor. Multigrain shared memory enables the collaboration of hardware and software shared memory, thus synthesizing a single transparent shared-memory address space across a cluster of multiprocessors. The system leverages the efficient support for fine-grain cache-line sharing within multiprocessor nodes as often as possible, and resorts to coarse-grain page-level sharing across nodes only when absolutely necessary. Using our prototype implementation of MGS, we conduct an in-depth study of several shared-memory applications to understand the behavior of DSMPs. Our study is the first to comprehensively explore the DSMP design space and to compare the performance of DSMPs against all-software and all-hardware DSMs on a single experimental platform. Keeping the total number of processors fixed, we show that applications execute up to 85% faster on a DSMP as compared to an all-software DSM. We also show that all-hardware DSMs hold a significant performance advantage over DSMPs on challenging applications, between 159% and 1014%. However, program transformations to improve data locality for these applications allow DSMPs to almost match the performance of an all-hardware multiprocessor of the same size.
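
The two granularities of sharing described in the abstract can be illustrated with a toy model. The sketch below is not the MGS implementation or its protocol. It only classifies each memory access as either served inside a node (where hardware cache-line coherence would apply) or requiring an inter-node page transfer (where software page-level sharing would apply). The class name `ToyDSMP`, the `PAGE_SIZE` value, and the rule that a page stays resident once fetched are all assumptions made for illustration; invalidation, write sharing, and consistency are deliberately left out.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 1024  # bytes per page; an assumed value, not taken from the article


@dataclass
class ToyDSMP:
    """Hypothetical toy model: processors grouped into multiprocessor nodes."""
    procs_per_node: int
    # node id -> set of page numbers currently mapped at that node
    resident: dict = field(default_factory=dict)
    hw_accesses: int = 0      # accesses satisfied within a node (fine grain)
    sw_page_fetches: int = 0  # accesses that needed a page from another node

    def node_of(self, proc: int) -> int:
        # Consecutive processor ids share a node.
        return proc // self.procs_per_node

    def access(self, proc: int, addr: int) -> str:
        node = self.node_of(proc)
        page = addr // PAGE_SIZE
        pages = self.resident.setdefault(node, set())
        if page in pages:
            # Page already mapped at this node: intra-node hardware sharing.
            self.hw_accesses += 1
            return "hardware"
        # Page not mapped here: model a coarse-grain software page fetch.
        pages.add(page)
        self.sw_page_fetches += 1
        return "software"


if __name__ == "__main__":
    dsmp = ToyDSMP(procs_per_node=4)
    for proc, addr in [(0, 0), (1, 16), (4, 0), (5, 32)]:
        print(proc, addr, dsmp.access(proc, addr))
```

In this model, larger nodes mean more accesses fall on the hardware path for a fixed total processor count, which is the intuition behind preferring fine-grain sharing within nodes and using page-level sharing across nodes only when needed.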


Reviews

Reviewer: Farnaz Mounes-Toussi

The paper describes a distributed shared-memory multiprocessor system in which each node is a multiprocessor with hardware support for cache coherence. Nodes are connected through a local area network, and cache coherence between the nodes is supported by software. This arrangement allows programs to take advantage of fine-grain sharing at the cache-block level and coarse-grain sharing at the page level. The authors' performance studies indicate that this system architecture is promising in most cases. For some of the programs, compile-time analysis is required to transform the program to improve data locality. Although the experimental platform is not a true cluster of multiprocessors, the authors have made every attempt to account for the resulting inaccuracies. They provide excellent background information, and the paper is suitable for those with limited knowledge of this particular area. Overall, the study and the proposed ideas are well suited to near-future clusters.

Published in

ACM Transactions on Computer Systems Volume 18, Issue 2
May 2000
108 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/350853

Copyright © 2000 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
distributed memory
symmetric multiprocessors
system of systems