Multigrain shared memory

Published: 01 May 2000
Abstract

Parallel workstations, each comprising tens of processors based on shared memory, promise cost-effective scalable multiprocessing. This article explores the coupling of such small- to medium-scale shared-memory multiprocessors through software over a local area network to synthesize larger shared-memory systems. We call these systems Distributed Shared-memory MultiProcessors (DSMPs). This article introduces the design of a shared-memory system that uses multiple granularities of sharing, called MGS, and presents a prototype implementation of MGS on the MIT Alewife multiprocessor. Multigrain shared memory enables the collaboration of hardware and software shared memory, thus synthesizing a single transparent shared-memory address space across a cluster of multiprocessors. The system leverages the efficient support for fine-grain cache-line sharing within multiprocessor nodes as often as possible, and resorts to coarse-grain page-level sharing across nodes only when absolutely necessary. Using our prototype implementation of MGS, an in-depth study of several shared-memory applications is conducted to understand the behavior of DSMPs. Our study is the first to comprehensively explore the DSMP design space, and to compare the performance of DSMPs against all-software and all-hardware DSMs on a single experimental platform. Keeping the total number of processors fixed, we show that applications execute up to 85% faster on a DSMP as compared to an all-software DSM. We also show that all-hardware DSMs hold a significant performance advantage over DSMPs on challenging applications, between 159% and 1014%. However, program transformations to improve data locality for these applications allow DSMPs to almost match the performance of an all-hardware multiprocessor of the same size.
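The two-level sharing idea described in the abstract can be illustrated with a toy model. The sketch below is not the MGS protocol. It only shows, under stated assumptions, how an access might be served by hardware at cache-line granularity when the page is already present in the node, and by software at page granularity otherwise. The names `Node` and `classify_access` and the sizes `CACHE_LINE_BYTES` and `PAGE_BYTES` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Assumed sizes for illustration only; not taken from the article.
CACHE_LINE_BYTES = 16
PAGE_BYTES = 1024


@dataclass
class Node:
    """One multiprocessor node, tracking the pages it currently holds a copy of."""
    node_id: int
    resident_pages: set = field(default_factory=set)


def classify_access(node: Node, address: int) -> str:
    """Return which mechanism would serve an access in this toy model."""
    page = address // PAGE_BYTES
    if page in node.resident_pages:
        # Page already present in this node: sharing among the node's
        # processors is left to hardware, one cache line at a time.
        return "hardware cache-line"
    # Page not present: software fetches the whole page from another node
    # over the network, after which later accesses stay inside the node.
    node.resident_pages.add(page)
    return "software page"


if __name__ == "__main__":
    node = Node(node_id=0)
    for addr in (0, CACHE_LINE_BYTES, PAGE_BYTES):
        print(addr, classify_access(node, addr))
```

In this model only the first touch of a page pays the coarse-grain software cost. That is one way to read the abstract's remark that page-level sharing is used only when necessary, and why improving data locality narrows the gap to all-hardware systems.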


Reviews

Farnaz Mounes-Toussi

The paper describes a distributed shared-memory multiprocessor system in which each node is a multiprocessor with hardware support for cache coherence. Nodes are connected through a local area network, and cache coherence between the nodes is supported by software. This arrangement allows programs to take advantage of fine-grain sharing at the cache-block level and coarse-grain sharing at the page level. Their performance studies indicate that this system architecture is promising in most cases. For some of the programs, compile-time analysis is required to transform the program to improve data locality. Although their experimental platform is not a true cluster of multiprocessors, they have made every attempt to account for the resulting inaccuracies. The authors provide excellent background information, and the paper is suitable for those with limited knowledge in this particular area. Overall, their study and the proposed ideas are well suited for near-future clusters.
