skip to main content
article
Free Access

Algorithms for scalable synchronization on shared-memory multiprocessors

Published:01 February 1991Publication History
Skip Abstract Section

Abstract

Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory.

We present a new scalable algorithm for spin locks that generates 0(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction. We also present a new scalable barrier algorithm that generates 0(1) remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes.

We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors. The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called “dance hall” architectures, in which shared memory locations are equally far from all processors. —From the Authors' Abstract

References

  1. 1 ADVE, S. V., ANn HILL, M. D. Weak ordering--A new definition. In Proceedings of the International Symposium on Computer Architecture (May 1990), 2-14. Google ScholarGoogle Scholar
  2. 2 AGARWAL, A., AND CHEmAN, M. Adaptive backoff synchronization techniques. In Proceedings of the International Symposium on Computer Architecture (May 1989), 396 406. Google ScholarGoogle Scholar
  3. 3 AGARWAL, A., SIMONI, R., HENNESSY, J., AND HOROWITZ, M. An evaluation of directory schemes for cache coherence. In Proceedings of the International Symposium on Computer Architecture (June 1988), 280-289. Google ScholarGoogle Scholar
  4. 4 ALMASI, G. S., ANn GOTTLIEB, A. Highly Parallel Computing. Benjamin/Cummings, Redwood City, Calif., 1989. Google ScholarGoogle Scholar
  5. 5 ANDERSON, T.E. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distributed Syst. 1, I (Jan. 1990), 6-16. Google ScholarGoogle Scholar
  6. 6 ANDEaSON, T. E., L~ZOWSKA, E. D., AND LEVY, H. M. The performance implications of thread management alternatives for shared-memory multiprocessors IEEE Trans. Comput. 38, 12 (Dec. 1989), 1631-1644. Google ScholarGoogle Scholar
  7. 7 A~CnmALn, J. AND BAER, J.-L. An economical solution to the cache coherence problem. In Proceedings of the International Symposium on Computer Architecture (1984), 355-362. Google ScholarGoogle Scholar
  8. 8 BBN Laboratories. Butterfly parallel processor overview. Tech. Rep. 6148, version 1, BBN Laboratories, Cambridge, Mass, Mar. 1986.Google ScholarGoogle Scholar
  9. 9 BROOKS, E. D., III. The butterfly barrier Int. J Parallel Program. 15, 4 (1986), 295-307. Google ScholarGoogle Scholar
  10. 10 BZOOKS, E. D., III. The shared memory hypercube. Parallel Comput. 6 (1988), 235-245.Google ScholarGoogle Scholar
  11. 11 DAVIS, H, AND HENNESSY, J. Characterizing the synchronization behavior of parallel programs. In Proceedings of the ACM Conference on Parallel Programming: Experzence with Apphcat~ons, Languages and Systems (July 1988), 198-211. Google ScholarGoogle Scholar
  12. 12 DIJKSTRA, E Solution of a problem in concurrent programming and control. Commun. ACM. 8, 9 (Sept. 1965), 569. Google ScholarGoogle Scholar
  13. 13 GHARACHORLOO, K., LENOSKI, D., LAUDON, J., GmBONS, P., GUPTA, A., AND HENNESSY, J. L. Memory consistency and event ordering in scalable shared.memory multiprocessors. In Proceedlngs of the International Symposium on Computer Architecture (May 1990), 15-26. Google ScholarGoogle Scholar
  14. 14 GOODMAN, J.R, VERNON, M. K., AND WOES% P.J. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (Apr. 1989), 64-75 Google ScholarGoogle Scholar
  15. 15 GOOVMAN, J R., AND WOEST, P. J. The Wisconsin Multlcube: A new large-scale cache coherent multiprocessor. In Proceedings of the International Symposium on Computer Architecture (May 1988), 422-431. Google ScholarGoogle Scholar
  16. 16 GOTTLIEB, A. Scalability, combining and the NYU ultracomputer. Ohio State University Parallel Computing Workshop, Mar. 1990. Invited Lecture.Google ScholarGoogle Scholar
  17. 17 GOTTLIEB, A., GRISHMAN, R., KRUSKAL, C. P., McAuLIFFE, K. P., RUDOLPH, L., AND SNIR, M. The NYU Ultracomputer--Designing an MIMD shared memory parallel computer. IEEE Trans. Comput. C-32, 2 (Feb. 1983), 175-189.Google ScholarGoogle Scholar
  18. 18 GRAUNI<E, G., AND THAKKAR, S. Synchronization algorithms for shared-memory multiprocessors. Computer 23, 6 (June 1990), 60-69 Google ScholarGoogle Scholar
  19. 19 HENSGEN, D , FINKEL, R , AND MANBER, U. Two algorithms for barrier synchroniztion. Int. J. Parallel Program. 17, 1 (1988) 1-17. Google ScholarGoogle Scholar
  20. 20 HERLIHY, M. A methodology for implementing highly concurrent data structures. In Proceed~ngs of the Second A CM Symposium on Principles and Practice of Parallel Programming (Mar 1990), 197-206. Google ScholarGoogle Scholar
  21. 21 JAYASIMHA, D. N. Distributed synchronizers. In Proceedings of the 1988 Internatwnal Conference on Parallel Processing (Aug. 1988), 23-27.Google ScholarGoogle Scholar
  22. 22 KRUSKAL, C.P., RUDOLPH, L., AND SNm, M. Efficient synchronization on multiprocessors with shared memory. In Proceedings of the F~fih ACM Symposium on Principles of Dzstributecl Computing (1986), 218-228. Google ScholarGoogle Scholar
  23. 23 KUMAR, M., AND PFISTER, G.F. The onset of hot spot contention In Proceedings of the 1986 International Conference on Parallel Processing (1986), 28-34.Google ScholarGoogle Scholar
  24. 24 LAMPORT, L. A new solution of Dijkstra's concurrent programming problem. Commun. ACM 17, 8 (Aug. 1974), 453-455. Google ScholarGoogle Scholar
  25. 25 LAMPORT, L. The mutual exclusion problem: Part I--A theory of interprocess communication: Part II--Statement and solutions. J ACM 33, 2 (Apr. 1986), 313-348. Google ScholarGoogle Scholar
  26. 26 LAMPORT, L. A fast mutual exclusion algorithm. ACM Trans. Comput. Syst. 5, 1 (Feb. 1987) 1-11. Google ScholarGoogle Scholar
  27. 27 LEE, C.A. Barrier synchronization over multistage interconnection networks. In Proceedrags of the Second IEEE Symposium on Parallel and Distributed Processing (Dallas, Tex, Dec 1990), 130-133.Google ScholarGoogle Scholar
  28. 28 LEE, J., AND RAMACHANDRAN, U. Synchronization with multiprocessor caches. In Proceedrags of the Internatwnal Symposlum on Computer Architecture (May 1990), 27-37. Google ScholarGoogle Scholar
  29. 29 LUBACHEVSK~, B. An approach to automating the verification of compact parallel coordination programs. I Acta Inf. 21 (1984), 125-169.Google ScholarGoogle Scholar
  30. 30 LUBACnEVSKY, B. Synchronization barrier and related tools for shared memory parallel programming. In Proceedings of the 1989 International Converence on Parallel Processing (Aug 1989), II-175-II-179.Google ScholarGoogle Scholar
  31. 31 MAEKAWA, M. A ~ algorithm for mutual exclusion in decentralized systems. ACM Trans. Comput. Syst. 3, 2 (May 1985), 145-159. Google ScholarGoogle Scholar
  32. 32 MELLOR-CRUMMEY, J.M. Concurrent queues: Practical fetch-and-~ algorithms. Tech. Rep. 229, Computer Science Dept., Univ. of Rochester, Nov. 1987.Google ScholarGoogle Scholar
  33. 33 MELLOR-CRUMMEY, J. M., AND SCOTT, M. L. Algorithms for scalable synchronization on shared-memory multiprocessors. Tech. Rep. 342, Computer Science Dept., Univ. of Rochester, Apr. 1990. Also COMP TR90-114, Dept. of Computer Science, Rice Univ., May 1990 Google ScholarGoogle Scholar
  34. 34 OUSTERHOUT, J. K, SCELZA, D. A., AND SINDHU, P. S. Medusa: An experiment in distributed operating system structure. Commun. ACM 23, 2 (Feb. 1980), 92-105. Google ScholarGoogle Scholar
  35. 35 P1596 Working Group of the IEEE Computer Society Microprocessor Standards Committee. SCI (scalable coherent interface): An overview of extended cache-coherence protocols, Feb 5, 1990. Draft 0.59 P1596/Part III-D.Google ScholarGoogle Scholar
  36. 36 PETERSON, G. L. A new solution to Lamport's concurrent programming problem using small shared variables. ACM Trans. Program. Lang. Syst. 5, 1 (Jan. 1983), 56-65. Google ScholarGoogle Scholar
  37. 37 PFISTER, G., BRANTLEY, W. C., GEORGE, D. A., HARVEY, S. L., KLEINFELDER, W. J., McAvLIFFE, K. P., MELTON, T. A., NORTON, V A., AND WEISS, J. The IBM research parallel processor prototype (RP3): Introduction and architecture. In Proceedings of the 1985 International Conference on Parallel Processing (Aug. 1985), 764-771.Google ScholarGoogle Scholar
  38. 38 PFISTER, G., AND NORTON, V. A. "Hot spot" contention and combining in multistage interconnection networks. IEEE Trans. Comput. C-34, 10 (Oct. 1985), 943 948.Google ScholarGoogle Scholar
  39. 39 RAYMOND, K. A tree-based algorithm for distributed mutual exclusion. ACM Trans. Cornput. Syst. 7, i (Feb. 1989), 61-77. Google ScholarGoogle Scholar
  40. 40 RAYNAL, M. Algorithms for Mutual Exclusion. MIT Press Series in Scientific Computation. MIT Press, Cambridge, Mass., 1986. Translated from the French by D. Beeson. Google ScholarGoogle Scholar
  41. 41 REED, D. P., A~D KANODIA, R K. Synchronization with eventcounts and sequencers. Commun. ACM22, 2 (Feb. 1979), 115-123. Google ScholarGoogle Scholar
  42. 42 RETTBERG, R. D., CROWTHER, W. R., CARVEY, P. P., AND TOMHNSON, R. S. The Monarch parallel processor hardware design. Computer 23, 4 (Apr. 1990), 18-30 Google ScholarGoogle Scholar
  43. 43 RICART, G., AND AGRAWALA, A.K. An optimal algorithm for mutual exclusion in computer networks. Commun. ACM. 24, i (Jan. 1981) 9-17. Google ScholarGoogle Scholar
  44. 44 RUDOLPH, L., A~D SEGALL, Z. Dynamic decentralized cache schemes for MIMD parallel processors. In Proceedings of the International Symposium on Computer Architecture (1984), 340-347. Google ScholarGoogle Scholar
  45. 45 SANDERS, B. A. The information structure of distributed mutual exclusion algorithms. ACM Trans. Comput. Syst. 5, 3 (Aug. 1987), 284--299 Google ScholarGoogle Scholar
  46. 46 SCHNEIDER, F.B. Synchronization in distributed programs. ACM Trans. Program. Lang. Syst. 4, 2 (Apr. 1982), 179-195. Google ScholarGoogle Scholar
  47. 47 SCOTT, M. L., LEBLANC, T. J., AND MARSH, B. D. Multi-model parallel programming in Psyche. In Proceedings of the Second ACM Symposium on Principles and Practice of Parallel Programming (Mar. 1990), 70-78. Google ScholarGoogle Scholar
  48. 48 TANG, P., AND YEW, P.-C. Processor self-scheduling for multiple-nested parallel loops. In Proceedtngs of the 1986 International Conference on Parallel Processing (Aug. 1986), 528-535.Google ScholarGoogle Scholar
  49. 49 TANG, P., AND Y~W, P.-C. Algorithms for distributing hot-spot addressing CSRD Rep. 617, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana- Champaign, Jan. 1987.Google ScholarGoogle Scholar
  50. 50 YEW, P.-C. Architecture of the Cedar parallel supercomputer. CSRD Rep 609, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, Aug. 1986.Google ScholarGoogle Scholar
  51. 51 YEW, P.-C., TZENG, N.-F., AND LAWRIE, D.H. Distributing hot-spot addressing in large-scale multiprocessors. IEEE Trans. Comput. C-36, 4 (Apr. 1987), 388-395. Google ScholarGoogle Scholar
  52. 52 ZAHORIAN, J., LAZOWSKA, E. D., AND EAGER, D. L. The effect of scheduling discipline on spin overhead in shared memory parallel processors. Tech. Rep. TR-89-07-03, Computer Science Dept., Univ. of Washington, July 1989.Google ScholarGoogle Scholar

Index Terms

  1. Algorithms for scalable synchronization on shared-memory multiprocessors

            Recommendations

            Reviews

            Roy Levin

            The authors have compiled a competent survey of spin-lock and barrier synchronization algorithms, including some new ones of their own. Two sections of the paper present these algorithms with clear pseudocode. Performance comparisons on a BBN Butterfly and a Sequent Symmetry support the conclusions in the subsequent section on architectural implications. The paper closes with a concise list of recommendations that will aid software practitioners who must navigate this sometimes bewildering space of possible algorithms. I was distracted by several small inconsistencies in notation and presentation that should have been picked up in the editorial process (like the references to the nonexistent Algorithms 8 and 11). It also took me a while to realize that a direct correspondence does not always exist between the algorithms whose code is presented and the algorithms whose performance is graphed. Finally, the space devoted to a lame attempt to summarize the correctness proof of the authors' intriguing new spin-lock algorithm would have been better spent explaining how a key race-condition is resolved. These inconsistencies are relatively small flaws in an otherwise well-organized and well-reasoned paper. I encourage designers of parallel architectures and parallel algorithms to read it.

            Access critical reviews of Computing literature here

            Become a reviewer for Computing Reviews.

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image ACM Transactions on Computer Systems
              ACM Transactions on Computer Systems  Volume 9, Issue 1
              Feb. 1991
              97 pages
              ISSN:0734-2071
              EISSN:1557-7333
              DOI:10.1145/103727
              Issue’s Table of Contents

              Copyright © 1991 ACM

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 1 February 1991
              Published in tocs Volume 9, Issue 1

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • article

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader