Abstract
Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory.
We present a new scalable algorithm for spin locks that generates 0(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction. We also present a new scalable barrier algorithm that generates 0(1) remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes.
We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors. The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called “dance hall” architectures, in which shared memory locations are equally far from all processors. —From the Authors' Abstract
- 1 ADVE, S. V., ANn HILL, M. D. Weak ordering--A new definition. In Proceedings of the International Symposium on Computer Architecture (May 1990), 2-14. Google Scholar
- 2 AGARWAL, A., AND CHEmAN, M. Adaptive backoff synchronization techniques. In Proceedings of the International Symposium on Computer Architecture (May 1989), 396 406. Google Scholar
- 3 AGARWAL, A., SIMONI, R., HENNESSY, J., AND HOROWITZ, M. An evaluation of directory schemes for cache coherence. In Proceedings of the International Symposium on Computer Architecture (June 1988), 280-289. Google Scholar
- 4 ALMASI, G. S., ANn GOTTLIEB, A. Highly Parallel Computing. Benjamin/Cummings, Redwood City, Calif., 1989. Google Scholar
- 5 ANDERSON, T.E. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distributed Syst. 1, I (Jan. 1990), 6-16. Google Scholar
- 6 ANDEaSON, T. E., L~ZOWSKA, E. D., AND LEVY, H. M. The performance implications of thread management alternatives for shared-memory multiprocessors IEEE Trans. Comput. 38, 12 (Dec. 1989), 1631-1644. Google Scholar
- 7 A~CnmALn, J. AND BAER, J.-L. An economical solution to the cache coherence problem. In Proceedings of the International Symposium on Computer Architecture (1984), 355-362. Google Scholar
- 8 BBN Laboratories. Butterfly parallel processor overview. Tech. Rep. 6148, version 1, BBN Laboratories, Cambridge, Mass, Mar. 1986.Google Scholar
- 9 BROOKS, E. D., III. The butterfly barrier Int. J Parallel Program. 15, 4 (1986), 295-307. Google Scholar
- 10 BZOOKS, E. D., III. The shared memory hypercube. Parallel Comput. 6 (1988), 235-245.Google Scholar
- 11 DAVIS, H, AND HENNESSY, J. Characterizing the synchronization behavior of parallel programs. In Proceedings of the ACM Conference on Parallel Programming: Experzence with Apphcat~ons, Languages and Systems (July 1988), 198-211. Google Scholar
- 12 DIJKSTRA, E Solution of a problem in concurrent programming and control. Commun. ACM. 8, 9 (Sept. 1965), 569. Google Scholar
- 13 GHARACHORLOO, K., LENOSKI, D., LAUDON, J., GmBONS, P., GUPTA, A., AND HENNESSY, J. L. Memory consistency and event ordering in scalable shared.memory multiprocessors. In Proceedlngs of the International Symposium on Computer Architecture (May 1990), 15-26. Google Scholar
- 14 GOODMAN, J.R, VERNON, M. K., AND WOES% P.J. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (Apr. 1989), 64-75 Google Scholar
- 15 GOOVMAN, J R., AND WOEST, P. J. The Wisconsin Multlcube: A new large-scale cache coherent multiprocessor. In Proceedings of the International Symposium on Computer Architecture (May 1988), 422-431. Google Scholar
- 16 GOTTLIEB, A. Scalability, combining and the NYU ultracomputer. Ohio State University Parallel Computing Workshop, Mar. 1990. Invited Lecture.Google Scholar
- 17 GOTTLIEB, A., GRISHMAN, R., KRUSKAL, C. P., McAuLIFFE, K. P., RUDOLPH, L., AND SNIR, M. The NYU Ultracomputer--Designing an MIMD shared memory parallel computer. IEEE Trans. Comput. C-32, 2 (Feb. 1983), 175-189.Google Scholar
- 18 GRAUNI<E, G., AND THAKKAR, S. Synchronization algorithms for shared-memory multiprocessors. Computer 23, 6 (June 1990), 60-69 Google Scholar
- 19 HENSGEN, D , FINKEL, R , AND MANBER, U. Two algorithms for barrier synchroniztion. Int. J. Parallel Program. 17, 1 (1988) 1-17. Google Scholar
- 20 HERLIHY, M. A methodology for implementing highly concurrent data structures. In Proceed~ngs of the Second A CM Symposium on Principles and Practice of Parallel Programming (Mar 1990), 197-206. Google Scholar
- 21 JAYASIMHA, D. N. Distributed synchronizers. In Proceedings of the 1988 Internatwnal Conference on Parallel Processing (Aug. 1988), 23-27.Google Scholar
- 22 KRUSKAL, C.P., RUDOLPH, L., AND SNm, M. Efficient synchronization on multiprocessors with shared memory. In Proceedings of the F~fih ACM Symposium on Principles of Dzstributecl Computing (1986), 218-228. Google Scholar
- 23 KUMAR, M., AND PFISTER, G.F. The onset of hot spot contention In Proceedings of the 1986 International Conference on Parallel Processing (1986), 28-34.Google Scholar
- 24 LAMPORT, L. A new solution of Dijkstra's concurrent programming problem. Commun. ACM 17, 8 (Aug. 1974), 453-455. Google Scholar
- 25 LAMPORT, L. The mutual exclusion problem: Part I--A theory of interprocess communication: Part II--Statement and solutions. J ACM 33, 2 (Apr. 1986), 313-348. Google Scholar
- 26 LAMPORT, L. A fast mutual exclusion algorithm. ACM Trans. Comput. Syst. 5, 1 (Feb. 1987) 1-11. Google Scholar
- 27 LEE, C.A. Barrier synchronization over multistage interconnection networks. In Proceedrags of the Second IEEE Symposium on Parallel and Distributed Processing (Dallas, Tex, Dec 1990), 130-133.Google Scholar
- 28 LEE, J., AND RAMACHANDRAN, U. Synchronization with multiprocessor caches. In Proceedrags of the Internatwnal Symposlum on Computer Architecture (May 1990), 27-37. Google Scholar
- 29 LUBACHEVSK~, B. An approach to automating the verification of compact parallel coordination programs. I Acta Inf. 21 (1984), 125-169.Google Scholar
- 30 LUBACnEVSKY, B. Synchronization barrier and related tools for shared memory parallel programming. In Proceedings of the 1989 International Converence on Parallel Processing (Aug 1989), II-175-II-179.Google Scholar
- 31 MAEKAWA, M. A ~ algorithm for mutual exclusion in decentralized systems. ACM Trans. Comput. Syst. 3, 2 (May 1985), 145-159. Google Scholar
- 32 MELLOR-CRUMMEY, J.M. Concurrent queues: Practical fetch-and-~ algorithms. Tech. Rep. 229, Computer Science Dept., Univ. of Rochester, Nov. 1987.Google Scholar
- 33 MELLOR-CRUMMEY, J. M., AND SCOTT, M. L. Algorithms for scalable synchronization on shared-memory multiprocessors. Tech. Rep. 342, Computer Science Dept., Univ. of Rochester, Apr. 1990. Also COMP TR90-114, Dept. of Computer Science, Rice Univ., May 1990 Google Scholar
- 34 OUSTERHOUT, J. K, SCELZA, D. A., AND SINDHU, P. S. Medusa: An experiment in distributed operating system structure. Commun. ACM 23, 2 (Feb. 1980), 92-105. Google Scholar
- 35 P1596 Working Group of the IEEE Computer Society Microprocessor Standards Committee. SCI (scalable coherent interface): An overview of extended cache-coherence protocols, Feb 5, 1990. Draft 0.59 P1596/Part III-D.Google Scholar
- 36 PETERSON, G. L. A new solution to Lamport's concurrent programming problem using small shared variables. ACM Trans. Program. Lang. Syst. 5, 1 (Jan. 1983), 56-65. Google Scholar
- 37 PFISTER, G., BRANTLEY, W. C., GEORGE, D. A., HARVEY, S. L., KLEINFELDER, W. J., McAvLIFFE, K. P., MELTON, T. A., NORTON, V A., AND WEISS, J. The IBM research parallel processor prototype (RP3): Introduction and architecture. In Proceedings of the 1985 International Conference on Parallel Processing (Aug. 1985), 764-771.Google Scholar
- 38 PFISTER, G., AND NORTON, V. A. "Hot spot" contention and combining in multistage interconnection networks. IEEE Trans. Comput. C-34, 10 (Oct. 1985), 943 948.Google Scholar
- 39 RAYMOND, K. A tree-based algorithm for distributed mutual exclusion. ACM Trans. Cornput. Syst. 7, i (Feb. 1989), 61-77. Google Scholar
- 40 RAYNAL, M. Algorithms for Mutual Exclusion. MIT Press Series in Scientific Computation. MIT Press, Cambridge, Mass., 1986. Translated from the French by D. Beeson. Google Scholar
- 41 REED, D. P., A~D KANODIA, R K. Synchronization with eventcounts and sequencers. Commun. ACM22, 2 (Feb. 1979), 115-123. Google Scholar
- 42 RETTBERG, R. D., CROWTHER, W. R., CARVEY, P. P., AND TOMHNSON, R. S. The Monarch parallel processor hardware design. Computer 23, 4 (Apr. 1990), 18-30 Google Scholar
- 43 RICART, G., AND AGRAWALA, A.K. An optimal algorithm for mutual exclusion in computer networks. Commun. ACM. 24, i (Jan. 1981) 9-17. Google Scholar
- 44 RUDOLPH, L., A~D SEGALL, Z. Dynamic decentralized cache schemes for MIMD parallel processors. In Proceedings of the International Symposium on Computer Architecture (1984), 340-347. Google Scholar
- 45 SANDERS, B. A. The information structure of distributed mutual exclusion algorithms. ACM Trans. Comput. Syst. 5, 3 (Aug. 1987), 284--299 Google Scholar
- 46 SCHNEIDER, F.B. Synchronization in distributed programs. ACM Trans. Program. Lang. Syst. 4, 2 (Apr. 1982), 179-195. Google Scholar
- 47 SCOTT, M. L., LEBLANC, T. J., AND MARSH, B. D. Multi-model parallel programming in Psyche. In Proceedings of the Second ACM Symposium on Principles and Practice of Parallel Programming (Mar. 1990), 70-78. Google Scholar
- 48 TANG, P., AND YEW, P.-C. Processor self-scheduling for multiple-nested parallel loops. In Proceedtngs of the 1986 International Conference on Parallel Processing (Aug. 1986), 528-535.Google Scholar
- 49 TANG, P., AND Y~W, P.-C. Algorithms for distributing hot-spot addressing CSRD Rep. 617, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana- Champaign, Jan. 1987.Google Scholar
- 50 YEW, P.-C. Architecture of the Cedar parallel supercomputer. CSRD Rep 609, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, Aug. 1986.Google Scholar
- 51 YEW, P.-C., TZENG, N.-F., AND LAWRIE, D.H. Distributing hot-spot addressing in large-scale multiprocessors. IEEE Trans. Comput. C-36, 4 (Apr. 1987), 388-395. Google Scholar
- 52 ZAHORIAN, J., LAZOWSKA, E. D., AND EAGER, D. L. The effect of scheduling discipline on spin overhead in shared memory parallel processors. Tech. Rep. TR-89-07-03, Computer Science Dept., Univ. of Washington, July 1989.Google Scholar
Index Terms
- Algorithms for scalable synchronization on shared-memory multiprocessors
Recommendations
Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors
Most multiprocessors are multiprogrammed to achieve acceptable response time and to increase their utilization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this ...
A low-latency scalable locking algorithm for shared memory multiprocessors
SPDP '94: Proceedings of the 1994 6th IEEE Symposium on Parallel and Distributed ProcessingIn this paper we deal exclusively with spin locks for small- to medium-scale shared memory multiprocessors. The performance of a lock depends on the contention for the critical section being protected by the lock. The selection of the most appropriate ...
Comments