Eliminating synchronization bottlenecks using adaptive replication

Published: 01 May 2003

Abstract

This article presents a new technique, adaptive replication, for automatically eliminating synchronization bottlenecks in multithreaded programs that perform atomic operations on objects. Synchronization bottlenecks occur when multiple threads attempt to concurrently update the same object. It is often possible to eliminate synchronization bottlenecks by replicating objects. Each thread can then update its own local replica without synchronization and without interacting with other threads. When the computation needs to access the original object, it combines the replicas to produce the correct values in the original object. One potential problem is that eagerly replicating all objects may lead to performance degradation and excessive memory consumption.

Adaptive replication eliminates unnecessary replication by dynamically detecting contention at each object to find and replicate only those objects that would otherwise cause synchronization bottlenecks. We have implemented adaptive replication in the context of a parallelizing compiler for a subset of C++. Given an unannotated sequential program written in C++, the compiler automatically extracts the concurrency, determines when it is legal to apply adaptive replication, and generates parallel code that uses adaptive replication to efficiently eliminate synchronization bottlenecks.

In addition to automatic parallelization and adaptive replication, our compiler also implements a lock coarsening transformation that increases the granularity at which the computation locks objects. The advantage is a reduction in the frequency with which the computation acquires and releases locks; the potential disadvantage is the introduction of new synchronization bottlenecks caused by increases in the sizes of the critical sections. Because the adaptive replication transformation takes place at lock acquisition sites, there is a synergistic interaction between lock coarsening and adaptive replication: lock coarsening drives down the overhead of using adaptive replication, and adaptive replication eliminates synchronization bottlenecks associated with the overaggressive use of lock coarsening.

Our experimental results show that, for our set of benchmark programs, the combination of lock coarsening and adaptive replication can eliminate synchronization bottlenecks and significantly reduce the synchronization and replication overhead as compared to versions that use neither or only one of the transformations.
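To make the mechanism concrete, the following C++ sketch shows the idea by hand for a simple accumulator object. This is an illustration only, not the code the compiler generates; the class, its names, and the trylock-based contention test are assumptions made for exposition. A thread first attempts to acquire the object's lock: on success it updates the original object directly, and on failure, which signals contention, it updates an unsynchronized thread-local replica instead. A combine step later folds the replicas back into the original object.

    #include <mutex>
    #include <vector>

    // Hypothetical replicable accumulator (illustrative names throughout).
    class Accumulator {
    public:
        explicit Accumulator(int nthreads) : replicas_(nthreads, 0.0) {}

        // Called at what would otherwise be a plain lock acquisition site.
        void add(int thread_id, double delta) {
            if (lock_.try_lock()) {            // uncontended: update original
                value_ += delta;
                lock_.unlock();
            } else {                           // contended: update the
                replicas_[thread_id] += delta; // thread-local replica, no locks
            }
        }

        // Combine the replicas into the original object. Assumes the parallel
        // update phase has finished, so no thread is still writing a replica.
        double combine() {
            for (double& r : replicas_) {
                value_ += r;
                r = 0.0;
            }
            return value_;
        }

    private:
        std::mutex lock_;
        double value_ = 0.0;                   // the original object
        std::vector<double> replicas_;         // one slot per thread
    };

Because a thread falls back to its replica only when an acquisition actually fails, uncontended objects are never replicated and pay essentially no extra memory or combining cost; this is what makes the replication adaptive. Note that the updates must be combinable, as the commutative additions in this accumulator are, for the combine step to reconstruct the correct value.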


Reviews


Reviewer: Olivier Louis Marie Lecarme

Long and dense, this paper contains five important contributions, each of which could be the basis of an interesting paper at a technical conference. As it stands, it can be recommended to anyone seriously studying current problems in the compilation of parallel programs.

The main problem the authors attack is implementing multithreaded programs efficiently. To avoid synchronization bottlenecks, which serialize large parts of the program when multiple threads attempt to update the same objects, one can replicate those objects and perform the updates on thread-local replicas. To reduce the cost of repeatedly acquiring and releasing synchronization locks, one can use lock coarsening, which increases the granularity of the locking. These two transformations seem contradictory; in fact, the authors show that they can interact in a positive way.

The contributions of this paper are: a static program analysis algorithm that determines when replication is legal; a program transformation algorithm that generates the replication code; an adaptive replication algorithm that dynamically adapts the program to its specific execution pattern; the identification of the synergy between lock coarsening and adaptive replication; and experimental results that characterize the performance impact of the two transformations on three benchmarks.

The authors' solution has been put to use in an experimental parallelizing compiler for a subset of C++. Although the three benchmarks are well described, as is the history of their construction, the specific subset of C++ used is not. The experimental results are described in much detail and occupy more than a third of the paper's length. The paper ends with an interesting review of related work.
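The trade-off between these two transformations can be made concrete with a small, hypothetical C++ sketch; the shared histogram and function names here are invented for illustration and are not taken from the paper:

    #include <mutex>
    #include <vector>

    std::mutex histogram_lock;
    std::vector<double> histogram(64, 0.0);   // hypothetical shared object

    // Fine-grained locking: one acquire/release per update.
    void update_fine(const std::vector<int>& keys) {
        for (int k : keys) {
            std::lock_guard<std::mutex> guard(histogram_lock);
            histogram[k] += 1.0;
        }
    }

    // After lock coarsening: one acquire/release for the whole loop.
    // The locking overhead drops, but the critical section now covers
    // every update, so a contended histogram becomes a new bottleneck;
    // adaptive replication, applied at this single acquisition site,
    // then removes it.
    void update_coarse(const std::vector<int>& keys) {
        std::lock_guard<std::mutex> guard(histogram_lock);
        for (int k : keys) {
            histogram[k] += 1.0;
        }
    }

Coarsening leaves only one acquisition site per loop, which is also why it lowers the cost of adaptive replication: the contention test and any replication happen once per loop rather than once per update.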
