research-article

Synchronization optimizations for efficient execution on multi-cores

Authors:

Alexandru Nicolau,

Alexander V. Veidenbaum,

Arun KejariwalAuthors Info & Claims

ICS '09: Proceedings of the 23rd international conference on Supercomputing

Pages 169 - 180

https://doi.org/10.1145/1542275.1542303

Published: 08 June 2009 Publication History

Abstract

Multi-cores are becoming ubiquitous as exemplified by Sun's Niagra-2, Intel's Nehalem and AMD's Sau Paulo octal cores. The number of cores per chip is expected to rise in foreseeable future, as evidenced by the recently announced Intel's 80-core Teraflops Research Chip. Exploiting the parallelism of multicores necessitates concurrent software. One way to parallelize programs, not amenable to auto-parallelization, is via explicit synchronization. The placement of the synchronization primitives has a large bearing on how much thread-level parallelism (TLP) can be achieved. In this paper, we propose novel predication-based and other adjunct synchronization optimizations which facilitate exploitation on higher level of TLP than what can be achieved using the state-of-the-art. We demonstrate the efficacy of our techniques, on a real machine, using real codes, specifically, from the industry-standard SPEC CPU benchmarks and other widely used open source codes such as PostgreSQL. Our results show that the proposed techniques yield significantly higher levels of TLP than the state-of-the-art.

References

[1]

S. F. Lundstrom and G. H. Barnes. A controllable MIMD architectures. In Proceedings of the 1980 International Conference on Parallel Processing, pages 19--27, St. Charles, IL, August 1980.

[2]

U. Banerjee. Dependence Analysis. Kluwer Academic Publishers, Boston, MA, 1997.

Digital Library

[3]

A. Nicolau, G. Li, and A. Kejariwal. Techniques for efficient placement of synchronization primitives. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Raleigh, NC, February 2009.

Digital Library

[4]

OpenMP Specification, version 2.5. http://www.openmp.org/drupal/mp-documents/spec25.pdf.

[5]

SPEC CPU Benchmarks. http://www.spec.org/benchmarks.html.

[6]

PostgreSQL. http://www.postgresql.org/.

[7]

A. Kejariwal, X. Tian, W. Li, M. Girkar, S. Kozhukhov, H. Saito, U. Banerjee, A. Nicolau, A. V. Veidenbaum, and C. D. Polychronopoulos. On the performance potential of different types of speculative thread-level parallelism. In Proceedings of the 20th ACM International Conference on Supercomputing, pages 24--35, Cairns, Australia, 2006.

Digital Library

[8]

SPEC CINT2006. http://www.spec.org/cpu2006/CINT2006.

[9]

K. Karplus and A. Nicolau. Efficient hardware for multiway jumps and pre-fetches. In Proceedings of the 18th annual workshop on Micro-programming, pages 11--18, 1985.

Digital Library

[10]

D. Kuck. The Structure of Computers and Computations, VOLUME 1. John Wiley and Sons, New York, NY, 1978.

Digital Library

[11]

M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA, 1996.

Digital Library

[12]

J. A. Fisher. Trace Scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, C-30(7):478--490, July 1981.

Digital Library

[13]

S. Muchnick. Advanced Compiler Design Implementation. Second edition, 2000.

Digital Library

[14]

R. Johnson and M. Schlansker. Analysis techniques for predicated code. In Proceedings of the 29th International Symposium of Microarchitecture MICRO-29, pages 100--113, Paris, France, 1996.

Digital Library

[15]

J. W. Sias, W.-M. W. Hwu, and D. I. August. Accurate and efficient predicate analysis with binary decision diagrams. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 112--123, Monterey, CA, 2000.

Digital Library

[16]

L. Shen, Z. Wang, and J. Lu. Predicate analysis based on path information. In Advanced Parallel Processing Technologies, LNCS, 2834, pages 147--151, 2003.

[17]

J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Conference Record of the Tenth Annual ACM Symposium on the Principles of Programming Languages, Austin, TX, January 1983.

Digital Library

[18]

J. Park and M. Schlansker. On predicated execution. Technical Report 58--91, Hewlett Packard Laboratories, 1991.

[19]

D. N. Pnevmatikatos and G. S. Sohi. Guarded execution and branch prediction in dynamic ILP processors. SIGARCH Computer Architecture News, 22(2):120--129, 1994.

Digital Library

[20]

D. M. Gillies, D. c. Roy Ju, R. Johnson, and M. Schlansker. Global predicate analysis and its application to register allocation. In Proceedings of the 29th International Symposium of Microarchitecture MICRO-29, pages 114--125, Paris, France, 1996.

Digital Library

[21]

R. Fitzgerald, T. B. Knoblock, E. Ruf, B. Steensgaard, and D. Tarditi. Marmot: an optimizing compiler for Java. Technical report, Microsoft, November 1998.

[22]

J. Aldrich, C. Chambers, E. G. Sirer, and S. J. Eggers. Static analyses for eliminating unnecessary synchronization from java programs. In Proceedings of the 6th International Symposium on Static Analysis, pages 19--38, 1999.

Digital Library

[23]

A. Heydon and M. Najork. Performance limitations of the java core libraries. In Proceedings of the ACM 1999 conference on Java Grande, pages 35--41, San Francisco, CA, 1999.

Digital Library

[24]

J. Bogda and U. Hölzle. Removing unnecessary synchronization in java. SIGPLAN Notices, 34(10):35--46, 1999.

Digital Library

[25]

D. F. Bacon, R. Konuru, C. Murthy, and M. J. Serrano. Thin locks: Featherweight synchronization for java. ACM SIGPLAN Notices, 39(4):583--595, 2004.

Digital Library

[26]

S. Sridharan, A. Rodrigues, and P. Kogge. Evaluating synchronization techniques for light-weight multithreaded/multicore architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 57--58, San Diego, CA, 2007.

Digital Library

[27]

W. Zhu, V. C Sreedhar, Z. Hu, and G. R. Gao. Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. In Proceedings of the 34th International Symposium on Computer Architecture, pages 35--45, San Diego, CA, 2007.

Digital Library

[28]

A. Kejariwal, X. Tian, H. Saito, W. Li, M. Girkar, U. Banerjee, A. Nicolau, and C. D. Polychronopoulos. Lightweight lock-free synchronization methods for multithreading. In Proceedings of the 20th ACM International Conference on Supercomputing, pages 361--371, Cairns, Australia, 2006.

Digital Library

[29]

Z. Ammarguellat and W. Harrison. Automatic recognition of induction variables and recurrence relations by abstract interpretation. In Proceedings of the SIGPLAN'90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990.

Digital Library

[30]

W. M. Pottenger. Induction variable substitution and reduction recognition in the Polaris parallelizing compiler. Master's thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1995.

[31]

M. P. Gerlek, E. Stoltz, and M. Wolfe. Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form. ACM Transactions on Programming Languages and Systems, 17(1):85--122, 1995.

Digital Library

[32]

M. R. Haghighat and Constantine D. Polychronopoulos. Symbolic analysis for parallelizing compilers. ACM Transactions on Programming Languages and Systems, 18(4):477--518, July 1996.

Digital Library

[33]

P. Wu, A. Cohen, J. Hoe inger, and D. Padua. Monotonic evolution: An alternative to induction variable substitution for dependence analysis. In Proceedings of the 15th ACM International Conference on Supercomputing, pages 78--91, Sorrento, Italy, 2001.

Digital Library

[34]

D. Gries. Compiler Construction for Digital Computers. John Wiley and Sons, 1971.

Digital Library

[35]

A. V. Aho and J. D. Ullman. The theory of parsing, translation and compiling. In Compiling. Prentice-Hall, Englewood Cliffs, NJ, 1973.

[36]

D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Williamsburg, VA, January 1981.

Digital Library

[37]

R. Cytron and J. Ferrante. What's in a name? or the value of renaming for parallelism detection and storage allocation. In Proceedings of the 1987 International Conference on Parallel Processing, St. Charles, IL, August 1987.

[38]

S. Weatherford. High level pattern matching extension to C++ for fortran program manipulation in Polaris. Master's thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, May 1994.

[39]

Wine. http://sourceforge.net/project/showfiles.php?group id=6241.

[40]

SPEC CPU2006. http://www.spec.org/cpu2006.

[41]

MySQL: The world's most popular open source database. http://www.mysql.com/.

[42]

A. Kejariwal and A. Nicolau. Reading list of mutual exclusion, locking, synchronization and concurrent objects. http://www.ics.uci.edu/-akejariw/ConcurrentExecutionReadingList.pdf.

[43]

R. Cytron. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing, pages 836--844, St. Charles, IL, August 1986.

[44]

S. Midkiff and D. Padua. Compiler generated synchronization for DO loops. In Proceedings of the 1986 International Conference on Parallel Processing, pages 544--551, St. Charles, IL, August 1986.

[45]

H. Kasahara, H. Honda, M. Iwata, and M. Hirota. A compilation scheme for macro-dataow computation on hierarchical multiprocessor systems. In Proceedings of the International Conference on Parallel Processing, pages II294--II295, Urbana-Champaign, IL, August 1990.

[46]

M. B. Girkar. Functional Parallelism Theoretical Foundations and Implementation. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, December 1991.

Digital Library

[47]

R. Cytron, M. Hind, and W. Hsieh. Automatic generation of DAG parallelism. In Proceedings of the SIGPLAN'89 Conference on Programming Language Design and Implementation, pages 54--68, Portland, OR, 1989.

Digital Library

[48]

V. Sarkar. Instruction reordering for fork-join parallelism. In Proceedings of the SIGPLAN'90 Conference on Programming Language Design and Implementation, pages 322--336, White Plains, NY, 1990.

Digital Library

[49]

W. N. Scherer III and M. L. Scott. Nonblocking concurrent objects with condition synchronization. In Proceedings of the 18th International Symposium on Distributed Computing, Amsterdam, The Netherlands, Oct 2004.

[50]

F. Fich, D. Hendler, and N. Shavit. On the inherent weakness of conditional synchronization primitives. http://www.cs.tau.ac.il/-afek/Handler-conditionals.pdf.

[51]

M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463--492, 1990.

Digital Library

[52]

P. Diniz and M. Rinard. Synchronization transformations for parallel computing. In Proceedings of the Twenty-fourth Annual ACM Symposium on the Principles of Programming Languages, pages 187--200, Paris, France, 1997.

Digital Library

[53]

P. C. Diniz and M. C. Rinard. Lock coarsening: eliminating lock over-head in automatically parallelized object-based programs. Journal of Parallel and Distributed Computing, 49(2):218--244, 1998.

Digital Library

[54]

M. Rinard. Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives. ACM Transactions on Computer Systems, 17(4):337--371, 1999.

Digital Library

[55]

C.-W. Tseng. Compiler optimizations for eliminating barrier synchronization. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995.

Digital Library

[56]

S. S. Bhattacharyya, S. Sriram, and E. A. Lee. Resynchronization of multiprocessor schedules Part I: Fundamental concepts and unbounded latency analysis. Memorandum UCB/ERL M96/55, Electronics Research Laboratory, U. C. Berkeley, October 1996.

[57]

D.-K. Chen and P.-C. Yew. Redundant synchronization elimination for DOACROSS loops. IEEE Transactions on Parallel and Distributed Systems, 10(5):459--470, 1999.

Digital Library

[58]

A. Nicolau, R. Potasman, and H. Wang. Register allocation, renaming and their impact on fine-grain parallelism. In Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, pages 218--235, 1991.

Digital Library

[59]

A. Moshovos and G. S. Sohi. Streamlining inter-operation memory communication via data dependence prediction. In Proceedings of the 30th International Symposium of Microarchitecture, pages 235--245, Research Triangle Park, NC, 1997.

Digital Library

[60]

G. S. Tyson and T. M. Austin. Memory renaming: Fast, early and accurate processing of memory communication. International Journal of Parallel Programming, 27(5):357--380, 1999.

Digital Library

[61]

G. Reinman, B. Calder, D. Tullsen, G. Tyson, and T. Austin. Classifying load and store instructions for memory renaming. In Proceedings of the 13th ACM International Conference on Supercomputing, pages 399--407, Rhodes, Greece, 1999.

Digital Library

[62]

K. Knobe and V. Sarkar. Array SSA form and its use in parallelization. pages 107--120, San Diego, CA, 1998.

[63]

J.-F. Collard. Array SSA for explicitly parallel programs. In Proceedings of the 5th International Euro-Par Conference, pages 383--390, Toulouse, France, August 1999.

Digital Library

[64]

S. Rus, G. He, C. Alias, and L. Rauchwerger. Region array ssa. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pages 43--52, Seattle, WA, 2006.

Digital Library

Cited By

Liu LLiu MWang CWang J(2016)Compile-Time Automatic Synchronization Insertion and Redundant Synchronization Elimination for GPU Kernels2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)10.1109/ICPADS.2016.0112(826-834)Online publication date: Dec-2016
https://doi.org/10.1109/ICPADS.2016.0112
Aloor RNandivada VBhuyan LChong FSarkar V(2015)Unique Worker model for OpenMPProceedings of the 29th ACM on International Conference on Supercomputing10.1145/2751205.2751238(47-56)Online publication date: 8-Jun-2015
https://dl.acm.org/doi/10.1145/2751205.2751238
Jääskeläinen PLama CSchnetter ERaiskila KTakala JBerg H(2015)poclInternational Journal of Parallel Programming10.1007/s10766-014-0320-y43:5(752-785)Online publication date: 1-Oct-2015
https://dl.acm.org/doi/10.1007/s10766-014-0320-y
Show More Cited By

Index Terms

Synchronization optimizations for efficient execution on multi-cores
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

Techniques for efficient placement of synchronization primitives
PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

Harnessing the hardware parallelism of the emerging multi-cores systems necessitates concurrent software. Unfortunately, most of the existing mainstream software is sequential in nature. Although one could auto-parallelize a given program, the efficacy ...
Techniques for efficient placement of synchronization primitives
PPoPP '09

Harnessing the hardware parallelism of the emerging multi-cores systems necessitates concurrent software. Unfortunately, most of the existing mainstream software is sequential in nature. Although one could auto-parallelize a given program, the efficacy ...
An evaluation of speculative instruction execution on simultaneous multithreaded processors

Modern superscalar processors rely heavily on speculative execution for performance. For example, our measurements show that on a 6-issue superscalar, 93% of committed instructions for SPECINT95 are speculative. Without speculation, processor resources ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '09: Proceedings of the 23rd international conference on Supercomputing

June 2009

544 pages

ISBN:9781605584980

DOI:10.1145/1542275

General Chairs:
Michael Gschwind
IBM TJ Watson Research Center, USA
,
Alex Nicolau
UC Irvine, USA
,
Program Chairs:
Valentina Salapura
IBM TJ Watson Research Center, USA
,
José Moreira
IBM TJ Watson Research Center, USA

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2009

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

ICS '09

Sponsor:

ICS '09: International Conference on Supercomputing

June 8 - 12, 2009

NY, Yorktown Heights, USA

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

14
Total Citations
View Citations
558
Total Downloads

Downloads (Last 12 months)8
Downloads (Last 6 weeks)0

Reflects downloads up to 07 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Liu LLiu MWang CWang J(2016)Compile-Time Automatic Synchronization Insertion and Redundant Synchronization Elimination for GPU Kernels2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)10.1109/ICPADS.2016.0112(826-834)Online publication date: Dec-2016
https://doi.org/10.1109/ICPADS.2016.0112
Aloor RNandivada VBhuyan LChong FSarkar V(2015)Unique Worker model for OpenMPProceedings of the 29th ACM on International Conference on Supercomputing10.1145/2751205.2751238(47-56)Online publication date: 8-Jun-2015
https://dl.acm.org/doi/10.1145/2751205.2751238
Jääskeläinen PLama CSchnetter ERaiskila KTakala JBerg H(2015)poclInternational Journal of Parallel Programming10.1007/s10766-014-0320-y43:5(752-785)Online publication date: 1-Oct-2015
https://dl.acm.org/doi/10.1007/s10766-014-0320-y
Campanoni SBrownell KKanev SJones TWei GBrooks DYew PZhai AKeckler S(2014)HELIX-RCProceeding of the 41st annual international symposium on Computer architecuture10.5555/2665671.2665705(217-228)Online publication date: 14-Jun-2014
https://dl.acm.org/doi/10.5555/2665671.2665705
Campanoni SBrownell KKanev SJones TWei GBrooks D(2014)HELIX-RCACM SIGARCH Computer Architecture News10.1145/2678373.266570542:3(217-228)Online publication date: 14-Jun-2014
https://dl.acm.org/doi/10.1145/2678373.2665705
Campanoni SBrownell KKanev SJones TWei GBrooks D(2014)HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA)10.1109/ISCA.2014.6853215(217-228)Online publication date: Jun-2014
https://doi.org/10.1109/ISCA.2014.6853215
Barik RZhao JSarkar VFensch CO'Boyle MSeznec ABodin F(2013)Interprocedural strength reduction of critical sections in explicitly-parallel programsProceedings of the 22nd international conference on Parallel architectures and compilation techniques10.5555/2523721.2523729(29-40)Online publication date: 7-Oct-2013
https://dl.acm.org/doi/10.5555/2523721.2523729
Nandivada VShirako JZhao JSarkar V(2013)A Transformation Framework for Optimizing Task-Parallel ProgramsACM Transactions on Programming Languages and Systems10.1145/2450136.245013835:1(1-48)Online publication date: 1-Apr-2013
https://dl.acm.org/doi/10.1145/2450136.2450138
Barik RJisheng Zhao Sarkar V(2013)S-CAVE: effective SSD caching to improve virtual machine storage performanceProceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT.2013.6618801(103-112)Online publication date: Oct-2013
https://doi.org/10.1109/PACT.2013.6618801
August DHuang JBeard SJohnson NJablin T(2013)Automatically exploiting cross-invocation parallelism using runtime informationProceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2013.6495001(1-11)Online publication date: 23-Feb-2013
https://dl.acm.org/doi/10.1109/CGO.2013.6495001
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten