skip to main content
10.1145/1542275.1542303acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article

Synchronization optimizations for efficient execution on multi-cores

Published: 08 June 2009 Publication History

Abstract

Multi-cores are becoming ubiquitous as exemplified by Sun's Niagra-2, Intel's Nehalem and AMD's Sau Paulo octal cores. The number of cores per chip is expected to rise in foreseeable future, as evidenced by the recently announced Intel's 80-core Teraflops Research Chip. Exploiting the parallelism of multicores necessitates concurrent software. One way to parallelize programs, not amenable to auto-parallelization, is via explicit synchronization. The placement of the synchronization primitives has a large bearing on how much thread-level parallelism (TLP) can be achieved. In this paper, we propose novel predication-based and other adjunct synchronization optimizations which facilitate exploitation on higher level of TLP than what can be achieved using the state-of-the-art. We demonstrate the efficacy of our techniques, on a real machine, using real codes, specifically, from the industry-standard SPEC CPU benchmarks and other widely used open source codes such as PostgreSQL. Our results show that the proposed techniques yield significantly higher levels of TLP than the state-of-the-art.

References

[1]
S. F. Lundstrom and G. H. Barnes. A controllable MIMD architectures. In Proceedings of the 1980 International Conference on Parallel Processing, pages 19--27, St. Charles, IL, August 1980.
[2]
U. Banerjee. Dependence Analysis. Kluwer Academic Publishers, Boston, MA, 1997.
[3]
A. Nicolau, G. Li, and A. Kejariwal. Techniques for efficient placement of synchronization primitives. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Raleigh, NC, February 2009.
[4]
OpenMP Specification, version 2.5. http://www.openmp.org/drupal/mp-documents/spec25.pdf.
[5]
SPEC CPU Benchmarks. http://www.spec.org/benchmarks.html.
[6]
PostgreSQL. http://www.postgresql.org/.
[7]
A. Kejariwal, X. Tian, W. Li, M. Girkar, S. Kozhukhov, H. Saito, U. Banerjee, A. Nicolau, A. V. Veidenbaum, and C. D. Polychronopoulos. On the performance potential of different types of speculative thread-level parallelism. In Proceedings of the 20th ACM International Conference on Supercomputing, pages 24--35, Cairns, Australia, 2006.
[8]
SPEC CINT2006. http://www.spec.org/cpu2006/CINT2006.
[9]
K. Karplus and A. Nicolau. Efficient hardware for multiway jumps and pre-fetches. In Proceedings of the 18th annual workshop on Micro-programming, pages 11--18, 1985.
[10]
D. Kuck. The Structure of Computers and Computations, VOLUME 1. John Wiley and Sons, New York, NY, 1978.
[11]
M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA, 1996.
[12]
J. A. Fisher. Trace Scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, C-30(7):478--490, July 1981.
[13]
S. Muchnick. Advanced Compiler Design Implementation. Second edition, 2000.
[14]
R. Johnson and M. Schlansker. Analysis techniques for predicated code. In Proceedings of the 29th International Symposium of Microarchitecture MICRO-29, pages 100--113, Paris, France, 1996.
[15]
J. W. Sias, W.-M. W. Hwu, and D. I. August. Accurate and efficient predicate analysis with binary decision diagrams. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 112--123, Monterey, CA, 2000.
[16]
L. Shen, Z. Wang, and J. Lu. Predicate analysis based on path information. In Advanced Parallel Processing Technologies, LNCS, 2834, pages 147--151, 2003.
[17]
J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Conference Record of the Tenth Annual ACM Symposium on the Principles of Programming Languages, Austin, TX, January 1983.
[18]
J. Park and M. Schlansker. On predicated execution. Technical Report 58--91, Hewlett Packard Laboratories, 1991.
[19]
D. N. Pnevmatikatos and G. S. Sohi. Guarded execution and branch prediction in dynamic ILP processors. SIGARCH Computer Architecture News, 22(2):120--129, 1994.
[20]
D. M. Gillies, D. c. Roy Ju, R. Johnson, and M. Schlansker. Global predicate analysis and its application to register allocation. In Proceedings of the 29th International Symposium of Microarchitecture MICRO-29, pages 114--125, Paris, France, 1996.
[21]
R. Fitzgerald, T. B. Knoblock, E. Ruf, B. Steensgaard, and D. Tarditi. Marmot: an optimizing compiler for Java. Technical report, Microsoft, November 1998.
[22]
J. Aldrich, C. Chambers, E. G. Sirer, and S. J. Eggers. Static analyses for eliminating unnecessary synchronization from java programs. In Proceedings of the 6th International Symposium on Static Analysis, pages 19--38, 1999.
[23]
A. Heydon and M. Najork. Performance limitations of the java core libraries. In Proceedings of the ACM 1999 conference on Java Grande, pages 35--41, San Francisco, CA, 1999.
[24]
J. Bogda and U. Hölzle. Removing unnecessary synchronization in java. SIGPLAN Notices, 34(10):35--46, 1999.
[25]
D. F. Bacon, R. Konuru, C. Murthy, and M. J. Serrano. Thin locks: Featherweight synchronization for java. ACM SIGPLAN Notices, 39(4):583--595, 2004.
[26]
S. Sridharan, A. Rodrigues, and P. Kogge. Evaluating synchronization techniques for light-weight multithreaded/multicore architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 57--58, San Diego, CA, 2007.
[27]
W. Zhu, V. C Sreedhar, Z. Hu, and G. R. Gao. Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. In Proceedings of the 34th International Symposium on Computer Architecture, pages 35--45, San Diego, CA, 2007.
[28]
A. Kejariwal, X. Tian, H. Saito, W. Li, M. Girkar, U. Banerjee, A. Nicolau, and C. D. Polychronopoulos. Lightweight lock-free synchronization methods for multithreading. In Proceedings of the 20th ACM International Conference on Supercomputing, pages 361--371, Cairns, Australia, 2006.
[29]
Z. Ammarguellat and W. Harrison. Automatic recognition of induction variables and recurrence relations by abstract interpretation. In Proceedings of the SIGPLAN'90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990.
[30]
W. M. Pottenger. Induction variable substitution and reduction recognition in the Polaris parallelizing compiler. Master's thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1995.
[31]
M. P. Gerlek, E. Stoltz, and M. Wolfe. Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form. ACM Transactions on Programming Languages and Systems, 17(1):85--122, 1995.
[32]
M. R. Haghighat and Constantine D. Polychronopoulos. Symbolic analysis for parallelizing compilers. ACM Transactions on Programming Languages and Systems, 18(4):477--518, July 1996.
[33]
P. Wu, A. Cohen, J. Hoe inger, and D. Padua. Monotonic evolution: An alternative to induction variable substitution for dependence analysis. In Proceedings of the 15th ACM International Conference on Supercomputing, pages 78--91, Sorrento, Italy, 2001.
[34]
D. Gries. Compiler Construction for Digital Computers. John Wiley and Sons, 1971.
[35]
A. V. Aho and J. D. Ullman. The theory of parsing, translation and compiling. In Compiling. Prentice-Hall, Englewood Cliffs, NJ, 1973.
[36]
D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Williamsburg, VA, January 1981.
[37]
R. Cytron and J. Ferrante. What's in a name? or the value of renaming for parallelism detection and storage allocation. In Proceedings of the 1987 International Conference on Parallel Processing, St. Charles, IL, August 1987.
[38]
S. Weatherford. High level pattern matching extension to C++ for fortran program manipulation in Polaris. Master's thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, May 1994.
[39]
Wine. http://sourceforge.net/project/showfiles.php?group id=6241.
[40]
SPEC CPU2006. http://www.spec.org/cpu2006.
[41]
MySQL: The world's most popular open source database. http://www.mysql.com/.
[42]
A. Kejariwal and A. Nicolau. Reading list of mutual exclusion, locking, synchronization and concurrent objects. http://www.ics.uci.edu/-akejariw/ConcurrentExecutionReadingList.pdf.
[43]
R. Cytron. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing, pages 836--844, St. Charles, IL, August 1986.
[44]
S. Midkiff and D. Padua. Compiler generated synchronization for DO loops. In Proceedings of the 1986 International Conference on Parallel Processing, pages 544--551, St. Charles, IL, August 1986.
[45]
H. Kasahara, H. Honda, M. Iwata, and M. Hirota. A compilation scheme for macro-dataow computation on hierarchical multiprocessor systems. In Proceedings of the International Conference on Parallel Processing, pages II294--II295, Urbana-Champaign, IL, August 1990.
[46]
M. B. Girkar. Functional Parallelism Theoretical Foundations and Implementation. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, December 1991.
[47]
R. Cytron, M. Hind, and W. Hsieh. Automatic generation of DAG parallelism. In Proceedings of the SIGPLAN'89 Conference on Programming Language Design and Implementation, pages 54--68, Portland, OR, 1989.
[48]
V. Sarkar. Instruction reordering for fork-join parallelism. In Proceedings of the SIGPLAN'90 Conference on Programming Language Design and Implementation, pages 322--336, White Plains, NY, 1990.
[49]
W. N. Scherer III and M. L. Scott. Nonblocking concurrent objects with condition synchronization. In Proceedings of the 18th International Symposium on Distributed Computing, Amsterdam, The Netherlands, Oct 2004.
[50]
F. Fich, D. Hendler, and N. Shavit. On the inherent weakness of conditional synchronization primitives. http://www.cs.tau.ac.il/-afek/Handler-conditionals.pdf.
[51]
M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463--492, 1990.
[52]
P. Diniz and M. Rinard. Synchronization transformations for parallel computing. In Proceedings of the Twenty-fourth Annual ACM Symposium on the Principles of Programming Languages, pages 187--200, Paris, France, 1997.
[53]
P. C. Diniz and M. C. Rinard. Lock coarsening: eliminating lock over-head in automatically parallelized object-based programs. Journal of Parallel and Distributed Computing, 49(2):218--244, 1998.
[54]
M. Rinard. Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives. ACM Transactions on Computer Systems, 17(4):337--371, 1999.
[55]
C.-W. Tseng. Compiler optimizations for eliminating barrier synchronization. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995.
[56]
S. S. Bhattacharyya, S. Sriram, and E. A. Lee. Resynchronization of multiprocessor schedules Part I: Fundamental concepts and unbounded latency analysis. Memorandum UCB/ERL M96/55, Electronics Research Laboratory, U. C. Berkeley, October 1996.
[57]
D.-K. Chen and P.-C. Yew. Redundant synchronization elimination for DOACROSS loops. IEEE Transactions on Parallel and Distributed Systems, 10(5):459--470, 1999.
[58]
A. Nicolau, R. Potasman, and H. Wang. Register allocation, renaming and their impact on fine-grain parallelism. In Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, pages 218--235, 1991.
[59]
A. Moshovos and G. S. Sohi. Streamlining inter-operation memory communication via data dependence prediction. In Proceedings of the 30th International Symposium of Microarchitecture, pages 235--245, Research Triangle Park, NC, 1997.
[60]
G. S. Tyson and T. M. Austin. Memory renaming: Fast, early and accurate processing of memory communication. International Journal of Parallel Programming, 27(5):357--380, 1999.
[61]
G. Reinman, B. Calder, D. Tullsen, G. Tyson, and T. Austin. Classifying load and store instructions for memory renaming. In Proceedings of the 13th ACM International Conference on Supercomputing, pages 399--407, Rhodes, Greece, 1999.
[62]
K. Knobe and V. Sarkar. Array SSA form and its use in parallelization. pages 107--120, San Diego, CA, 1998.
[63]
J.-F. Collard. Array SSA for explicitly parallel programs. In Proceedings of the 5th International Euro-Par Conference, pages 383--390, Toulouse, France, August 1999.
[64]
S. Rus, G. He, C. Alias, and L. Rauchwerger. Region array ssa. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pages 43--52, Seattle, WA, 2006.

Cited By

View all
  • (2016)Compile-Time Automatic Synchronization Insertion and Redundant Synchronization Elimination for GPU Kernels2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)10.1109/ICPADS.2016.0112(826-834)Online publication date: Dec-2016
  • (2015)Unique Worker model for OpenMPProceedings of the 29th ACM on International Conference on Supercomputing10.1145/2751205.2751238(47-56)Online publication date: 8-Jun-2015
  • (2015)poclInternational Journal of Parallel Programming10.1007/s10766-014-0320-y43:5(752-785)Online publication date: 1-Oct-2015
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ICS '09: Proceedings of the 23rd international conference on Supercomputing
June 2009
544 pages
ISBN:9781605584980
DOI:10.1145/1542275
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2009

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. code motion
  2. compilers
  3. multithreading
  4. synchronization

Qualifiers

  • Research-article

Conference

ICS '09
Sponsor:
ICS '09: International Conference on Supercomputing
June 8 - 12, 2009
NY, Yorktown Heights, USA

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)8
  • Downloads (Last 6 weeks)0
Reflects downloads up to 07 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2016)Compile-Time Automatic Synchronization Insertion and Redundant Synchronization Elimination for GPU Kernels2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)10.1109/ICPADS.2016.0112(826-834)Online publication date: Dec-2016
  • (2015)Unique Worker model for OpenMPProceedings of the 29th ACM on International Conference on Supercomputing10.1145/2751205.2751238(47-56)Online publication date: 8-Jun-2015
  • (2015)poclInternational Journal of Parallel Programming10.1007/s10766-014-0320-y43:5(752-785)Online publication date: 1-Oct-2015
  • (2014)HELIX-RCProceeding of the 41st annual international symposium on Computer architecuture10.5555/2665671.2665705(217-228)Online publication date: 14-Jun-2014
  • (2014)HELIX-RCACM SIGARCH Computer Architecture News10.1145/2678373.266570542:3(217-228)Online publication date: 14-Jun-2014
  • (2014)HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA)10.1109/ISCA.2014.6853215(217-228)Online publication date: Jun-2014
  • (2013)Interprocedural strength reduction of critical sections in explicitly-parallel programsProceedings of the 22nd international conference on Parallel architectures and compilation techniques10.5555/2523721.2523729(29-40)Online publication date: 7-Oct-2013
  • (2013)A Transformation Framework for Optimizing Task-Parallel ProgramsACM Transactions on Programming Languages and Systems10.1145/2450136.245013835:1(1-48)Online publication date: 1-Apr-2013
  • (2013)S-CAVE: effective SSD caching to improve virtual machine storage performanceProceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT.2013.6618801(103-112)Online publication date: Oct-2013
  • (2013)Automatically exploiting cross-invocation parallelism using runtime informationProceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2013.6495001(1-11)Online publication date: 23-Feb-2013
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media