
Tapir: Embedding Recursive Fork-join Parallelism into LLVM’s Intermediate Representation

Published: 17 December 2019

Abstract

Tapir (pronounced TAY-per) is a compiler intermediate representation (IR) that embeds recursive fork-join parallelism, as supported by task-parallel programming platforms such as Cilk and OpenMP, into a mainstream compiler’s IR. Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations on and across parallel control constructs. Remedying this situation has generally been thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir leverages the “serial-projection property,” which is commonly satisfied by task-parallel programs, to handle the semantics of these programs without an extensive rework of the compiler.

For recursive fork-join programs that satisfy the serial-projection property, Tapir enables effective compiler optimization of parallel programs with only minor changes to existing compiler analyses and code transformations. Tapir uses the serial-projection property to order logically parallel fine-grained tasks in the program’s control-flow graph. This ordered representation of parallel tasks allows the compiler to optimize parallel codes effectively with only minor modifications. For example, to implement Tapir/LLVM, a prototype of Tapir in the LLVM compiler, we added or modified less than 3,000 lines of LLVM’s half-million-line core middle-end functionality.

These changes sufficed to enable LLVM’s existing compiler optimizations for serial code—including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination—to work with parallel control constructs such as parallel loops and Cilk’s cilk_spawn keyword. Tapir also supports parallel optimizations, such as loop scheduling, which restructure the parallel control flow of the program. By making use of existing LLVM optimizations and new parallel optimizations, Tapir/LLVM can optimize recursive fork-join programs more effectively than traditional compilation methods. On a suite of 35 Cilk application benchmarks, Tapir/LLVM produces more efficient executables for 30 benchmarks, with faster 18-core running times for 26 of them, compared to a nearly identical compiler that compiles parallel linguistic constructs the traditional way.



      Reviews

      William M. Waite

      A typical compiler has a front end that analyzes the source text and converts it to a language-independent intermediate representation (IR) whose structure and operations support general-purpose optimization in the compiler's middle end. The quality of the optimization depends on the IR's ability to express the intent of source-language constructs in a way that the middle end understands and can improve. Although some compilers now support task-parallel linguistic constructs, their front ends convert those constructs into more primitive IR sequences, often involving opaque procedure calls that block further optimization. The authors took a different approach with Tapir: they extended the existing LLVM IR to encode logical parallelism and expose opportunities for standard compiler optimizations to work on parallel code. Integrating these extensions was relatively simple, and they turned out to be easy to maintain as LLVM changed. By avoiding the problems with opaque procedure calls, the extensions opened parallel code to a wide variety of standard optimizations.

      According to the authors, the Tapir approach is premised on the notion that parallel programs can usually be executed serially and that there is a normative serial execution order for tasks. This allows them to define the serial projection: a transformation that replaces a program's parallel constructs with serial constructs. If the serial projection produces one of the valid behaviors of the original program, then the compiler and runtime system may execute parallel tasks serially. The authors argue that the serial-projection property is satisfied by most task-parallel programming models, but not necessarily by arbitrary concurrency mechanisms; they therefore strongly advocate treating task parallelism and concurrency as separate concepts.

      The paper describes how Tapir represents logically parallel tasks, how LLVM's analyses are adapted to Tapir programs, and how the authors evaluated Tapir in practice. Each of these topics is covered in detail, with examples and discussion. The authors discuss related work and present a comprehensive retrospective on where they have been and where they are going. I found the paper easy to read and engrossing; it should be accessible to anyone with a basic understanding of code optimization.


      • Published in

        ACM Transactions on Parallel Computing, Volume 6, Issue 4 (December 2019), 188 pages
        ISSN: 2329-4949
        EISSN: 2329-4957
        DOI: 10.1145/3372747

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 17 December 2019
        • Accepted: 1 April 2019
        • Revised: 1 January 2019
        • Received: 1 April 2018


        Qualifiers

        • research-article
        • Research
        • Refereed
