Automatic Matching of Legacy Code to Heterogeneous APIs: An Idiomatic Approach

Published: 19 March 2018

Abstract

Heterogeneous accelerators often disappoint. They offer the prospect of great performance, but deliver it only when programs use vendor-specific optimized libraries or domain-specific languages. This requires considerable modification of legacy code, hindering the adoption of heterogeneous computing. This paper develops a novel approach to automatically detect opportunities for accelerator exploitation. We focus on computations that are well supported by established APIs: sparse and dense linear algebra, stencil codes, and generalized reductions and histograms. We call these idioms and use a custom constraint-based Idiom Description Language (IDL) to discover them within user code. Detected idioms are then mapped to BLAS libraries, the cuSPARSE and clSPARSE sparse libraries, and two DSLs: Halide and Lift. We implemented the approach in LLVM and evaluated it on the sequential C/C++ NAS and Parboil benchmarks, where we detect 60 idiom instances. In cases where idioms account for a significant fraction of sequential execution time, we generate code that achieves 1.26x to over 20x speedup on integrated and external GPUs.


• Published in

  ACM SIGPLAN Notices, Volume 53, Issue 2 (ASPLOS '18)
  February 2018
  809 pages
  ISSN: 0362-1340
  EISSN: 1558-1160
  DOI: 10.1145/3296957

  ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
  March 2018
  827 pages
  ISBN: 9781450349116
  DOI: 10.1145/3173162

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • research-article
