Automatic Matching of Legacy Code to Heterogeneous APIs: An Idiomatic Approach

Published: 19 March 2018

Abstract

Heterogeneous accelerators often disappoint. They offer the prospect of great performance, but deliver it only when programs use vendor-specific optimized libraries or domain-specific languages. This requires considerable modification of legacy code, hindering the adoption of heterogeneous computing. This paper develops a novel approach to automatically detect opportunities for accelerator exploitation. We focus on computations that are well supported by established APIs: sparse and dense linear algebra, stencil codes, and generalized reductions and histograms. We call these idioms and use a custom constraint-based Idiom Description Language (IDL) to discover them within user code. Detected idioms are then mapped to BLAS libraries, the cuSPARSE and clSPARSE sparse libraries, and two DSLs: Halide and Lift. We implemented the approach in LLVM and evaluated it on the sequential C/C++ NAS and Parboil benchmarks, where we detect 60 idiom instances. In cases where idioms account for a significant fraction of sequential execution time, we generate code that achieves 1.26x to over 20x speedup on integrated and external GPUs.


• Published in

  ACM SIGPLAN Notices, Volume 53, Issue 2 (ASPLOS '18)
  February 2018
  809 pages
  ISSN: 0362-1340
  EISSN: 1558-1160
  DOI: 10.1145/3296957

  ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
  March 2018
  827 pages
  ISBN: 9781450349116
  DOI: 10.1145/3173162

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • research-article
