research-article

High performance stencil code generation with Lift

Authors:

Bastian Hagedorn,

Larisa Stoltzfus,

Michel Steuwer,

Sergei Gorlatch,

Christophe Dubach

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization

Pages 100 - 112

https://doi.org/10.1145/3168824

Published: 24 February 2018

Abstract

Stencil computations are widely used in domains ranging from physical simulations to machine learning. They are embarrassingly parallel and map well onto modern hardware such as Graphics Processing Units (GPUs). Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain-Specific Languages (DSLs) raise the level of programming abstraction while offering good performance. However, this places the burden on DSL implementers, who have to write almost full-fledged parallelizing compilers and optimizers.

Lift has recently emerged as a promising approach to achieving performance portability. It is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is its encoding of optimizations as a system of extensible rewrite rules, which are used to explore the optimization space. However, Lift has so far focused mostly on linear algebra operations, and it remains to be seen whether the approach is applicable to other domains.

This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes.
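The idea of building a stencil from simple 1D primitives can be illustrated with a minimal sketch. The abstract does not name the two added primitives or the rewrite rule, so everything below is an assumption made for illustration: the names `pad` and `slide`, the clamping boundary handling, the 3-point neighbourhood, and the use of plain Python lists. It is not Lift's actual API. The second function mimics what an overlapped-tiling rewrite might produce and shows that it yields the same result as the untiled version.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def pad(left: int, right: int, xs: Sequence[T]) -> List[T]:
    """Extend xs at both ends by repeating the boundary element (clamping)."""
    if not xs:
        raise ValueError("pad needs a non-empty input")
    return [xs[0]] * left + list(xs) + [xs[-1]] * right


def slide(size: int, step: int, xs: Sequence[T]) -> List[List[T]]:
    """Return all windows of `size` elements, advancing `step` elements each time."""
    return [list(xs[i:i + size]) for i in range(0, len(xs) - size + 1, step)]


def stencil_1d(f: Callable[[List[T]], U], xs: Sequence[T]) -> List[U]:
    """3-point stencil expressed as map(f) after slide(3, 1) after pad(1, 1)."""
    return [f(window) for window in slide(3, 1, pad(1, 1, xs))]


def tiled_stencil_1d(f: Callable[[List[T]], U], xs: Sequence[T], tile: int) -> List[U]:
    """Same computation, done tile by tile on overlapping tiles (hypothetical rewrite)."""
    size, step = 3, 1
    if tile < size:
        raise ValueError("tile must hold at least one full neighbourhood")
    padded = pad(1, 1, xs)
    # Neighbouring tiles share size - step elements so no window is lost at a tile edge.
    tile_step = tile - (size - step)
    if (len(padded) - tile) % tile_step != 0:
        raise ValueError("tile size does not evenly cover the padded input")
    tiles = slide(tile, tile_step, padded)
    per_tile = [[f(window) for window in slide(size, step, t)] for t in tiles]
    # Flatten the per-tile results back into a single 1D output.
    return [y for tile_result in per_tile for y in tile_result]


data = [1, 2, 3, 4, 5, 6]
result = stencil_1d(sum, data)               # [4, 6, 9, 12, 15, 17]
tiled = tiled_stencil_1d(sum, data, tile=4)  # same values, computed in 3 tiles
print(result, tiled == result)
```

In this sketch the tiled version only reuses `slide`, once to cut tiles and once inside each tile, which is the kind of reuse of existing primitives the abstract describes. A multidimensional stencil would nest the same functions per dimension.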

Index Terms

Software and its engineering
  Software notations and tools
    Compilers
    General programming languages
      Language types
        Parallel programming languages

Published In

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization

February 2018

377 pages

ISBN: 9781450356176

DOI: 10.1145/3179541

General Chairs:
Jens Knoop, Vienna University of Technology, Austria
Markus Schordan, Lawrence Livermore National Laboratory, USA

Program Chairs:
Teresa Johnson, Google, USA
Michael O'Boyle, University of Edinburgh, UK

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Funding Sources

EPSRC

Conference

CGO '18

CGO '18: 16th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%
