research-article

The pochoir stencil compiler

Authors:
Yuan Tang

Fudan University, MIT, Cambridge, MA, USA

Fudan University, MIT, Cambridge, MA, USA
View Profile

,
Rezaul Alam Chowdhury

Boston University, MIT, Cambridge, MA, USA

Boston University, MIT, Cambridge, MA, USA
View Profile

,
Bradley C. Kuszmaul

MIT, Cambridge, MA, USA

MIT, Cambridge, MA, USA
View Profile

,
Chi-Keung Luk

Intel Corp., Hudson, MA, USA

Intel Corp., Hudson, MA, USA
View Profile

,
Charles E. Leiserson

MIT, Cambridge, MA, USA

MIT, Cambridge, MA, USA
View Profile

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architecturesJune 2011Pages 117–128https://doi.org/10.1145/1989493.1989508

Published:04 June 2011Publication History

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

Pages 117–128

ABSTRACT

A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++ which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallelloop implementations, typically running 2-10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.

References

T. Akutsu. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discrete Applied Mathematics, 104:45--62, 2000. Google ScholarDigital Library
R. Bleck, C. Rooth, D. Hu, and L. T. Smith. Salinity-driven Thermocline Transients in a wind- and Thermohaline-forced Isopycnic Coordinate Model of the North Atlantic. Journal of Physical Oceanography, 22(12):1486--1505, 1992.Google ScholarCross Ref
R. G. Brickner, W. George, S. L. Johnsson, and A. Ruttenberg. A stencil compiler for the Connection Machine models CM-2/200. In Proceedings of the Fourth Workshop on Compilers for Parallel Computers, 1993.Google Scholar
M. Bromley, S. Heller, T. McNerney, and G. L. Steele Jr. Fortran at ten Gigaflops: The Connection Machine convolution compiler. In PLDI, pages 145--156, Toronto, Ontario, Canada, June 26-28 1991. Google ScholarDigital Library
C++ Standards Committee. Working draft, standard for programming language C++. available from http://www.open-std.org/jtc1/ sc22/wg21/docs/papers/2011/n3242.pdf, 2011. ISO/IEC Document Number N3242=11-0012.Google Scholar
R. A. Chowdhury, H.-S. Le, and V. Ramachandran. Cache-oblivious dynamic programming for bioinformatics. TCBB, 7(3):495--510, July-Sept. 2010. Google ScholarDigital Library
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, third edition, 2009. Google ScholarDigital Library
K. Datta. Auto-tuning Stencil Codes for Cache-Based Multicore Platforms. PhD thesis, EECS Department, University of California, Berkeley, Dec 2009. Google ScholarDigital Library
K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC, pages 4:1--4:12, Austin, TX, Nov. 15-18 2008. Google ScholarDigital Library
H. Dursun, K.-i. Nomura, L. Peng, R. Seymour, W. Wang, R. K. Kalia, A. Nakano, and P. Vashishta. A multilevel parallelization framework for high-order stencil computations. In Euro-Par, pages 642--653, Delft, The Netherlands, Aug. 25-28 2009. Google ScholarDigital Library
H. Dursun, K.-i. Nomura, W. Wang, M. Kunaseth, L. Peng, R. Seymour, R. K. Kalia, A. Nakano, and P. Vashishta. In-core optimization of high-order stencil computations. In PDPTA, pages 533--538, Las Vegas, NV, July13-16 2009.Google Scholar
J. F. Epperson. An Introduction to Numerical Methods and Analysis. Wiley-Interscience, 2007.Google Scholar
H. Feshbach and P. Morse. Methods of Theoretical Physics. Feshbach Publishing, 1981.Google Scholar
M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS, pages 285--297, New York, NY, Oct. 17-19 1999. Google ScholarDigital Library
M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, pages 361--366, Cambridge, MA, June 20-22 2005. Google ScholarDigital Library
M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. Theory of Computing Systems, 45(2):203--233, 2009. Google ScholarDigital Library
M. Gardner. Mathematical Games. Scientific American, 223(4):120--123, 1970.Google ScholarCross Ref
O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705--708, 1982.Google ScholarCross Ref
Y. He, C. E. Leiserson, and W. M. Leiserson. The Cilkview scalability analyzer. In SPAA, pages 145--156, Santorini, Greece, June 13--15 2010. Google ScholarDigital Library
P. Hudak. Building domain-specific embedded languages. ACM Computing Surveys, 28(4), December 1996. Google ScholarDigital Library
Intel software autotuning tool. http://software.intel.com/en- us/articles/intel-software-autotuning-tool/, 2010.Google Scholar
Intel Corporation. Intel Cilk Plus Language Specification, 2010. Document Number: 324396-001US. Available from http://software.intel.com/sites/products/cilk-plus/ cilk_plus_language_specification.pdf.Google Scholar
C. John. Options, futures, and other derivatives. Prentice Hall, 2006.Google Scholar
S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams. An auto-tuning framework for parallel multicore stencil computations. In IPDPS, pages 1--12, 2010.Google ScholarCross Ref
S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit and explicit optimizations for stencil computations. In MSPC, pages 51--60, San Jose, CA, 2006. Google ScholarDigital Library
S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick. Impact of modern memory subsystems on cache optimizations for stencil computations. In MSP, pages 36--43, Chicago, IL, June 12 2005. Google ScholarDigital Library
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, San Diego, CA, June 10-13 2007. Google ScholarDigital Library
https://perf.wiki.kernel.org/index.php/Main_Page.Google Scholar
R. Mei, W. Shyy, D. Yu, and L. Luo. Lattice Boltzmann method for 3-D flows with curved boundary. J. of Comput. Phys, 161(2):680--699, 2000. Google ScholarDigital Library
M. Mernik, J. Heering, and A. M. Sloane. When and how to develop domain-specific languages. ACM Computing Surveys, 37:316--344, December 2005. Google ScholarDigital Library
P. Micikevicius. 3D finite difference computation on GPUs using CUDA. In GPPGPU, pages 79--84, Washington, DC, Mar. 8 2009. Google ScholarDigital Library
A. Nakano, R. Kalia, and P. Vashishta. Multiresolution molecular dynamics algorithm for realistic materials modeling on parallel computers. Computer Physics Communications, 83(2-3):197--214, 1994.Google ScholarCross Ref
A. Nitsure. Implementation and optimization of a cache oblivious lattice boltzmann algorithm. Master's thesis, Institut für Informatic, Friedrich-Alexander-Universität Erlangen-Nürnberg, July 2006.Google Scholar
OpenMP application program interface, version 2.5. OpenMP specification, May 2005.Google Scholar
L. Peng, R. Seymour, K.-i. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, A. Loddoch, M. Netzband, W. R. Volz, and C. C. Wong. High-order stencil computations on multicore clusters. In IPDPS, pages 1--11, Rome, Italy, May 23-29 2009. Google ScholarDigital Library
S. Peyton Jones. Haskell 98 Language and Libraries: The Revised Report. Cambridge University Press, 1998.Google Scholar
H. Prokop. Cache-oblivious algorithms. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, June 1999.Google Scholar
G. Roth, J. Mellor-Crummey, K. Kennedy, and R. G. Brickner. Compiling stencils in high performance Fortran. In SC, pages 1--20, San Jose, CA, Nov. 16-20 1997. ACM. Google ScholarDigital Library
A. Taflove and S. Hagness. Computational Electrodynamics: The Finite-Difference Time-Domain Method. Artech House Norwood, MA, 2000.Google Scholar
A. van Deursen, P. Klint, and J. Visser. Domain-specific languages: An annotated bibliography. SIGPLAN Not., 35(6):26--36, June 2000. Google ScholarDigital Library
S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Lattice Boltzmann simulation optimization on leading multicore platforms. In IPDPS, pages 1--14, Miami, FL, Apr. 14-18 2008.Google ScholarCross Ref

Index Terms

The pochoir stencil compiler
1. Mathematics of computing
  1. Mathematical software

Recommendations

gpucc: an open-source GPGPU compiler
CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization

Graphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has ...
Read More
Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
Highlights
- Generate parallel CUDA code from sequential C input code using a compiler-based tool for key operators in Geometric Multigrid.
Abstract
GPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model ...
Read More
Directive-based parallelization of the NIM weather model for GPUs
WACCPD '14: Proceedings of the First Workshop on Accelerator Programming using Directives

The NIM is a performance-portable model that runs on CPU, GPU and MIC architectures with a single source code. The single source plus efficient code design allows application scientists to maintain the Fortran code, while computer scientists optimize ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
June 2011
404 pages
ISBN:9781450307437
DOI:10.1145/1989493
Co-chairs:
Friedhelm Meyer auf der Heide
University of Paderborn, Germany
,
Rajmohan Rajaraman
Northeastern University, USA
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 June 2011
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
C++
cache-oblivious algorithm
cilk
compiler
embedded domain-specific language
multicore
parallel computation
stencil computation
trapezoidal decomposition
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate447of1,461submissions,31%
Upcoming Conference
SPAA '24

Sponsor:

sigact

sigact

36th ACM Symposium on Parallelism in Algorithms and Architectures

June 17 - 21, 2024

Nantes , France
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 248
  Total Citations
  View Citations
- 835
  Total Downloads
- Downloads (Last 12 months)45
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

The pochoir stencil compiler

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

ABSTRACT

References

Cited By

Index Terms

Recommendations

gpucc: an open-source GPGPU compiler

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

Directive-based parallelization of the NIM weather model for GPUs

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

The pochoir stencil compiler

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

ABSTRACT

References

Cited By

Index Terms

Recommendations

gpucc: an open-source GPGPU compiler

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

Directive-based parallelization of the NIM weather model for GPUs

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media