Abstract
Finite Difference (FD) is a widely used method for solving Partial Differential Equations (PDEs). PDEs are at the core of many simulations in scientific fields such as geophysics and astrophysics. A typical FD solver performs stencil computations over the entire computational domain, thereby evaluating the differential operators. In general terms, a stencil computation consists of a weighted accumulation of the contributions of neighbor points along the Cartesian axes. Optimizing stencil computations is therefore crucial to reducing application execution time.
Stencil computation performance is bound by two main factors: the memory access pattern and inefficient reuse of the accessed data. We propose a novel algorithm, named Semi-stencil, that tackles both problems. The main idea behind the algorithm is to change how the stencil computation progresses through the computational domain. Instead of accessing all required neighbors and adding all their contributions at once, Semi-stencil divides the computation into several updates. Each update gathers half of the axis neighbors while partially computing the stencil on a set of closely located points. As Semi-stencil progresses through the domain, the stencil computations on previously visited points are completed. This computation strategy improves the memory access pattern and efficiently reuses the accessed data.
Our initial target architecture was the Cell/B.E., where Semi-stencil on an SPE ran 44% faster than the naive stencil implementation. Since then, we have continued our research on emerging multicore architectures in order to assess and extend this work to homogeneous architectures. The experiments presented combine the Semi-stencil strategy with the space- and time-blocking algorithms used on hierarchical memory architectures. Two x86 (Intel Nehalem and AMD Opteron) and two POWER (IBM POWER6 and IBM BG/P) platforms serve as testbeds, where the best improvements for a 25-point stencil range from 1.27× to 1.76×. The results show that this novel strategy is a feasible optimization method that may be integrated into auto-tuning frameworks. Also, since all current architectures are multicore based, we include a brief section presenting scalability results on IBM POWER7-, Intel Xeon-, and MIC-based systems. In a nutshell, the algorithm scales as well as or better than other stencil techniques; for instance, the scalability of Semi-stencil on MIC for one test case reached 93.8× on 244 threads.
Supplemental Material
Available for Download
Software for Semi-Stencil
Index Terms
- Algorithm 942: Semi-Stencil