DOI: 10.1145/2016604.2016641
research-article

Understanding stencil code performance on multicore architectures

Published: 03 May 2011

Abstract

Stencil computations are the foundation of many large applications in scientific computing. Previous research has shown that several optimization mechanisms, including rectangular blocking and time skewing combined with wavefront- and pipeline-based parallelization, can significantly improve the performance of stencil kernels on multicore architectures. However, the overall performance impact of these optimizations is difficult to predict because of the interplay of load imbalance, synchronization overhead, and cache locality. This paper presents a detailed performance study of these optimizations: we apply them in a wide variety of configurations, use hardware counters to monitor the efficiency of architectural components, and then develop a set of formulas via regression analysis that model their overall performance impact in terms of the affected hardware counter values. We have applied our methodology to three stencil computation kernels: a 7-point Jacobi, a 27-point Jacobi, and a 7-point Gauss-Seidel computation. Our experimental results show that a precise formula can be developed for each kernel to accurately model the overall performance impact of varying optimizations and thereby effectively guide the performance analysis and tuning of these kernels.
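To make the terminology concrete, the following is a minimal sketch of a 7-point Jacobi sweep and a rectangularly blocked variant of the same sweep. It is not the code studied in the paper: the function names, the flat-list grid layout, the coefficients `alpha` and `beta`, and the block sizes are all assumptions chosen for illustration, and pure Python is used only for readability, not performance.

```python
def idx(n, i, j, k):
    """Flatten (i, j, k) into an index of a row-major n*n*n grid."""
    return (i * n + j) * n + k


def make_grid(n):
    """Build a deterministic n*n*n test grid stored as a flat list of floats."""
    return [float((31 * i + 17 * j + 7 * k) % 11)
            for i in range(n) for j in range(n) for k in range(n)]


def update_point(src, dst, n, i, j, k, alpha, beta):
    """7-point update: the centre point plus its six face neighbours."""
    dst[idx(n, i, j, k)] = (
        alpha * src[idx(n, i, j, k)]
        + beta * (src[idx(n, i - 1, j, k)] + src[idx(n, i + 1, j, k)]
                  + src[idx(n, i, j - 1, k)] + src[idx(n, i, j + 1, k)]
                  + src[idx(n, i, j, k - 1)] + src[idx(n, i, j, k + 1)])
    )


def jacobi7_sweep(src, dst, n, alpha=0.4, beta=0.1):
    """One unblocked sweep over the interior points (boundary is untouched)."""
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                update_point(src, dst, n, i, j, k, alpha, beta)


def jacobi7_sweep_blocked(src, dst, n, bi, bj, bk, alpha=0.4, beta=0.1):
    """The same sweep, visiting the interior in bi x bj x bk rectangular blocks."""
    for ii in range(1, n - 1, bi):
        for jj in range(1, n - 1, bj):
            for kk in range(1, n - 1, bk):
                # Clamp each block so it never reaches the boundary layer.
                for i in range(ii, min(ii + bi, n - 1)):
                    for j in range(jj, min(jj + bj, n - 1)):
                        for k in range(kk, min(kk + bk, n - 1)):
                            update_point(src, dst, n, i, j, k, alpha, beta)
```

Because a Jacobi sweep reads only from `src` and writes only to `dst`, the blocks can be visited in any order without changing the result, so the blocked and unblocked sweeps produce identical output for any block sizes. Blocking changes only the order in which memory is touched, which is why its benefit shows up in cache behavior rather than in the arithmetic.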

Published In

CF '11: Proceedings of the 8th ACM International Conference on Computing Frontiers
May 2011
268 pages
ISBN:9781450306980
DOI:10.1145/2016604

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CF'11: Computing Frontiers Conference
May 3 - 5, 2011
Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

Cited By

  • (2021) Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03853-x. Online publication date: 15-May-2021
  • (2020) Efficient Acceleration of Stencil Applications through In-Memory Computing. Micromachines 11:6 (622). DOI: 10.3390/mi11060622. Online publication date: 26-Jun-2020
  • (2020) Predicting and Comparing the Performance of Array Management Libraries. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (906-915). DOI: 10.1109/IPDPS47924.2020.00097. Online publication date: May-2020
  • (2019) Reproducible stencil compiler benchmarks using prova! Future Generation Computer Systems 92:C (933-946). DOI: 10.1016/j.future.2018.05.023. Online publication date: 1-Mar-2019
  • (2019) Modern Code Applied in Stencil in Edge Detection of an Image for Architecture Intel Xeon Phi KNL. Technologies and Innovation (151-163). DOI: 10.1007/978-3-030-34989-9_12. Online publication date: 20-Nov-2019
  • (2017) OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology. IEEE Transactions on Parallel and Distributed Systems 28:5 (1390-1402). DOI: 10.1109/TPDS.2016.2614981. Online publication date: 1-May-2017
  • (2017) Last Level Collective Hardware Prefetching For Data-Parallel Applications. 2017 IEEE 24th International Conference on High Performance Computing (HiPC) (72-83). DOI: 10.1109/HiPC.2017.00018. Online publication date: Dec-2017
  • (2017) Performance prediction of finite-difference solvers for different computer architectures. Computers & Geosciences 105 (148-157). DOI: 10.1016/j.cageo.2017.04.014. Online publication date: Aug-2017
  • (2017) Panda. International Journal of Parallel Programming 45:3 (711-729). DOI: 10.1007/s10766-016-0454-1. Online publication date: 1-Jun-2017
  • (2017) A quasi‐cache‐aware model for optimal domain partitioning in parallel geometric multigrid. Concurrency and Computation: Practice and Experience 30:9. DOI: 10.1002/cpe.4328. Online publication date: 9-Oct-2017
