Abstract
This article presents matrix-free finite-element techniques for efficiently solving partial differential equations on modern many-core processors, such as graphics cards. We develop a GPU parallelization of a matrix-free geometric multigrid iterative solver targeting moderate and high polynomial degrees, with support for general curved and adaptively refined hexahedral meshes with hanging nodes. The central algorithmic component is the matrix-free operator evaluation with sum factorization. We compare the node-level performance of our implementation running on an Nvidia Pascal P100 GPU to a highly optimized multicore implementation running on comparable Intel Broadwell CPUs and an Intel Xeon Phi. Our experiments show that the GPU implementation is approximately 1.5 to 2 times faster across four different scenarios of the Poisson equation and a variety of element degrees in 2D and 3D. The lowest time to solution per degree of freedom is recorded for moderate polynomial degrees between 3 and 5. A detailed performance analysis highlights the capabilities of the GPU architecture and the chosen execution model with threading within the element, particularly with respect to the evaluation of the matrix-vector product. Atomic intrinsics are shown to provide a fast way for avoiding the possible race conditions in summing the elemental residuals into the global vector associated to shared vertices, edges, and surfaces. In addition, the solver infrastructure allows for using mixed-precision arithmetic that performs the multigrid V-cycle in single precision with an outer correction in double precision, increasing throughput by up to 83%.
- Mark Adams, Marian Brezina, Jonathan Hu, and Ray Tuminaro. 2003. Parallel multigrid smoothing: Polynomial versus Gauss--Seidel. Journal of Computational Physics 188 (2003), 593--610. Google ScholarDigital Library
- Mark Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, and Samuel Williams. 2019. High-performance geometric multigrid. Retrieved January 18, 2019 from https://hpgmg.org/.Google Scholar
- Mark Adams, Phillip Colella, Daniel T. Graves, Jeff N. Johnson, Hans S. Johansen, Noel D. Keen, Terry J. Ligocki, et al. 2015. Chombo Software Package for AMR Applications Design Document. Technical Report. Lawrence Berkeley National Laboratory. https://crd.lbl.gov/assets/pubs_presos/chomboDesign.pdf.Google Scholar
- Mark F. Adams, Ravi Samtaney, and Achi Brandt. 2010. Toward textbook multigrid efficiency for fully implicit resistive magnetohydrodynamics. Journal of Computational Physics 229, 18 (2010), 6208--6219. Google ScholarDigital Library
- Giovanni Alzetta, Daniel Arndt, Wolfgang Bangerth, Vishal Boddu, Benjamin Brands, Denis Davydov, Rene Gassmöller, et al. 2018. The deal.II library, version 9.0. Journal of Numerical Mathematics 26, 4 (2018), 173--184.Google ScholarCross Ref
- Wolfgang Bangerth, Carsten Burstedde, Timo Heister, and Martin Kronbichler. 2011. Algorithms and data structures for massively parallel generic finite element codes. ACM Transactions on Mathematical Software 38, 2 (2011), Article 14. Google ScholarDigital Library
- Marsha J. Berger, Michael J. Aftosmis, and Gedas Adomavicius. 2001. Parallel multigrid on Cartesian meshes with complex geometry. In Parallel Computational Fluid Dynamics. North Holland, Amsterdam, 283--290.Google Scholar
- Achi Brandt. 1977. Multi-level adaptive solutions to boundary-value problems. Mathematics of Computation 31 (1977), 333--390.Google ScholarCross Ref
- Jed Brown. 2010. Efficient nonlinear solvers for nodal high-order finite elements in 3D. Journal of Scientific Computing 45, 1--3 (2010), 48--63. Google ScholarDigital Library
- Jed Brown, Veselin Dobrev, Som Dutta, Paul Fischer, Kazem Kamran, Tzanio Kolev, David Medina, et al. 2018. ECP Milestone Report: Propose High-Order Mesh/Data Format, WBS 2.2.6.06, Milestone CEED-MS18. Technical Report. U.S. Department of Energy. https://ceed.exascaleproject.org/docs/ceed-ms18-report.pdf.Google Scholar
- Jed Brown, Barry Smith, and Aron Ahmadia. 2013. Achieving textbook multigrid efficiency for hydrostatic ice sheet flow. SIAM Journal on Scientific Computing 35, 2 (2013), B359--B375.Google ScholarCross Ref
- CEED. 2018. CEED Bake-Off Problems (Benchmarks). Retrieved April 8, 2019 from http://ceed.exascaleproject.org/bps/.Google Scholar
- Justin Chang, Maurice S. Fabien, Matthew G. Knepley, and Richard T. Mills. 2018. Comparative study of finite element methods using the time-accuracy-size (TAS) spectrum analysis. SIAM Journal on Scientific Computing 40, 6 (2018), C779--C802.Google ScholarCross Ref
- Thomas C. Clevenger, Timo Heister, Guido Kanschat, and Martin Kronbichler. 2019. A flexible, parallel, adaptive geometric multigrid method for FEM. arXiv:1904.03317 preprint.Google Scholar
- Michel O. Deville, Paul F. Fischer, and Ernest H. Mund. 2002. High-Order Methods for Incompressible Fluid Flow. Vol. 9. Cambridge University Press.Google Scholar
- Maurice S. Fabien, Matthew G. Knepley, Richard T. Mills, and Beatrice M. Rivière. 2019. Manycore parallel computing for a hybridizable discontinuous Galerkin nested multigrid method. SIAM Journal on Scientific Computing 41, 2 (2019), C73-C96.Google ScholarCross Ref
- Niklas Fehn, Wolfgang A. Wall, and Martin Kronbichler. 2018. Efficiency of high-performance discontinuous Galerkin spectral element methods for under-resolved turbulent incompressible flows. International Journal for Numerical Methods in Fluids 88, 1 (2018), 32--54.Google ScholarCross Ref
- Niklas Fehn, Wolfgang A. Wall, and Martin Kronbichler. 2019. A matrix-free high-order discontinuous Galerkin compressible Navier--Stokes solver: A performance comparison of compressible and incompressible formulations for turbulent incompressible flows. International Journal for Numerical Methods in Fluids 89, 3 (2019), 71--102.Google ScholarCross Ref
- Chunsheng Feng, Shi Shu, Jinchao Xu, and Chen-Song Zhang. 2014. Numerical study of geometric multigrid methods on CPU-GPU heterogeneous computers. Advances in Applied Mathematics and Mechanics 6, 1 (2014), 1--23.Google ScholarCross Ref
- Krzysztof J. Fidkowski, Todd A. Oliver, James Lu, and David L. Darmofal. 2005. p-multigrid solver of high-order discontinuous Galerkin discretizations of the compressible Navier--Stokes equations. Journal of Computational Physics 207, 1 (2005), 92--113. Google ScholarDigital Library
- Paul F. Fischer, Thilina Rathnayake, Som Dutta, Veselin Dobrev, Tzanio Kolev, Jean-Sylvain Camier, Martin Kronbichler, et al. 2019. Running faster in HPC applications. In Preparation.Google Scholar
- Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, and Stefan Turek. 2013. Towards a complete FEM-based simulation toolkit on GPUs: Unstructured grid finite element geometric multigrid solvers with strong smoothers based on sparse approximate inverses. Computers 8 Fluids 80 (2013), 327--332.Google Scholar
- Amir Gholami, Dhairya Malhotra, Hari Sundar, and George Biros. 2016. FFT, FMM, or multigrid? A comparative study of state-of-the-art Poisson solvers for uniform and nonuniform grids in the unit cube. SIAM Journal on Scientific Computing 38, 3 (2016), C280--C306.Google ScholarDigital Library
- Bjorn Gmeiner, Ulrich Rüde, Holger Stengel, Christian Waluga, and Barbara Wohlmuth. 2015. Performance and scalability of hierarchical hybrid multigrid solvers for stokes systems. SIAM Journal on Scientific Computing 37, 2 (2015), C143--C168.Google ScholarCross Ref
- Dominik Göddeke, Robert Strzodka, and Stefan Turek. 2007. Performance and accuracy of hardware-oriented native-, emulated-and mixed-precision solvers in FEM simulations. International Journal of Parallel Emergent and Distributed Systems 22, 4 (2007), 221--256. Google ScholarDigital Library
- William D. Gropp, Dinesh K. Kaushik, David E. Keyes, and Barry F. Smith. 1999. Towards realistic performance bounds for implicit CFD codes. In Proceedings of Parallel CFD’99.Google Scholar
- William D. Gropp, Dinesh K. Kaushik, David E. Keyes, and Barry F. Smith. 2000. Performance modeling and tuning of an unstructured mesh CFD application. In SC’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing. 34. Google ScholarDigital Library
- Mark Harris. 2007. Optimizing CUDA. In SC’07: High Performance Computing With CUDA. Nvidia.Google Scholar
- Brian Helenbrook, Dimitri J. Mavriplis, and Harold L. Atkins. 2003. Analysis of p-multigrid for continuous and discontinuous finite element discretizations. In Proceedings of the 16th AIAA Computational Fluid Dynamics Conference.Google Scholar
- Michael A. Heroux, Roscoe A. Bartlett, Vicki E. Howle, Robert J. Hoekstra, Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, et al. 2005. An overview of the Trilinos project. ACM Transactions on Mathematical Software 31, 3 (2005), 397--423. Google ScholarDigital Library
- Bärbel Janssen and Guido Kanschat. 2011. Adaptive multilevel methods with local smoothing for H1-and Hcurl-conforming high order finite element methods. SIAM Journal on Scientific Computing 33, 4 (2011), 2095--2114. Google ScholarDigital Library
- James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming, Knights Landing Edition. Morgan Kaufmann, Cambridge, MA. Google ScholarDigital Library
- Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the Volta GPU Architecture Through Microbenchmarking: GTC 2018. Retrieved April 8, 2019 from http://on-demand.gputechconf.com/gtc/2018/presentation/s8122-dissecting-the-volta-gpu-architecture-through-microbenchmarking.pdf.Google Scholar
- Alison C. Jones and Peter K. Jimack. 2005. An adaptive multigrid tool for elliptic and parabolic systems. International Journal for Numerical Methods in Fluids 47 (2005), 1123--1128.Google ScholarCross Ref
- Guido Kanschat. 2004. Multilevel methods for discontinuous Galerkin FEM on locally refined meshes. Computers 8 Structures 82, 28 (2004), 2437--2445.Google Scholar
- George E. Karniadakis and Spencer J. Sherwin. 2013. Spectral/HP Element Methods for Computational Fluid Dynamics. Oxford University Press.Google Scholar
- Donald W. Kelly, Jorge P. De Sousa Rodrigues Gago, Olgierd C. Zienkiewicz, and Ivo Babuška. 1983. A posteriori error analysis and adaptive processes in the finite element method: Part I—Error analysis. International Journal for Numerical Methods in Engineering 19 (1983), 1593--1619.Google ScholarCross Ref
- Andreas Klöckner. 2014. Loo.py: Transformation-based code generation for GPUs and CPUs. In Proceedings of ARRAY’14: ACM SIGPLAN Workshop on Libraries, Languages, and Compilers for Array Programming. Google ScholarDigital Library
- Andreas Klöckner, Tim Warburton, Jeffrey Bridge, and Jan S. Hesthaven. 2009. Nodal discontinuous Galerkin methods on graphics processors. Journal of Computational Physics 228, 21 (2009), 7863--7882. Google ScholarDigital Library
- Matthew G. Knepley and Andy R. Terrel. 2013. Finite element integration on GPUs. ACM Transactions on Mathematical Software 39, 2 (Feb. 2013), Article 10, 13 pages. Google ScholarDigital Library
- Dimitri Komatitsch, Gordon Erlebacher, Dominik Göddeke, and David Michéa. 2010. High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster. Journal of Computational Physics 229, 20 (2010), 7692--7714. Google ScholarDigital Library
- David A. Kopriva. 2009. Implementing Spectral Methods for Partial Differential Equations: Algorithms for Scientists and Engineers. Springer. Google Scholar
- Katharina Kormann and Martin Kronbichler. 2011. Parallel finite element operator application: Graph partitioning and coloring. In Proceedings of the 7th IEEE International Conference on eScience. 332--339. Google ScholarDigital Library
- Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. 2014. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing 36, 5 (2014), C401--C423.Google ScholarDigital Library
- Martin Kronbichler and Katharina Kormann. 2012. A generic interface for parallel cell-based finite element operator application. Computers 8 Fluids 63 (2012), 135--147.Google Scholar
- Martin Kronbichler and Katharina Kormann. 2019. Fast matrix-free evaluation of discontinuous Galerkin finite element operators. ACM Transactions on Mathematical Software, in press (2019).Google Scholar
- Martin Kronbichler, Katharina Kormann, Igor Pasichnyk, and Momme Allalen. 2017. Fast matrix-free discontinuous Galerkin kernels on modern computer architectures. In ISC High Performance. Lecture Notes in Computer Science, Vol. 10266. Springer, 237--255.Google Scholar
- Martin Kronbichler and Wolfgang A. Wall. 2018. A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers. SIAM Journal on Scientific Computing 40, 5 (2018), A3423--A3448.Google ScholarDigital Library
- Karl Ljungkvist. 2014. Matrix-free finite-element operator application on graphics processing units. In Euro-Par 2014: Parallel Processing Workshops. Lecture Notes in Computer Science, Vol. 8806. Springer, 450--461.Google Scholar
- Karl Ljungkvist. 2017. Matrix-free finite-element computations on graphics processors with adaptively refined unstructured meshes. In HPC’17: Proceedings of the 25th High Performance Computing Symposium. Google ScholarDigital Library
- James W. Lottes and Paul F. Fischer. 2005. Hybrid multigrid-Schwarz algorithms for the spectral element method. Journal of Scientific Computing 24 (2005), 613--646.Google ScholarCross Ref
- Hong Luo, Joseph D. Baum, and Rainald Löhner. 2006. A p-multigrid discontinuous Galerkin method for the Euler equations on unstructured grids. Journal of Computational Physics 211 (2006), 767--783. Google ScholarDigital Library
- Robert E. Lynch, John R. Rice, and Donald H. Thomas. 1964. Direct solution of partial difference equations by tensor product methods. Numerische Mathematik 6 (1964), 185--199. Google ScholarDigital Library
- Dave A. May, Jed Brown, and Laetitia Le Pourhiet. 2014. pTatin3D: High-performance methods for long-term lithospheric dynamics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’14). 274--284. Google ScholarDigital Library
- Stephen McCormick and James Thomas. 1986. The fast adaptive composite grid (FAC) method for elliptic equations. Mathematics of Computation 46, 174 (1986), 439--456. Google ScholarDigital Library
- Eike Müller, Xu Guo, Robert Scheichl, and Sinan Shi. 2013. Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs. Computing and Visualization in Science 16, 2 (2013), 41--58. Google ScholarDigital Library
- Nvidia Corporation. 2013. CUDA cuSPARSE Library. Version 8.0. Nvidia.Google Scholar
- Nvidia Corporation. 2016. GP100 Pascal Whitepaper. Technical Report. Version 1.1. Nvidia.Google Scholar
- Nvidia Corporation. 2016. Kepler Tuning Guide. Nvidia.Google Scholar
- Steven A. Orszag. 1980. Spectral methods for problems in complex geometries. Journal of Computational Physics 37, 1 (1980), 70--92.Google ScholarCross Ref
- Anthony T. Patera. 1984. A spectral element method for fluid dynamics: Laminar flow in a channel expansion. Journal of Computational Physics 54, 3 (1984), 468--488.Google ScholarCross Ref
- Yehuda Pinchover and Jacob Rubinstein. 2005. An Introduction to Partial Differential Equations. Cambridge University Press. Google ScholarDigital Library
- Jean-Francois Remacle, Rajesh Gandham, and Timothy Warburton. 2016. GPU accelerated spectral finite elements on all-hex meshes. Journal of Computational Physics 324 (2016), 246--257. Google ScholarDigital Library
- Johann Rudi, A. Cristiano I. Malossi, Tobin Isaac, Georg Stadler, Michael Gurnis, Peter W. J. Staar, Yves Ineichen, et al. 2015. An extreme-scale implicit solver for complex PDEs: Highly heterogeneous flow in Earth’s mantle. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’15). 5:1--5:11. Google ScholarDigital Library
- Jörg Stiller. 2017. Nonuniformly weighted Schwarz smoothers for spectral element multigrid. Journal of Scientific Computing 72 (2017), 81--96. Google ScholarDigital Library
- Hari Sundar, George Biros, Carsten Burstedde, Johann Rudi, Omar Ghattas, and Georg Stadler. 2012. Parallel geometric-algebraic multigrid on unstructured forests of octrees. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’12). Google ScholarDigital Library
- Hari Sundar, Georg Stadler, and George Biros. 2015. Comparison of multigrid algorithms for high-order continuous finite element discretizations. Numerical Linear Algebra With Applications 22 (2015), 664--680.Google ScholarCross Ref
- Kasia Świrydowicz, Noel Chalmers, Ali Karakus, and Timothy Warburton. 2019. Acceleration of tensor-product operations for high-order finite element methods. International Journal for High Performance Computing Applications. In Press.Google Scholar
- James L. Thomas, Boris Diskin, and Achi Brandt. 2003. Textbook multigrid efficiency for fluid simulations. Annual Reviews of Fluid Mechanics 35 (2003), 317--340.Google ScholarCross Ref
- Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of PSTI’10: The 1st International Workshop on Parallel Software Tools and Tool Infrastructures. https://github.com/RRZE-HPC/likwid. Google ScholarDigital Library
- Ulrich Trottenberg, Cornelius Oosterlee, and Anton Schüller. 2001. Multigrid. Elsevier Academic Press, London, UK. Google ScholarDigital Library
- Henry M. Tufo and Paul F. Fischer. 1999. Terascale spectral element algorithms and implementations. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing. ACM, New York, NY. Google ScholarDigital Library
- Bruno Turcksin, Martin Kronbichler, and Wolfgang Bangerth. 2016. WorkStream—A design pattern for multicore-enabled finite element computations. ACM Transactions on Mathematical Software 43, 1 (2016), Article 2, 29 pages. Google ScholarDigital Library
- Vasily Volkov. 2010. Better performance at lower occupancy. In Proceedings of the GPU Technology Conference (GTC’10), Vol. 10. 16.Google Scholar
- Nicholas Wilt. 2013. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education.Google Scholar
Index Terms
- Multigrid for Matrix-Free High-Order Finite Element Computations on Graphics Processors
Recommendations
Optimization of geometric multigrid for emerging multi- and manycore processors
SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and AnalysisMultigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this paper, we explore optimization techniques for geometric multigrid on existing and emerging ...
A performance study of general-purpose applications on graphics processors using CUDA
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...
GPU-based matrix-free finite element solver exploiting symmetry of elemental matrices
AbstractMatrix-free solvers for finite element method (FEM) avoid assembly of elemental matrices and replace sparse matrix-vector multiplication required in iterative solution method by an element level dense matrix-vector product. In this paper, a novel ...
Comments