ABSTRACT
Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is sensitive to code complexity and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is Vector-Length Agnostic (VLA). VLA enables the generation of binary files that run regardless of the physical vector register length.
In this paper, we leverage the main characteristics of SVE to implement and optimize stencil computations, which are ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading or data reuse. Our detailed simulations using vector lengths ranging from 128 to 2,048 bits show that these optimizations can lead to performance improvements over straightforward vectorized code of up to 56.6% for 2,048-bit vectors. In addition, we show that certain optimizations can hurt performance due to a reduction in arithmetic intensity, and provide insight useful for compiler optimizers.
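To make the vector-length-agnostic idea concrete, the following is a minimal sketch in plain C, not SVE intrinsics and not the code evaluated in the paper. It emulates a VLA loop for a hypothetical 3-point averaging stencil: the lane count `vl` is a run-time parameter, and a per-lane predicate masks off the tail. The function names, the stencil itself, and the emulation approach are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for an illustrative 3-point stencil:
 * out[i] = (in[i-1] + in[i] + in[i+1]) / 3 for interior points only. */
void stencil3_scalar(const double *in, double *out, size_t n) {
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

/* Emulated vector-length-agnostic version (assumption: plain C stand-in
 * for predicated vector code). `vl` is the number of elements per vector
 * and is only known at run time; the loop never hard-codes it. */
void stencil3_vla(const double *in, double *out, size_t n, size_t vl) {
    assert(vl > 0);
    if (n < 3)
        return;                      /* no interior points to update */
    const size_t end = n - 1;        /* one past the last interior index */
    for (size_t base = 1; base < end; base += vl) {
        /* Each iteration of the inner loop models one vector lane. */
        for (size_t lane = 0; lane < vl; ++lane) {
            size_t i = base + lane;
            int active = i < end;    /* predicate: lane is inside the domain */
            if (active)
                out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
        }
    }
}
```

With 64-bit elements, the 128-bit to 2,048-bit range mentioned above corresponds to `vl` values from 2 to 32. Calling `stencil3_vla` with any of these values should produce the same output as `stencil3_scalar`, which is the property that lets a single binary run on different physical vector lengths.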