DOI: 10.1145/3243176.3243192
research-article

Stencil codes on a vector length agnostic architecture

Published: 01 November 2018

ABSTRACT

Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is susceptible to code complexity, and usually limited due to data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is Vector-Length Agnostic (VLA). VLA enables the generation of binary files that run regardless of the physical vector register length.
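
To make the vector-length-agnostic idea concrete, here is a minimal sketch in plain Python rather than SVE code. It emulates a strip-mined loop whose lane count `vl` is only known at run time, with a per-lane predicate covering the tail. The helper names (`whilelt`, `stencil_3pt_vla`) and the 3-point averaging stencil are illustrative assumptions and are not taken from the paper; `whilelt` is only loosely modelled on the SVE predicate-generating instruction of the same name.

```python
def whilelt(start, end, vl):
    """Predicate with one flag per lane: lane k is active when start + k < end."""
    return [start + k < end for k in range(vl)]


def stencil_3pt_vla(src, vl):
    """3-point average over the interior of src, processed vl lanes at a time."""
    n = len(src)
    dst = list(src)            # boundary elements are copied unchanged
    i = 1                      # first interior index
    while i < n - 1:
        pred = whilelt(i, n - 1, vl)
        for lane, active in enumerate(pred):
            if active:         # inactive lanes neither load nor store
                j = i + lane
                dst[j] = (src[j - 1] + src[j] + src[j + 1]) / 3.0
        i += vl                # advance by the runtime vector length
    return dst


data = [float(x * x) for x in range(37)]   # length is not a multiple of any vl
reference = stencil_3pt_vla(data, 1)       # vl = 1 behaves like scalar code
# 2..32 lanes of 64-bit elements correspond to 128..2,048-bit vectors.
results = {vl: stencil_3pt_vla(data, vl) for vl in (2, 4, 8, 16, 32)}
```

The same function body is used for every value of `vl`; only the number of loop trips and the shape of the final predicate change, which is the property that lets one binary serve several register widths.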

In this paper we leverage the main characteristics of SVE to implement and optimize stencil computations, ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading or data reuse. Our detailed simulations using vector lengths ranging from 128 to 2,048 bits show that these optimizations can lead to performance improvements over straightforward vectorized code of up to 56.6% for 2,048-bit vectors. In addition, we show that certain optimizations can hurt performance due to a reduction in arithmetic intensity, and provide insight useful for compiler optimizers.
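
As a rough illustration of data reuse, one of the optimizations listed above, the following Python sketch compares a naive 1D 3-point stencil with a sliding-window variant and counts the loads each one issues. The kernel, the function names and the load-counting model are assumptions made for illustration; they are not the paper's kernels and say nothing about its measured results.

```python
def stencil_naive(src):
    """Every output reloads all three neighbours: 3 loads per point."""
    loads = 0
    dst = list(src)
    for j in range(1, len(src) - 1):
        left, centre, right = src[j - 1], src[j], src[j + 1]
        loads += 3
        dst[j] = (left + centre + right) / 3.0
    return dst, loads


def stencil_reuse(src):
    """Keep a sliding window in 'registers': 1 new load per point."""
    loads = 0
    dst = list(src)
    if len(src) < 3:
        return dst, loads      # no interior points to update
    left, centre = src[0], src[1]
    loads += 2                 # prime the window once
    for j in range(1, len(src) - 1):
        right = src[j + 1]
        loads += 1
        dst[j] = (left + centre + right) / 3.0
        left, centre = centre, right   # rotate the window, no reloads
    return dst, loads


data = [float(x * x) for x in range(37)]
naive_out, naive_loads = stencil_naive(data)
reuse_out, reuse_loads = stencil_reuse(data)
```

Both variants perform the same arithmetic per point, so in this toy model the ratio of floating-point operations to loads rises when values are reused. Whether a given transformation raises or lowers arithmetic intensity on real hardware depends on the kernel and the memory system.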

      • Published in

        PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
        November 2018
        494 pages
        ISBN:9781450359863
        DOI:10.1145/3243176

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 121 of 471 submissions, 26%
