
Stencil codes on a vector length agnostic architecture

Authors:
Adrià Armejach (Universitat Politècnica de Catalunya)
Helena Caminal (Cornell University)
Juan M. Cebrian (Barcelona Supercomputing Center)
Rekai González-Alberquilla (Arm, Cambridge, UK)
Chris Adeniyi-Jones (Arm, Cambridge, UK)
Mateo Valero (Barcelona Supercomputing Center)
Marc Casas (Barcelona Supercomputing Center)
Miquel Moretó (Barcelona Supercomputing Center)

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
November 2018, Article No.: 13, Pages 1–12
https://doi.org/10.1145/3243176.3243192

Published: 01 November 2018


ABSTRACT

Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is sensitive to code complexity, and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is Vector-Length Agnostic (VLA). VLA enables the generation of binary files that run regardless of the physical vector register length.
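To make the vector-length-agnostic idea concrete, the following is a minimal sketch in portable C, not SVE code and not the authors' implementation. It emulates the loop shape of a VLA kernel for a 3-point averaging stencil: the loop advances by a vector length that is only known at run time, and a per-lane predicate masks off lanes that fall past the end of the data. The function names and the `vl_elems` parameter are hypothetical; on real SVE hardware the lane count would be obtained from the hardware and the predicate from an instruction such as WHILELT, rather than passed in as an argument.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: 3-point average over interior points [1, n-1). */
void stencil3_scalar(const double *in, double *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

/* Emulated vector-length-agnostic version of the same stencil.
 * vl_elems (hypothetical parameter) stands in for the number of
 * 64-bit lanes in a hardware vector register, e.g. 2 for 128-bit
 * vectors and 32 for 2,048-bit vectors. The same code is correct
 * for any positive value. */
void stencil3_vla(const double *in, double *out, size_t n, size_t vl_elems)
{
    assert(vl_elems > 0);
    if (n < 3)
        return;

    /* Strip-mined loop: each outer iteration models one vector operation. */
    for (size_t base = 1; base < n - 1; base += vl_elems) {
        for (size_t lane = 0; lane < vl_elems; ++lane) {
            size_t i = base + lane;
            /* Per-lane predicate: lanes beyond the last interior point
             * are inactive, so no scalar tail loop is needed. */
            int active = i < n - 1;
            if (active)
                out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
        }
    }
}
```

Calling `stencil3_vla` with `vl_elems` set to 2, 4, 8, 16, or 32 produces the same interior values as `stencil3_scalar`, which mirrors the property described above: one binary, many vector lengths.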

In this paper, we leverage the main characteristics of SVE to implement and optimize stencil computations, which are ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading, or data reuse. Our detailed simulations using vector lengths ranging from 128 to 2,048 bits show that these optimizations can lead to performance improvements over straightforward vectorized code of up to 56.6% for 2,048-bit vectors. In addition, we show that certain optimizations can hurt performance due to a reduction in arithmetic intensity, and provide insight useful for compiler optimizers.
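As an illustration of one of the textbook optimizations named above, data reuse, here is a small scalar C sketch under stated assumptions. It is not taken from the paper and does not use SVE; the function names are hypothetical. The baseline loads three input values per output point, while the reuse variant carries two of them over from the previous iteration in local variables, so only one new value is loaded per point.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline 3-point average: three loads from `in` per output point. */
void stencil3_naive(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

/* Data-reuse variant: one new load per output point. The left and
 * center operands are rotated through local variables instead of
 * being reloaded from memory. */
void stencil3_reuse(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);
    if (n < 3)
        return;

    double left = in[0];
    double center = in[1];
    for (size_t i = 1; i + 1 < n; ++i) {
        double right = in[i + 1];            /* the only new load */
        out[i] = (left + center + right) / 3.0;
        left = center;                       /* rotate the window */
        center = right;
    }
}
```

Both functions add the operands in the same order, so their results match exactly. Whether such a transformation pays off on a given vector machine depends on the balance of loads and arithmetic, which is the kind of trade-off the abstract alludes to.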


Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Theory of computation
  1. Models of computation
    1. Concurrency
      1. Parallel computing models


Published in
PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
November 2018
494 pages
ISBN:9781450359863
DOI:10.1145/3243176
General Chair:
Skevos Evripidou
University of Cyprus, Cyprus
,
Program Chairs:
Per Stenström
Chalmers University of Technology, Sweden
,
Michael O'Boyle
University of Edinburgh, UK
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data-level parallelism
scalable vector extension
stencil computations
vector length agnostic
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 121 of 471 submissions, 26%