research-article

Dynamic compilation of data-parallel kernels for vector processors

Authors: Gregory Diamos, S. Yalamanchili

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

Pages 23 - 32

https://doi.org/10.1145/2259016.2259020

Published: 31 March 2012

Abstract

Modern processors enjoy augmented throughput and power efficiency through specialized functional units leveraged via instruction set extensions. These functional units accelerate performance for specific types of operations but must be programmed explicitly. Moreover, applications targeting these specialized units will not take advantage of future ISA extensions and tend not to be portable across multiple ISAs. As architecture designers increasingly rely on heterogeneity for performance improvements, the challenges of leveraging specialized functional units will only become more critical. In particular, exploiting software parallelism without sacrificing portability across the spectrum of commodity and multi-core SIMD processors remains elusive.

This work applies dynamic compilation to explicitly data-parallel kernels and describes a set of program transformations that efficiently compile bulk-synchronous scalar kernels for SIMD functional units while tolerating control-flow divergence. The approach is agnostic to specific ISA features, and performance is expected to scale from 2-wide to arbitrary-width vector units. The technique is evaluated with existing workloads originally targeting GPU computing. A microbenchmark written in CUDA that achieves near-peak throughput on a GPU achieves over 90% of peak throughput on an Intel Sandy Bridge processor. Real-world applications running on CPUs featuring SSE4 achieve speedups of up to 3.9x over current state-of-the-art heterogeneous compilers for data-parallel workloads.
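To make the idea of running a scalar kernel on SIMD lanes while tolerating divergence more concrete, here is a minimal sketch in Python. It is not the paper's transformation set: it illustrates one generic technique, mask-and-blend if-conversion, in which both sides of a data-dependent branch are computed for every lane and a per-lane mask selects the result. All names (`scalar_kernel`, `vector_kernel`, `run_scalar`, `run_vector`) and the kernel body are hypothetical, chosen only for illustration, and plain lists stand in for vector registers.

```python
from typing import List


def scalar_kernel(tid: int, x: List[float], out: List[float]) -> None:
    """Hypothetical per-thread kernel: each logical thread handles one element."""
    if x[tid] < 0.0:            # data-dependent branch: threads may diverge
        out[tid] = -x[tid]
    else:
        out[tid] = 2.0 * x[tid]


def vector_kernel(base: int, width: int, x: List[float], out: List[float]) -> None:
    """The same kernel for up to `width` consecutive threads run in lockstep."""
    lanes = range(base, min(base + width, len(x)))   # partial group at the tail
    v = [x[t] for t in lanes]                        # stands in for a vector load
    mask = [e < 0.0 for e in v]                      # per-lane branch condition
    then_v = [-e for e in v]                         # "then" side, all lanes
    else_v = [2.0 * e for e in v]                    # "else" side, all lanes
    blended = [a if m else b for m, a, b in zip(mask, then_v, else_v)]
    for t, r in zip(lanes, blended):                 # stands in for a vector store
        out[t] = r


def run_scalar(x: List[float]) -> List[float]:
    """Reference execution: one kernel invocation per thread."""
    out = [0.0] * len(x)
    for tid in range(len(x)):
        scalar_kernel(tid, x, out)
    return out


def run_vector(x: List[float], width: int) -> List[float]:
    """Lockstep execution: one kernel invocation per group of `width` threads."""
    out = [0.0] * len(x)
    for base in range(0, len(x), width):
        vector_kernel(base, width, x, out)
    return out


if __name__ == "__main__":
    data = [-1.0, 2.0, -3.5, 0.0, 4.0]
    for w in (2, 4, 8):
        # The vector width changes how threads are grouped, not the result.
        assert run_vector(data, w) == run_scalar(data)
    print(run_scalar(data))
```

Because the width is only a parameter of the grouping loop, the same kernel source can be reused for 2-wide, 4-wide, or wider units, which is the portability property the abstract describes. A real compiler would also need to handle side effects and loops under divergence, which this sketch deliberately omits.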

Published In

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

March 2012

285 pages

ISBN:9781450312066

DOI:10.1145/2259016

General Chairs: Carol Eidt (Microsoft), Anne Holler (VMware)

Program Chairs: Uma Srinivasan (Intel), Saman Amarasinghe (MIT)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CGO '12: Annual IEEE/ACM International Symposium on Code Generation and Optimization

March 31 - April 4, 2012

San Jose, California

Acceptance Rates

CGO '12 Paper Acceptance Rate 26 of 90 submissions, 29%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Article Metrics

Total Citations: 12
Total Downloads: 421
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 0

Reflects downloads up to 02 Mar 2025

Cited By

Wang S, Yu L, Her L, Hwang Y, Lee J (2021) Pointer-Based Divergence Analysis for OpenCL 2.0 Programs. ACM Transactions on Parallel Computing 8:4, pp. 1-23. Online publication date: 15-Oct-2021
https://dl.acm.org/doi/10.1145/3470644
Li Y, Gao Y, Wang D, Li Y, Xu J (2017) Optimizations of the Whole Function Vectorization Based on SIMD Characteristics. Parallel Architecture, Algorithm and Programming, pp. 152-171. Online publication date: 6-Oct-2017
https://doi.org/10.1007/978-981-10-6442-5_14
Moll S, Doerfert J, Hack S, Zaks A, Hermenegildo M (2016) Input space splitting for OpenCL. Proceedings of the 25th International Conference on Compiler Construction, pp. 251-260. Online publication date: 17-Mar-2016
https://dl.acm.org/doi/10.1145/2892208.2892217
Kim H, El Hajj I, Stratton J, Lumetta S, Hwu W, Olukotun K, Smith A, Hundt R, Mars J (2015) Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures. Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pp. 257-268. Online publication date: 7-Feb-2015
https://dl.acm.org/doi/10.5555/2738600.2738632
Schaub T, Moll S, Karrenberg R, Hack S (2015) The Impact of the SIMD Width on Control-Flow and Memory Divergence. ACM Transactions on Architecture and Code Optimization 11:4, pp. 1-25. Online publication date: 9-Jan-2015
https://dl.acm.org/doi/10.1145/2687355
Xu S, Gregg D (2015) Exploiting Hyper-Loop Parallelism in Vectorization to Improve Memory Performance on CUDA GPGPU. Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 03, pp. 53-60. Online publication date: 20-Aug-2015
https://dl.acm.org/doi/10.1109/Trustcom.2015.612
Kim H, Hajj I, Stratton J, Lumetta S, Hwu W (2015) Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures. 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 257-268. Online publication date: Feb-2015
https://doi.org/10.1109/CGO.2015.7054205
Huynh N, Vo T, Huynh H, Drogoul A (2015) Establishing Operational Models for Dynamic Compilation in a Simulation Platform. Nature of Computation and Communication, pp. 117-131. Online publication date: 24-Jan-2015
https://doi.org/10.1007/978-3-319-15392-6_12
Jo G, Jeon W, Jung W, Taft G, Lee J, Tanase G, Wu P, Falcou J, Tanase G, Wu P (2014) OpenCL framework for ARM processors with NEON support. Proceedings of the 2014 Workshop on Programming models for SIMD/Vector processing, pp. 33-40. Online publication date: 16-Feb-2014
https://dl.acm.org/doi/10.1145/2568058.2568064
Kim J, Torng C, Srinath S, Lockhart D, Batten C (2013) Microarchitectural mechanisms to exploit value structure in SIMT architectures. ACM SIGARCH Computer Architecture News 41:3, pp. 130-141. Online publication date: 23-Jun-2013
https://dl.acm.org/doi/10.1145/2508148.2485934
