DOI: 10.1145/2259016.2259020
research-article

Dynamic compilation of data-parallel kernels for vector processors

Published: 31 March 2012

Abstract

Modern processors enjoy augmented throughput and power efficiency through specialized functional units leveraged via instruction set extensions. These functional units accelerate performance for specific types of operations but must be programmed explicitly. Moreover, applications targeting these specialized units will not take advantage of future ISA extensions and tend not to be portable across multiple ISAs. As architecture designers increasingly rely on heterogeneity for performance improvements, the challenges of leveraging specialized functional units will only become more critical. In particular, exploiting software parallelism without sacrificing portability across the spectrum of commodity and multi-core SIMD processors remains elusive.
This work applies dynamic compilation to explicitly data-parallel kernels and describes a set of program transformations that efficiently compile bulk-synchronous scalar kernels for SIMD functional units while tolerating control-flow divergence. It is agnostic to specific features of ISAs, and performance scalability is expected from 2-wide to arbitrary-width vector units. This technique is evaluated with existing workloads originally targeting GPU computing. A microbenchmark written in CUDA that achieves near-peak throughput on a GPU achieves over 90% of peak throughput on an Intel Sandy Bridge processor. Real-world applications running on CPUs featuring SSE4 achieve speedups of up to 3.9x over current state-of-the-art heterogeneous compilers for data-parallel workloads.
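To make the idea of running a scalar, per-thread kernel on SIMD lanes concrete, the following is a minimal sketch in plain Python. It is not the transformation described in the paper: it illustrates predication (computing both sides of a branch and selecting per lane with a mask), which is one common way to tolerate control-flow divergence. The names `scalar_kernel`, `vectorized_kernel`, `launch`, and the width `WIDTH = 4` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: emulates SIMD lanes with Python lists.
# WIDTH is an assumed vector width (e.g. four 32-bit lanes in a 128-bit register).
WIDTH = 4


def scalar_kernel(tid, data):
    """Per-thread scalar kernel, written the way a data-parallel programmer would."""
    x = data[tid]
    if x % 2 == 0:      # threads may disagree on this branch (divergence)
        return x * 2
    return x + 1


def vectorized_kernel(first_tid, data, width=WIDTH):
    """Run up to `width` consecutive logical threads in lockstep.

    Divergence is handled by predication: both branch bodies are evaluated
    for every lane, and a per-lane mask selects which result to keep.
    """
    lanes = data[first_tid:first_tid + width]
    mask = [x % 2 == 0 for x in lanes]       # per-lane branch condition
    taken = [x * 2 for x in lanes]           # "then" side for all lanes
    not_taken = [x + 1 for x in lanes]       # "else" side for all lanes
    return [t if m else n for m, t, n in zip(mask, taken, not_taken)]


def launch(data, width=WIDTH):
    """Cover all logical threads with groups of `width` lanes."""
    out = []
    for first in range(0, len(data), width):
        out.extend(vectorized_kernel(first, data, width))
    return out
```

Calling `launch(data)` should give the same results as calling `scalar_kernel(tid, data)` for every thread index. Predication wastes work when lanes diverge, since every lane pays for both branch bodies, which is why a real compiler might prefer other strategies for heavily divergent code.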

Published In

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization
March 2012
285 pages
ISBN:9781450312066
DOI:10.1145/2259016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

CGO '12

Acceptance Rates

CGO '12 Paper Acceptance Rate 26 of 90 submissions, 29%;
Overall Acceptance Rate 312 of 1,061 submissions, 29%

Cited By

  • (2021) Pointer-Based Divergence Analysis for OpenCL 2.0 Programs. ACM Transactions on Parallel Computing 8:4 (1-23). DOI: 10.1145/3470644. Online publication date: 15-Oct-2021
  • (2017) Optimizations of the Whole Function Vectorization Based on SIMD Characteristics. Parallel Architecture, Algorithm and Programming (152-171). DOI: 10.1007/978-981-10-6442-5_14. Online publication date: 6-Oct-2017
  • (2016) Input space splitting for OpenCL. Proceedings of the 25th International Conference on Compiler Construction (251-260). DOI: 10.1145/2892208.2892217. Online publication date: 17-Mar-2016
  • (2015) Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures. Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (257-268). DOI: 10.5555/2738600.2738632. Online publication date: 7-Feb-2015
  • (2015) The Impact of the SIMD Width on Control-Flow and Memory Divergence. ACM Transactions on Architecture and Code Optimization 11:4 (1-25). DOI: 10.1145/2687355. Online publication date: 9-Jan-2015
  • (2015) Exploiting Hyper-Loop Parallelism in Vectorization to Improve Memory Performance on CUDA GPGPU. Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 03 (53-60). DOI: 10.1109/Trustcom.2015.612. Online publication date: 20-Aug-2015
  • (2015) Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures. 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (257-268). DOI: 10.1109/CGO.2015.7054205. Online publication date: Feb-2015
  • (2015) Establishing Operational Models for Dynamic Compilation in a Simulation Platform. Nature of Computation and Communication (117-131). DOI: 10.1007/978-3-319-15392-6_12. Online publication date: 24-Jan-2015
  • (2014) OpenCL framework for ARM processors with NEON support. Proceedings of the 2014 Workshop on Programming models for SIMD/Vector processing (33-40). DOI: 10.1145/2568058.2568064. Online publication date: 16-Feb-2014
  • (2013) Microarchitectural mechanisms to exploit value structure in SIMT architectures. ACM SIGARCH Computer Architecture News 41:3 (130-141). DOI: 10.1145/2508148.2485934. Online publication date: 23-Jun-2013
