research-article

An OpenCL framework for heterogeneous multicores with local memory

Authors:

Thanh Tuan Dao,

Jong-Deok Choi

PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques

Pages 193 - 204

https://doi.org/10.1145/1854273.1854301

Published: 11 September 2010

Abstract

In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Instead, each accelerator core has a small internal local memory. To overcome the limited size of the local memory, our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency. To boost performance, the runtime relies on three source-code transformation techniques performed by our OpenCL C source-to-source translator: work-item coalescing, web-based variable expansion, and preload-poststore buffering. Work-item coalescing serializes multiple SPMD-like tasks that execute concurrently in the presence of barriers and runs them sequentially on a single accelerator core. It requires web-based variable expansion to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses. Together with work-item coalescing, it has a synergistic effect on performance. We demonstrate the effectiveness of our OpenCL framework by evaluating its performance on a system that consists of two Cell BE processors. The experimental results show that our approach is promising.
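The three transformations named in the abstract can be illustrated with a small hand-written sketch in plain C. This is not the output of the authors' translator: the kernel `neighbor_sum`, the constant `LOCAL_SIZE`, and the buffer names are hypothetical, and `memcpy` merely stands in for whatever bulk transfer (for example DMA) a real accelerator runtime would use. The sketch assumes a one-dimensional work-group with a single barrier.

```c
#include <string.h>

#define LOCAL_SIZE 8  /* assumed number of work-items in one work-group */

/* Hypothetical OpenCL C kernel, shown only for reference:
 *
 *   __kernel void neighbor_sum(__global const int *in, __global int *out) {
 *       __local int tmp[LOCAL_SIZE];
 *       int lid = get_local_id(0);
 *       int x = in[lid] * 2;              // private, live across the barrier
 *       tmp[lid] = x;
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       out[lid] = x + tmp[(lid + 1) % LOCAL_SIZE];
 *   }
 */

/* Hand-coalesced version: one core runs every work-item of the group. */
void neighbor_sum_coalesced(const int *in, int *out)
{
    int tmp[LOCAL_SIZE];     /* stands in for the __local array */
    int x[LOCAL_SIZE];       /* variable expansion: one copy of x per work-item */
    int in_buf[LOCAL_SIZE];  /* preload buffer in (simulated) local memory */
    int out_buf[LOCAL_SIZE]; /* poststore buffer in (simulated) local memory */

    /* Preload: fetch the whole input region once instead of going
     * through a software cache on every access. */
    memcpy(in_buf, in, sizeof in_buf);

    /* Code before the barrier, serialized over all work-items. */
    for (int lid = 0; lid < LOCAL_SIZE; lid++) {
        x[lid] = in_buf[lid] * 2;
        tmp[lid] = x[lid];
    }

    /* The barrier becomes the boundary between the two loops: every
     * work-item has finished the first region before any starts the second. */
    for (int lid = 0; lid < LOCAL_SIZE; lid++) {
        out_buf[lid] = x[lid] + tmp[(lid + 1) % LOCAL_SIZE];
    }

    /* Poststore: write the results back in one bulk copy. */
    memcpy(out, out_buf, sizeof out_buf);
}
```

Calling `neighbor_sum_coalesced` with `in[i] = i` for `i` from 0 to 7 fills `out` with 2, 6, 10, 14, 18, 22, 26, 14. The private variable `x` has to become an array because its value is defined before the barrier and used after it; a scalar would retain only the last work-item's value once the first loop finishes.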

Cited By

Han R, Zhao J, Kim H (2024). Unleashing CPU Potential for Executing GPU Programs Through Compiler/Runtime Optimizations. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 186-200. Online publication date: 2-Nov-2024. https://doi.org/10.1109/MICRO61859.2024.00023
Yu M, Ma G, Wang Z, Tang S, Chen Y, Wang Y, Liu Y, Jia D, Wei Z (2024). swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer. CCF Transactions on High Performance Computing, 6:4, pages 439-458. Online publication date: 11-Jan-2024. https://doi.org/10.1007/s42514-023-00159-7
Liao Y, Cheng Y, Zhang Y, Wu H, Lu R (2022). The interactive system of Bloch sphere for quantum computing education. 2022 IEEE International Conference on Quantum Computing and Engineering (QCE), pages 718-723. Online publication date: Sep-2022. https://doi.org/10.1109/QCE53715.2022.00097

Index Terms

An OpenCL framework for heterogeneous multicores with local memory
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
      2. Source code generation

Published In

PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques

September 2010

596 pages

ISBN:9781450301787

DOI:10.1145/1854273

General Chair: Valentina Salapura (IBM TJ Watson Research Center)
Program Chairs: Michael Gschwind (IBM Systems & Technology Group), Jens Knoop (Technische Universität Wien)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IFIP WG 10.3: IFIP working group 10.3 on concurrent systems
IEEE CS TCPP: IEEE-CS technical committee on parallel processing
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PACT '10

PACT '10: International Conference on Parallel Architectures and Compilation Techniques

September 11 - 15, 2010

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Article Metrics

Total Citations: 50
Total Downloads: 2,352
Downloads (last 12 months): 18
Downloads (last 6 weeks): 1

Reflects downloads up to 20 Feb 2025

Cited By

Zhao X, Wen M, Chen Z, Shi Y, Zhang C, Henkel J, Liu X (2021). Automatic mapping and code optimization for OpenCL kernels on FT-matrix architecture (WIP paper). Proceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, pages 37-41. Online publication date: 22-Jun-2021. https://dl.acm.org/doi/10.1145/3461648.3463845
Liu Y, Huang L, Wu M, Cui H, Lv F, Feng X, Xue J, Amaral J, Kulkarni M (2019). PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion. Proceedings of the 28th International Conference on Compiler Construction, pages 2-16. Online publication date: 16-Feb-2019. https://dl.acm.org/doi/10.1145/3302516.3307350
Zhang P, Fang J, Yang C, Tang T, Huang C, Wang Z, Kaeli D, Pericàs M (2018). MOCL. Proceedings of the 15th ACM International Conference on Computing Frontiers, pages 26-35. Online publication date: 8-May-2018. https://dl.acm.org/doi/10.1145/3203217.3203244
Chen K, Chen C (2018). Enabling SIMT Execution Model on Homogeneous Multi-Core System. ACM Transactions on Architecture and Code Optimization, 15:1, pages 1-26. Online publication date: 22-Mar-2018. https://dl.acm.org/doi/10.1145/3177960
Moren K, Göhringer D (2018). Automatic Mapping for OpenCL-Programs on CPU/GPU Heterogeneous Platforms. Computational Science – ICCS 2018, pages 301-314. Online publication date: 11-Jun-2018. https://dl.acm.org/doi/10.1007/978-3-319-93701-4_23
Kim H, El Hajj I, Stratton J, Lumetta S, Hwu W, Olukotun K, Smith A, Hundt R, Mars J (2015). Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures. Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 257-268. Online publication date: 7-Feb-2015. https://dl.acm.org/doi/10.5555/2738600.2738632
Wen M, Huang D, Xun C, Chen D (2015). Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations. Frontiers of Information Technology & Electronic Engineering, 16:11, pages 899-916. Online publication date: 7-Nov-2015. https://doi.org/10.1631/FITEE.1500032
