DOI: 10.1145/1854273.1854301
research-article

An OpenCL framework for heterogeneous multicores with local memory

Published: 11 September 2010

Abstract

In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically have no cache. Instead, each accelerator core has a small internal local memory. To overcome the limited size of the local memory, our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency. To boost performance, the runtime relies on three source-code transformation techniques performed by our OpenCL C source-to-source translator: work-item coalescing, web-based variable expansion, and preload-poststore buffering. Work-item coalescing serializes multiple SPMD-like tasks that execute concurrently in the presence of barriers and runs them sequentially on a single accelerator core. It requires web-based variable expansion to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses; combined with work-item coalescing, it has a synergistic effect on performance. We demonstrate the effectiveness of our OpenCL framework by evaluating its performance on a system that consists of two Cell BE processors. The experimental results show that our approach is promising.
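
As a rough illustration of work-item coalescing and variable expansion, the sketch below hand-translates a small, hypothetical kernel into plain C. It is not output from the authors' translator, which this page does not show. The kernel, the name `coalesced_kernel`, and the fixed `WG_SIZE` are assumptions made for the example. Each region between barriers becomes a loop over work-items, and a private variable that is live across the barrier is expanded into an array with one slot per work-item.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE 4  /* hypothetical work-group size, fixed for illustration */

/*
 * Hypothetical OpenCL C kernel being translated (reference only):
 *
 *   __kernel void k(__global float *a, __local float *tmp) {
 *       int   i = get_local_id(0);
 *       float x = a[i] * 2.0f;          // x is live across the barrier
 *       tmp[i] = x;
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       int next = (i + 1) % WG_SIZE;   // next is used only after it
 *       a[i] = x + tmp[next];
 *   }
 */
void coalesced_kernel(float *a)
{
    assert(a != NULL);

    float tmp[WG_SIZE];  /* stands in for the __local buffer */
    float x[WG_SIZE];    /* private x expanded: one slot per work-item */

    /* Region before the barrier, run for every work-item in turn. */
    for (int i = 0; i < WG_SIZE; i++) {
        x[i] = a[i] * 2.0f;
        tmp[i] = x[i];
    }

    /* The barrier becomes the boundary between the two loops: every
     * work-item has finished the first region before any starts the second. */

    /* Region after the barrier. */
    for (int i = 0; i < WG_SIZE; i++) {
        int next = (i + 1) % WG_SIZE;  /* not live across the barrier: stays scalar */
        a[i] = x[i] + tmp[next];
    }
}
```

Only `x` needs expansion here, because its value is defined before the barrier and used after it. `next` lives entirely inside one region, so a single scalar suffices, which keeps the local-memory footprint small.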

      Published In

      PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques
      September 2010
      596 pages
      ISBN:9781450301787
      DOI:10.1145/1854273

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. OpenCL
      2. compilers
      3. memory consistency
      4. preload-poststore buffering
      5. runtime
      6. software-managed caches
      7. work-item coalescing

      Qualifiers

      • Research-article

      Conference

      PACT '10
      Sponsor:
      • IFIP WG 10.3
      • IEEE CS TCPP
      • SIGARCH
      • IEEE CS TCAA

      Acceptance Rates

      Overall Acceptance Rate 121 of 471 submissions, 26%

      Article Metrics

      • Downloads (last 12 months): 18
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 20 Feb 2025

      Cited By

      • (2024) Unleashing CPU Potential for Executing GPU Programs Through Compiler/Runtime Optimizations. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 186-200. DOI: 10.1109/MICRO61859.2024.00023. Online publication date: 2-Nov-2024
      • (2024) swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer. CCF Transactions on High Performance Computing, 6:4, pp. 439-458. DOI: 10.1007/s42514-023-00159-7. Online publication date: 11-Jan-2024
      • (2022) The interactive system of Bloch sphere for quantum computing education. 2022 IEEE International Conference on Quantum Computing and Engineering (QCE), pp. 718-723. DOI: 10.1109/QCE53715.2022.00097. Online publication date: Sep-2022
      • (2021) Automatic mapping and code optimization for OpenCL kernels on FT-matrix architecture (WIP paper). Proceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 37-41. DOI: 10.1145/3461648.3463845. Online publication date: 22-Jun-2021
      • (2019) PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion. Proceedings of the 28th International Conference on Compiler Construction, pp. 2-16. DOI: 10.1145/3302516.3307350. Online publication date: 16-Feb-2019
      • (2018) MOCL. Proceedings of the 15th ACM International Conference on Computing Frontiers, pp. 26-35. DOI: 10.1145/3203217.3203244. Online publication date: 8-May-2018
      • (2018) Enabling SIMT Execution Model on Homogeneous Multi-Core System. ACM Transactions on Architecture and Code Optimization, 15:1, pp. 1-26. DOI: 10.1145/3177960. Online publication date: 22-Mar-2018
      • (2018) Automatic Mapping for OpenCL-Programs on CPU/GPU Heterogeneous Platforms. Computational Science – ICCS 2018, pp. 301-314. DOI: 10.1007/978-3-319-93701-4_23. Online publication date: 11-Jun-2018
      • (2015) Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures. Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pp. 257-268. DOI: 10.5555/2738600.2738632. Online publication date: 7-Feb-2015
      • (2015) Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations. Frontiers of Information Technology & Electronic Engineering, 16:11, pp. 899-916. DOI: 10.1631/FITEE.1500032. Online publication date: 7-Nov-2015
