A Survey of CPU-GPU Heterogeneous Computing Techniques

Authors:
Sparsh Mittal

Oak Ridge National Laboratory, Tennessee, USA

Jeffrey S. Vetter

Oak Ridge National Laboratory and Georgia Tech, Tennessee, USA

ACM Computing Surveys, Volume 47, Issue 4, Article No. 69, pp. 1–35. https://doi.org/10.1145/2788396

Published: 21 July 2015

Abstract

As both CPUs and GPUs are employed in a wide range of applications, it has been acknowledged that each of these Processing Units (PUs) has unique features and strengths; hence, CPU-GPU collaboration is inevitable for achieving high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this article, we survey Heterogeneous Computing Techniques (HCTs), such as workload partitioning, that enable the use of both CPUs and GPUs to improve performance and/or energy efficiency. We review heterogeneous computing approaches at the runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating Heterogeneous Computing Systems (HCSs). We believe that this article will give researchers insights into the workings and scope of applications of HCTs and motivate them to further harness the computational power of CPUs and GPUs to achieve the goal of exascale performance.

OpenMP 4.0. 2014. Homepage. Retrieved from http://openmp.org/wp/2013/07/openmp-40/.Google Scholar
Edson Luiz Padoin, Laércio Lima Pilla, Francieli Zanon Boito, Rodrigo Virote Kassick, Pedro Velho, and Philippe O. A. Navaux. 2013. Evaluating application performance and energy consumption on hybrid CPU+ GPU architecture. Cluster Computing 16, 3 (Sept. 2013), 511--525. Google ScholarDigital Library
Sreepathi Pai, Ramaswamy Govindarajan, and Matthew Jacob Thazhuthaveetil. 2010. PLASMA: Portable programming for SIMD heterogeneous accelerators. In Proceedings of the Workshop on Language, Compiler, and Architecture Support for GPGPU.Google Scholar
Anthony Pajot, Loïc Barthe, Mathias Paulin, and Pierre Poulin. 2011. Combinatorial bidirectional path-tracing for efficient hybrid CPU/GPU rendering. In Computer Graphics Forum 30, 2 (April 2011), 315--324.Google ScholarCross Ref
Prasanna Pandit and R. Govindarajan. 2014. Fluidic kernels: Cooperative execution of OpenCL programs on multiple heterogeneous devices. In Proceedings of the International Symposium on Code Generation and Optimization (CGO). Article 273, 273:273--273:283. Google ScholarDigital Library
Jairo Panetta, Thiago Teixeira, Paulo R. P. de Souza Filho, Carlos A. da Cunha Filho, David Sotelo, Fernando M. Roxo da Motta, Silvio Sinedino Pinheiro, Ivan Pedrosa Junior, Andre L. Romanelli Rosa, Luiz R. Monnerat, Leandro T. Carneiro, and Carlos H. B. de Albrecht. 2009. Accelerating Kirchhoff migration by CPU and GPU cooperation. In Proceedings of the 21st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’09). 26--32. Google ScholarDigital Library
Manolis Papadrakakis, George Stavroulakis, and Alexander Karatarakis. 2011. A new era in scientific computing: Domain decomposition methods in hybrid CPU--GPU architectures. Computer Methods in Applied Mechanics and Engineering 200, 13 (2011), 1490--1508.Google ScholarCross Ref
Song Jun Park, James A. Ross, Dale R. Shires, David A. Richie, Brian J. Henz, and Lam H. Nguyen. 2011. Hybrid core acceleration of UWB SIRE radar signal processing. IEEE Transactions on Parallel and Distributed Systems 22, 1 (2011), 46--57. Google ScholarDigital Library
Phitchaya Mangpo Phothilimthana, Jason Ansel, Jonathan Ragan-Kelley, and Saman Amarasinghe. 2013. Portable performance on heterogeneous architectures. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems. 431--444. Google ScholarDigital Library
Jacques A. Pienaar, Srimat Chakradhar, and Anand Raghunathan. 2012. Automatic generation of software pipelines for heterogeneous parallel systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12). Article 24. 1--12. Google ScholarDigital Library
Jacques A. Pienaar, Anand Raghunathan, and Srimat Chakradhar. 2011. MDR: Performance model driven runtime for heterogeneous parallel platforms. In Proceedings of the ACM International Conference on Supercomputing. ACM, New York, NY, 225--234. Google ScholarDigital Library
Holger Pirk, Thibault Sellam, Stefan Manegold, and Martin Kersten. 2012. X-device query processing by bitwise distribution. In 8th International Workshop on Data Management on New Hardware. ACM, New York, NY, 48--54. Google ScholarDigital Library
Usman Pirzada. 2015. Nvidia Geforce GTX TITAN X Unveiled - GM200 ‘Big Daddy Maxwell’, 12GB VRam and 8 Billion Transistors. Retrieved from wccftech.com/nvidia-gtx-titan-x-revealed-gdc-2015/.Google Scholar
Matthew Poremba, Sparsh Mittal, Dong Li, Jeffrey Vetter, and Yuan Xie. 2015. DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches. In DATE. 1543--1546. Google ScholarDigital Library
Ashwin Prasad, Jayvant Anantpur, and R. Govindarajan. 2011. Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors. In ACM Sigplan Notices 46, 6 (June 2011), 152--163. Google ScholarDigital Library
Abtin Rahimian, Ilya Lashuk, Shravan Veerapaneni, Aparna Chandramowlishwaran, Dhairya Malhotra, Logan Moon, Rahul Sampath, Aashay Shringarpure, Jeffrey Vetter, Richard Vuduc, Denis Zorin, and George Biros. 2010. Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10). 1--11. Google ScholarDigital Library
Vignesh T. Ravi and Gagan Agrawal. 2011. A dynamic scheduling framework for emerging heterogeneous systems. In Proceedings of the 18th International Conference on High Performance Computing (HiPC’11). 1--10. Google ScholarDigital Library
Vignesh T. Ravi, Wenjing Ma, David Chiu, and Gagan Agrawal. 2010. Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. In Proceedings of the 24th ACM International Conference on Supercomputing. ACM, New York, NY, 137--146. Google ScholarDigital Library
Vignesh T. Ravi, Wenjing Ma, David Chiu, and Gagan Agrawal. 2012. Compiler and runtime support for enabling reduction computations on heterogeneous systems. Concurrency and Computation: Practice and Experience 24, 5 (2012), 463--480. Google ScholarDigital Library
Mahsan Rofouei, Thanos Stathopoulos, Sebi Ryffel, William Kaiser, and Majid Sarrafzadeh. 2008. Energy-aware high performance computing with graphic processing units. In Proceedings of the 2008 Conference on Power Aware Computing and Systems. Google ScholarDigital Library
Bratin Saha, Xiaocheng Zhou, Hu Chen, Ying Gao, Shoumeng Yan, Mohan Rajagopalan, Jesse Fang, Peinan Zhang, Ronny Ronen, and Avi Mendelson. 2009. Programming model for a heterogeneous x86 platform. In ACM Sigplan Notices 44, 6 (2009), 431--440. Google ScholarDigital Library
Thomas R. W. Scogland, Wu-chun Feng, Barry Rountree, and Bronis R. de Supinski. 2014. CoreTSAR: Adaptive worksharing for heterogeneous systems. In Supercomputing, Lecture Notes in Computer Science, Volume 8488. Springer International Publishing, 172--186. Google ScholarDigital Library
Thomas R. W. Scogland, Barry Rountree, Wu-chun Feng, and Bronis R. de Supinski. 2012. Heterogeneous task scheduling for accelerated OpenMP. In Proceedings of the IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS’’12). 144--155. Google ScholarDigital Library
Jie Shen, Ana Lucia Varbanescu, Henk Sips, Michael Arntzen, and Dick G. Simons. 2013. Glinda: A framework for accelerating imbalanced applications on heterogeneous platforms. In Proceedings of the ACM International Conference on Computing Frontiers. ACM, New York, NY, Article 14. Google ScholarDigital Library
Wenfeng Shen, Daming Wei, Weimin Xu, Xin Zhu, and Shizhong Yuan. 2010. Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU. Computer Methods and Programs in Biomedicine 100, 1 (2010), 87--96. Google ScholarDigital Library
Takashi Shimokawabe, Takayuki Aoki, Tomohiro Takaki, Akinori Yamanaka, Akira Nukada, Toshio Endo, Naoya Maruyama, and Satoshi Matsuoka. 2011. Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11). ACM, New York, NY. Article 3. Google ScholarDigital Library
Koichi Shirahata, Hitoshi Sato, and Satoshi Matsuoka. 2010. Hybrid map task scheduling for GPU-based heterogeneous clusters. In Proceedings of the IEEE 2nd International Conference on Cloud Computing Technology and Science (CloudCom’10). 733--740. Google ScholarDigital Library
Sambit K. Shukla and Laxmi N. Bhuyan. 2013. A hybrid shared memory heterogeneous execution platform for PCIe-based GPGPUs. In 20th International Conference on High Performance Computing. 343--352.Google Scholar
Jakob Siegel, Oreste Villa, Sriram Krishnamoorthy, Antonino Tumeo, and Xiaoming Li. 2010. Efficient sparse matrix-matrix multiplication on heterogeneous high performance systems. In 2010 IEEE International Conference on Cluster Computing Workshops. 1--8.Google ScholarCross Ref
Mark Silberstein and Naoya Maruyama. 2011. An exact algorithm for energy-efficient acceleration of task trees on CPU/GPU architectures. In Proceedings of the 4th Annual International Conference on Systems and Storage (SYSTOR’11). ACM, New York, NY, Article 7. Google ScholarDigital Library
Jaideeep Singh and Ipseeta Aruni. 2011. Accelerating smith-waterman on heterogeneous cpu-gpu systems. In Proceedings of the 5th International Conference on Bioinformatics and Biomedical Engineering (iCBBE’11). 1--4.Google ScholarCross Ref
Hayden K. H. So, Junying Chen, Billy YS Yiu, and Alfred C. H. Yu. 2011. Medical ultrasound imaging: To GPU or not to GPU? IEEE Micro 31, 5 (2011), 54--65. Google ScholarDigital Library
F. Sottile, C. Roedl, V. Slavnic, P. Jovanovic, D. Stankovic, P. Kestener, and Franck Houssen. 2013. GPU implementation of the DP code. Partnership for Advanced Computing in Europe.Google Scholar
Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter. 2010. Maestro: Data orchestration and tuning for OpenCL devices. In Euro-Par 2010-Parallel Processing, Lecture Notes in Computer Science, Volume 6272. Springer, Berlin, 275--286. Google ScholarDigital Library
Kyle L. Spafford, Jeremy S. Meredith, Seyong Lee, Dong Li, Philip C. Roth, and Jeffrey S. Vetter. 2012. The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. In Proceedings of the 9th Conference on Computing Frontiers. 103--112. Google ScholarDigital Library
Tomasz P. Stefanski. 2013. Implementation of FDTD-compatible Green’s function on heterogeneous CPU-GPU parallel processing system. Progress in Electromagnetics Research 135 (2013), 297--316.Google ScholarCross Ref
John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12, 3 (2010), 66--73. Google ScholarDigital Library
Przemysław Stpiczynski. 2011. Solving linear recurrences on hybrid GPU accelerated manycore systems. In Proceedings of the Federated Conference on Computer Science and Information Systems. 465--470.Google Scholar
Przemysław Stpiczynski and Joanna Potiopa. 2010. Solving a kind of BVP for ODEs on heterogeneous CPU+ CUDA-enabled GPU systems. In Proceedings of the International Multiconference on Computer Science and Information Technology (IMCSIT’10). IEEE, 349--353.Google ScholarCross Ref
John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and W. M. W. Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report. Center for Reliable and High-Performance Computing.Google Scholar
Yu Su, Ding Ye, and Jingling Xue. 2013. Accelerating inclusion-based pointer analysis on heterogeneous CPU-GPU systems. In Proceedings of the 20th International Conference on High Performance Computing. 149--158.Google ScholarCross Ref
Enqiang Sun, Dana Schaa, Richard Bagley, Norman Rubin, and David Kaeli. 2012. Enabling task-level scheduling on heterogeneous platforms. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units. ACM, New York, NY, 84--93. Google ScholarDigital Library
Hiroyuki Takizawa, Katsuto Sato, and Hiroaki Kobayashi. 2008. SPRAT: Runtime processor selection for energy-aware computing. In Proceedings of the IEEE International Conference on Cluster Computing. 386--393.Google ScholarCross Ref
Yu Shyang Tan, Bu-Sung Lee, Bingsheng He, and Roy H. Campbell. 2012. A map-reduce based framework for heterogeneous processing element cluster environments. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). 57--64. Google ScholarDigital Library
George Teodoro, Tahsin M. Kurc, Tony Pan, Lee A. D. Cooper, Jun Kong, Patrick Widener, and Joel H. Saltz. 2012. Accelerating large scale image analyses on parallel, CPU-GPU equipped systems. In Proceedings of the IEEE 26th International Parallel & Distributed Processing Symposium. 1093--1104. Google ScholarDigital Library
George Teodoro, Tony Pan, Tahsin M. Kurc, Jun Kong, Lee A. D. Cooper, and Joel H. Saltz. 2013. Efficient irregular wavefront propagation algorithms on hybrid CPU-GPU machines. Parallel Comput. 39, 4--5 (2013), 189--211.Google ScholarCross Ref
George Teodoro, Rafael Sachetto, Olcay Sertel, Metin N. Gurcan, W. Meira, Umit Catalyurek, and Renato Ferreira. 2009. Coordinating the use of GPU and CPU for improving performance of compute intensive applications. In International Conference on Cluster Computing and Workshops. 1--10.Google ScholarCross Ref
Pablo Toharia, Oscar D. Robles, Ricardo SuáRez, Jose Luis Bosque, and Luis Pastor. 2012. Shot boundary detection using Zernike moments in multi-GPU multi-CPU architectures. Journal of Parallel and Distributed Computing 72, 9 (Sept. 2012), 1127--1133. Google ScholarDigital Library
Stanimire Tomov, Rajib Nath, and Jack Dongarra. 2010. Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing. Parallel Computing 36, 12 (2010), 645--654. Google ScholarDigital Library
Top500. 2014. Top500 Supercomputers. www.top500.org.Google Scholar
Kuen Hung Tsoi and Wayne Luk. 2010. Axel: A heterogeneous cluster with FPGAs and GPUs. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, New York, NY, 115--124. Google ScholarDigital Library
Fernando Tsuda and Ricardo Nakamura. 2011. A technique for collision detection and 3D interaction based on parallel GPU and CPU processing. In Proceedings of the 2011 Brazilian Symposium on Games and Digital Entertainment (SBGAMES’11). 36--42. Google ScholarDigital Library
Abhishek Udupa, R. Govindarajan, and Matthew J. Thazhuthaveetil. 2009. Synergistic execution of stream programs on multicores with accelerators. ACM Sigplan Notices 44, 7 (2009), 99--108. Google ScholarDigital Library
Yash Ukidave, Amir Kavyan Ziabari, Perhaad Mistry, Gunar Schirner, and David Kaeli. 2013. Quantifying the energy efficiency of FFT on heterogeneous platforms. In Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’13). 235--244.Google ScholarCross Ref
Ronald Veldema, Thorsten Blass, and Michael Philippsen. 2011. Enabling multiple accelerator acceleration for Java/OpenMP. In Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism (HotPar). Google ScholarDigital Library
Sundaresan Venkatasubramanian and Richard W. Vuduc. 2009. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd International Conference on Supercomputing (ICS’09). ACM, New York, NY, 244--255. Google ScholarDigital Library
Uri Verner, Assaf Schuster, and Mark Silberstein. 2011. Processing data streams with hard real-time constraints on heterogeneous systems. In Proceedings of the international conference on Supercomputing (ICS’11). ACM, New York, NY, 120--129. Google ScholarDigital Library
Jeffrey S. Vetter and Sparsh Mittal. 2015. Opportunities for nonvolatile memory systems in extreme-scale high performance computing. Computing in Science and Engineering 17, 2 (2015), 73--82.Google ScholarDigital Library
Christof Vömel, Stanimire Tomov, and Jack Dongarra. 2012. Divide and conquer on hybrid GPU-accelerated multicore systems. SIAM Journal on Scientific Computing 34, 2 (2012), C70--C82. Google ScholarDigital Library
Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. 2010. On the limits of GPU acceleration. In Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism. Google ScholarDigital Library
Guibin Wang and Wei Song. 2011. Communication-aware task partition and voltage scaling for energy minimization on heterogeneous parallel systems. In Proceedings of the 12th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT’11). 327--333. Google ScholarDigital Library
Yueqing Wang, Yong Dou, Song Guo, Yuanwu Lei, and Dan Zou. 2014. CPU--GPU hybrid parallel strategy for cosmological simulations. Concurrency and Computation: Practice and Experience 26, 3 (March 2014), 748--765. Google ScholarDigital Library
Yu Wang, Haixiao Du, Mingrui Xia, Ling Ren, Mo Xu, Teng Xie, Gaolang Gong, Ningyi Xu, Huazhong Yang, and Yong He. 2013a. A hybrid CPU-GPU accelerated framework for fast mapping of high-resolution human brain connectome. PloS ONE 8, 5 (2013), e62789.Google ScholarCross Ref
Zhenning Wang, Long Zheng, Quan Chen, and Minyi Guo. 2013b. CAP: Co-scheduling based on asymptotic profiling in CPU+ GPU hybrid systems. In International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, New York, NY, 107--114. Google ScholarDigital Library
Mei Wen, Huayou Su, Wenjie Wei, Nan Wu, Xing Cai, and Chunyuan Zhang. 2012. Using 1000+ GPUs and 10000+ CPUs for sedimentary basin simulations. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER). 27--35. Google ScholarDigital Library
Dieter Wendel, Ronald Kalla, Joshua Friedrich, James Kahle, Jens Leenstra, Cedric Lichtenau, Balaram Sinharoy, William Starke, and Victor Zyuban. 2010. The POWER7 processor SoC. In Proceedings of the International Conference on IC Design and Technology (ICICDT’10). 71--73.Google Scholar
Qiang Wu, Canqun Yang, Feng Wang, and Jingling Xue. 2012. A fast parallel implementation of molecular dynamics with the Morse potential on a heterogeneous petascale supercomputer. In International Parallel and Distributed Processing Symposium Workshops & PhD Forum. 140--149. Google ScholarDigital Library
Tin-Yu Wu, Wei-Tsong Lee, Chien-Yu Duan, and Tain-Wen Suen. 2013. Enhancing cloud-based servers by GPU/CPU virtualization management. In Advances in Intelligent Systems and Applications-Volume 2. Springer, 185--194.Google Scholar
Shucai Xiao, Heshan Lin, and Wu-chun Feng. 2011. Accelerating protein sequence search in a heterogeneous computing system. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium. 1212--1222. Google ScholarDigital Library
Ming Xu, Feiguo Chen, Xinhua Liu, Wei Ge, and Jinghai Li. 2012. Discrete particle simulation of gas-solid two-phase flows with multi-scale CPU-GPU hybrid computation. Chemical Engineering Journal 207-208 (Oct. 2012), 746--757.Google Scholar
Canqun Yang, Feng Wang, Yunfei Du, Juan Chen, Jie Liu, Huizhan Yi, and Kai Lu. 2010. Adaptive optimization for petascale heterogeneous CPU/GPU computing. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER). 19--28. Google ScholarDigital Library
Chao Yang, Wei Xue, Haohuan Fu, Lin Gan, Linfeng Li, Yangtong Xu, Yutong Lu, Jiachang Sun, Guangwen Yang, and Weimin Zheng. 2013. A peta-scalable CPU-GPU algorithm for global atmospheric simulations. In Proceedings of the 18th ACM/SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’13). ACM, New York, NY, 1--12. Google ScholarDigital Library
Ping Yao, Hong An, Mu Xu, Gu Liu, Xiaoqiang Li, Yaobin Wang, and Wenting Han. 2010. CuHMMer: A load-balanced CPU-GPU cooperative bioinformatics application. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS’10). IEEE, 24--30.Google Scholar
Marcelo Yuffe, Ernest Knoll, Moty Mehalel, Joseph Shor, and Tsvika Kurts. 2011. A fully integrated multi-CPU, GPU and memory controller 32nm processor. In Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC’11). 264--266.Google ScholarCross Ref
Ziming Zhong, Vladimir Rychkov, and Alexey Lastovetsky. 2012. Data partitioning on heterogeneous multicore and multi-GPU systems using functional performance models of data-parallel applications. In Proceedings of the International Conference on Cluster Computing. 191--199. Google ScholarDigital Library
V. Zyuban, S. A. Taylor, B. Christensen, A. R. Hall, C. J. Gonzalez, J. Friedrich, F. Clougherty, J. Tetzloff, and R. Rao. 2013. IBM POWER7+ design for higher frequency at fixed power. IBM Journal of Research and Development 57, 6 (2013), 1:1--1:18. Google ScholarDigital Library

Published in
ACM Computing Surveys Volume 47, Issue 4
July 2015
573 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/2775083
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering/University of Florida/Gainesville, FL
Issue’s Table of Contents
Copyright © 2015 Public Domain
This paper is authored by an employee(s) of the United States Government and is in the public domain. Non-exclusive copying or redistribution is allowed, provided that the article citation is given and the authors and agency are clearly identified as its source.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 July 2015
- Accepted: 1 May 2015
- Revised: 1 March 2015
- Received: 1 August 2014
Author Tags
CPU-GPU heterogeneous/hybrid/collaborative computing
dynamic/static load balancing
fused CPU-GPU chip
pipelining
programming frameworks
workload division/partitioning
Qualifiers
- survey
- Research
- Refereed