
A Survey of CPU-GPU Heterogeneous Computing Techniques

Published: 21 July 2015
Abstract

As both CPUs and GPUs become employed in a wide range of applications, it has been acknowledged that each of these Processing Units (PUs) has its own unique features and strengths, and hence CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this article, we survey Heterogeneous Computing Techniques (HCTs), such as workload partitioning, that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency. We review heterogeneous computing approaches at the runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating Heterogeneous Computing Systems (HCSs). We believe that this article will provide researchers with insights into the workings and scope of applications of HCTs and motivate them to further harness the computational power of CPUs and GPUs to achieve the goal of exascale performance.

  111. Sparsh Mittal and Jeffrey S. Vetter. 2015. A survey of methods for analyzing and improving GPU energy efficiency. ACM Computing Surveys 47, 2, Article 19 (2015). Google ScholarGoogle ScholarDigital LibraryDigital Library
  112. Timothy Prickett Morgan. 2014. Oracle Cranks up the Cores to 32 with Sparc M7 Chip. Retrieved from http://www.enterprisetech.com/2014/08/13/oracle-cranks-cores-32-sparc-m7-chip/.Google ScholarGoogle Scholar
  113. Lluis-Miquel Munguia, David A. Bader, and Eduard Ayguade. 2012. Task-based parallel breadth-first search in heterogeneous environments. In Proceedings of the 19th International Conference on High Performance Computing (HiPC’12). 1--10.Google ScholarGoogle ScholarCross RefCross Ref
  114. Jun-ichi Muramatsu, Takeshi Fukaya, Shao-Liang Zhang, Kinji Kimura, and Yusaku Yamamoto. 2011. Acceleration of Hessenberg reduction for nonsymmetric eigenvalue problems in a hybrid CPU-GPU computing environment. International Journal of Networking and Computing 1, 2 (2011).Google ScholarGoogle Scholar
  115. Alin Muraraşu, Josef Weidendorfer, and Arndt Bode. 2012. Workload balancing on heterogeneous systems: A case study of sparse grid interpolation. In Euro-Par 2011: Parallel Processing Workshops, Lecture Notes in Computer Science, Volume 7156. Springer, Berlin, 345--354. Google ScholarGoogle ScholarDigital LibraryDigital Library
  116. Naohito Nakasato, Go Ogiya, Yohei Miki, Masao Mori, and Ken’ichi Nomoto. 2012. Astrophysical particle simulations on heterogeneous CPU-GPU systems. In arXiv preprint arXiv:1206.1199.Google ScholarGoogle Scholar
  117. Andrew Nere, Sean Franey, Atif Hashmi, and Mikko Lipasti. 2012. Simulating cortical networks on heterogeneous multi-GPU systems. J. Parallel and Distrib. Comput. 43, 7 (July 2012), 953--971.Google ScholarGoogle Scholar
  118. Rohit Nigam and P. J. Narayanan. 2012. Hybrid ray tracing and path tracing of Bezier surfaces using a mixed hierarchy. In Proceedings of the 8th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP’12). Article 35, 35:1--35:8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  119. NVIDIA. 2015. http://www.geforce.com/hardware/desktop-gpus.Google ScholarGoogle Scholar
  120. Tetsuya Odajima, Taisuke Boku, Toshihiro Hanawa, Jinpil Lee, and Mitsuhisa Sato. 2012. GPU/CPU work sharing with parallel language XcalableMP-dev for parallelized accelerated computing. In Proceedings of the 41st International Conference on Parallel Processing Workshops (ICPPW’12). 97--106. Google ScholarGoogle ScholarDigital LibraryDigital Library
  121. Yasuhito Ogata, Toshio Endo, Naoya Maruyama, and Satoshi Matsuoka. 2008. An efficient, model-based CPU-GPU heterogeneous FFT library. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS’08). 1--10.Google ScholarGoogle Scholar
  122. Satoshi Ohshima, Kenji Kise, Takahiro Katagiri, and Toshitsugu Yuba. 2007. Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment. In Proceedings of the 7th International Conference on High Performance Computing for Computational Science-VECPAR 2006. Springer, 305--318. Google ScholarGoogle ScholarDigital LibraryDigital Library
  123. OpenACC Standard. 2014. Homepage. Retrieved from http://www.openacc-standard.org/.Google ScholarGoogle Scholar
  124. OpenMP 4.0. 2014. Homepage. Retrieved from http://openmp.org/wp/2013/07/openmp-40/.Google ScholarGoogle Scholar
  125. Edson Luiz Padoin, Laércio Lima Pilla, Francieli Zanon Boito, Rodrigo Virote Kassick, Pedro Velho, and Philippe O. A. Navaux. 2013. Evaluating application performance and energy consumption on hybrid CPU+ GPU architecture. Cluster Computing 16, 3 (Sept. 2013), 511--525. Google ScholarGoogle ScholarDigital LibraryDigital Library
  126. Sreepathi Pai, Ramaswamy Govindarajan, and Matthew Jacob Thazhuthaveetil. 2010. PLASMA: Portable programming for SIMD heterogeneous accelerators. In Proceedings of the Workshop on Language, Compiler, and Architecture Support for GPGPU.Google ScholarGoogle Scholar
  127. Anthony Pajot, Loïc Barthe, Mathias Paulin, and Pierre Poulin. 2011. Combinatorial bidirectional path-tracing for efficient hybrid CPU/GPU rendering. In Computer Graphics Forum 30, 2 (April 2011), 315--324.Google ScholarGoogle ScholarCross RefCross Ref
  128. Prasanna Pandit and R. Govindarajan. 2014. Fluidic kernels: Cooperative execution of OpenCL programs on multiple heterogeneous devices. In Proceedings of the International Symposium on Code Generation and Optimization (CGO). Article 273, 273:273--273:283. Google ScholarGoogle ScholarDigital LibraryDigital Library
  129. Jairo Panetta, Thiago Teixeira, Paulo R. P. de Souza Filho, Carlos A. da Cunha Filho, David Sotelo, Fernando M. Roxo da Motta, Silvio Sinedino Pinheiro, Ivan Pedrosa Junior, Andre L. Romanelli Rosa, Luiz R. Monnerat, Leandro T. Carneiro, and Carlos H. B. de Albrecht. 2009. Accelerating Kirchhoff migration by CPU and GPU cooperation. In Proceedings of the 21st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’09). 26--32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  130. Manolis Papadrakakis, George Stavroulakis, and Alexander Karatarakis. 2011. A new era in scientific computing: Domain decomposition methods in hybrid CPU--GPU architectures. Computer Methods in Applied Mechanics and Engineering 200, 13 (2011), 1490--1508.Google ScholarGoogle ScholarCross RefCross Ref
  131. Song Jun Park, James A. Ross, Dale R. Shires, David A. Richie, Brian J. Henz, and Lam H. Nguyen. 2011. Hybrid core acceleration of UWB SIRE radar signal processing. IEEE Transactions on Parallel and Distributed Systems 22, 1 (2011), 46--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  132. Phitchaya Mangpo Phothilimthana, Jason Ansel, Jonathan Ragan-Kelley, and Saman Amarasinghe. 2013. Portable performance on heterogeneous architectures. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems. 431--444. Google ScholarGoogle ScholarDigital LibraryDigital Library
  133. Jacques A. Pienaar, Srimat Chakradhar, and Anand Raghunathan. 2012. Automatic generation of software pipelines for heterogeneous parallel systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12). Article 24. 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  134. Jacques A. Pienaar, Anand Raghunathan, and Srimat Chakradhar. 2011. MDR: Performance model driven runtime for heterogeneous parallel platforms. In Proceedings of the ACM International Conference on Supercomputing. ACM, New York, NY, 225--234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  135. Holger Pirk, Thibault Sellam, Stefan Manegold, and Martin Kersten. 2012. X-device query processing by bitwise distribution. In 8th International Workshop on Data Management on New Hardware. ACM, New York, NY, 48--54. Google ScholarGoogle ScholarDigital LibraryDigital Library
  136. Usman Pirzada. 2015. Nvidia Geforce GTX TITAN X Unveiled - GM200 ‘Big Daddy Maxwell’, 12GB VRam and 8 Billion Transistors. Retrieved from wccftech.com/nvidia-gtx-titan-x-revealed-gdc-2015/.Google ScholarGoogle Scholar
  137. Matthew Poremba, Sparsh Mittal, Dong Li, Jeffrey Vetter, and Yuan Xie. 2015. DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches. In DATE. 1543--1546. Google ScholarGoogle ScholarDigital LibraryDigital Library
  138. Ashwin Prasad, Jayvant Anantpur, and R. Govindarajan. 2011. Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors. In ACM Sigplan Notices 46, 6 (June 2011), 152--163. Google ScholarGoogle ScholarDigital LibraryDigital Library
  139. Abtin Rahimian, Ilya Lashuk, Shravan Veerapaneni, Aparna Chandramowlishwaran, Dhairya Malhotra, Logan Moon, Rahul Sampath, Aashay Shringarpure, Jeffrey Vetter, Richard Vuduc, Denis Zorin, and George Biros. 2010. Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10). 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  140. Vignesh T. Ravi and Gagan Agrawal. 2011. A dynamic scheduling framework for emerging heterogeneous systems. In Proceedings of the 18th International Conference on High Performance Computing (HiPC’11). 1--10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  141. Vignesh T. Ravi, Wenjing Ma, David Chiu, and Gagan Agrawal. 2010. Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. In Proceedings of the 24th ACM International Conference on Supercomputing. ACM, New York, NY, 137--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  142. Vignesh T. Ravi, Wenjing Ma, David Chiu, and Gagan Agrawal. 2012. Compiler and runtime support for enabling reduction computations on heterogeneous systems. Concurrency and Computation: Practice and Experience 24, 5 (2012), 463--480. Google ScholarGoogle ScholarDigital LibraryDigital Library
  143. Mahsan Rofouei, Thanos Stathopoulos, Sebi Ryffel, William Kaiser, and Majid Sarrafzadeh. 2008. Energy-aware high performance computing with graphic processing units. In Proceedings of the 2008 Conference on Power Aware Computing and Systems. Google ScholarGoogle ScholarDigital LibraryDigital Library
  144. Bratin Saha, Xiaocheng Zhou, Hu Chen, Ying Gao, Shoumeng Yan, Mohan Rajagopalan, Jesse Fang, Peinan Zhang, Ronny Ronen, and Avi Mendelson. 2009. Programming model for a heterogeneous x86 platform. In ACM Sigplan Notices 44, 6 (2009), 431--440. Google ScholarGoogle ScholarDigital LibraryDigital Library
  145. Thomas R. W. Scogland, Wu-chun Feng, Barry Rountree, and Bronis R. de Supinski. 2014. CoreTSAR: Adaptive worksharing for heterogeneous systems. In Supercomputing, Lecture Notes in Computer Science, Volume 8488. Springer International Publishing, 172--186. Google ScholarGoogle ScholarDigital LibraryDigital Library
  146. Thomas R. W. Scogland, Barry Rountree, Wu-chun Feng, and Bronis R. de Supinski. 2012. Heterogeneous task scheduling for accelerated OpenMP. In Proceedings of the IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS’’12). 144--155. Google ScholarGoogle ScholarDigital LibraryDigital Library
  147. Jie Shen, Ana Lucia Varbanescu, Henk Sips, Michael Arntzen, and Dick G. Simons. 2013. Glinda: A framework for accelerating imbalanced applications on heterogeneous platforms. In Proceedings of the ACM International Conference on Computing Frontiers. ACM, New York, NY, Article 14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  148. Wenfeng Shen, Daming Wei, Weimin Xu, Xin Zhu, and Shizhong Yuan. 2010. Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU. Computer Methods and Programs in Biomedicine 100, 1 (2010), 87--96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  149. Takashi Shimokawabe, Takayuki Aoki, Tomohiro Takaki, Akinori Yamanaka, Akira Nukada, Toshio Endo, Naoya Maruyama, and Satoshi Matsuoka. 2011. Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11). ACM, New York, NY. Article 3. Google ScholarGoogle ScholarDigital LibraryDigital Library
  150. Koichi Shirahata, Hitoshi Sato, and Satoshi Matsuoka. 2010. Hybrid map task scheduling for GPU-based heterogeneous clusters. In Proceedings of the IEEE 2nd International Conference on Cloud Computing Technology and Science (CloudCom’10). 733--740. Google ScholarGoogle ScholarDigital LibraryDigital Library
  151. Sambit K. Shukla and Laxmi N. Bhuyan. 2013. A hybrid shared memory heterogeneous execution platform for PCIe-based GPGPUs. In 20th International Conference on High Performance Computing. 343--352.Google ScholarGoogle Scholar
  152. Jakob Siegel, Oreste Villa, Sriram Krishnamoorthy, Antonino Tumeo, and Xiaoming Li. 2010. Efficient sparse matrix-matrix multiplication on heterogeneous high performance systems. In 2010 IEEE International Conference on Cluster Computing Workshops. 1--8.Google ScholarGoogle ScholarCross RefCross Ref
  153. Mark Silberstein and Naoya Maruyama. 2011. An exact algorithm for energy-efficient acceleration of task trees on CPU/GPU architectures. In Proceedings of the 4th Annual International Conference on Systems and Storage (SYSTOR’11). ACM, New York, NY, Article 7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  154. Jaideeep Singh and Ipseeta Aruni. 2011. Accelerating smith-waterman on heterogeneous cpu-gpu systems. In Proceedings of the 5th International Conference on Bioinformatics and Biomedical Engineering (iCBBE’11). 1--4.Google ScholarGoogle ScholarCross RefCross Ref
  155. Hayden K. H. So, Junying Chen, Billy YS Yiu, and Alfred C. H. Yu. 2011. Medical ultrasound imaging: To GPU or not to GPU? IEEE Micro 31, 5 (2011), 54--65. Google ScholarGoogle ScholarDigital LibraryDigital Library
  156. F. Sottile, C. Roedl, V. Slavnic, P. Jovanovic, D. Stankovic, P. Kestener, and Franck Houssen. 2013. GPU implementation of the DP code. Partnership for Advanced Computing in Europe.Google ScholarGoogle Scholar
  157. Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter. 2010. Maestro: Data orchestration and tuning for OpenCL devices. In Euro-Par 2010-Parallel Processing, Lecture Notes in Computer Science, Volume 6272. Springer, Berlin, 275--286. Google ScholarGoogle ScholarDigital LibraryDigital Library
  158. Kyle L. Spafford, Jeremy S. Meredith, Seyong Lee, Dong Li, Philip C. Roth, and Jeffrey S. Vetter. 2012. The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. In Proceedings of the 9th Conference on Computing Frontiers. 103--112. Google ScholarGoogle ScholarDigital LibraryDigital Library
  159. Tomasz P. Stefanski. 2013. Implementation of FDTD-compatible Green’s function on heterogeneous CPU-GPU parallel processing system. Progress in Electromagnetics Research 135 (2013), 297--316.Google ScholarGoogle ScholarCross RefCross Ref
  160. John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12, 3 (2010), 66--73. Google ScholarGoogle ScholarDigital LibraryDigital Library
  161. Przemysław Stpiczynski. 2011. Solving linear recurrences on hybrid GPU accelerated manycore systems. In Proceedings of the Federated Conference on Computer Science and Information Systems. 465--470.Google ScholarGoogle Scholar
  162. Przemysław Stpiczynski and Joanna Potiopa. 2010. Solving a kind of BVP for ODEs on heterogeneous CPU+ CUDA-enabled GPU systems. In Proceedings of the International Multiconference on Computer Science and Information Technology (IMCSIT’10). IEEE, 349--353.Google ScholarGoogle ScholarCross RefCross Ref
  163. John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and W. M. W. Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report. Center for Reliable and High-Performance Computing.Google ScholarGoogle Scholar
  164. Yu Su, Ding Ye, and Jingling Xue. 2013. Accelerating inclusion-based pointer analysis on heterogeneous CPU-GPU systems. In Proceedings of the 20th International Conference on High Performance Computing. 149--158.Google ScholarGoogle ScholarCross RefCross Ref
  165. Enqiang Sun, Dana Schaa, Richard Bagley, Norman Rubin, and David Kaeli. 2012. Enabling task-level scheduling on heterogeneous platforms. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units. ACM, New York, NY, 84--93. Google ScholarGoogle ScholarDigital LibraryDigital Library
  166. Hiroyuki Takizawa, Katsuto Sato, and Hiroaki Kobayashi. 2008. SPRAT: Runtime processor selection for energy-aware computing. In Proceedings of the IEEE International Conference on Cluster Computing. 386--393.Google ScholarGoogle ScholarCross RefCross Ref
  167. Yu Shyang Tan, Bu-Sung Lee, Bingsheng He, and Roy H. Campbell. 2012. A map-reduce based framework for heterogeneous processing element cluster environments. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). 57--64. Google ScholarGoogle ScholarDigital LibraryDigital Library
  168. George Teodoro, Tahsin M. Kurc, Tony Pan, Lee A. D. Cooper, Jun Kong, Patrick Widener, and Joel H. Saltz. 2012. Accelerating large scale image analyses on parallel, CPU-GPU equipped systems. In Proceedings of the IEEE 26th International Parallel & Distributed Processing Symposium. 1093--1104. Google ScholarGoogle ScholarDigital LibraryDigital Library
  169. George Teodoro, Tony Pan, Tahsin M. Kurc, Jun Kong, Lee A. D. Cooper, and Joel H. Saltz. 2013. Efficient irregular wavefront propagation algorithms on hybrid CPU-GPU machines. Parallel Comput. 39, 4--5 (2013), 189--211.Google ScholarGoogle ScholarCross RefCross Ref
  170. George Teodoro, Rafael Sachetto, Olcay Sertel, Metin N. Gurcan, W. Meira, Umit Catalyurek, and Renato Ferreira. 2009. Coordinating the use of GPU and CPU for improving performance of compute intensive applications. In International Conference on Cluster Computing and Workshops. 1--10.Google ScholarGoogle ScholarCross RefCross Ref
  171. Pablo Toharia, Oscar D. Robles, Ricardo SuáRez, Jose Luis Bosque, and Luis Pastor. 2012. Shot boundary detection using Zernike moments in multi-GPU multi-CPU architectures. Journal of Parallel and Distributed Computing 72, 9 (Sept. 2012), 1127--1133. Google ScholarGoogle ScholarDigital LibraryDigital Library
  172. Stanimire Tomov, Rajib Nath, and Jack Dongarra. 2010. Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing. Parallel Computing 36, 12 (2010), 645--654. Google ScholarGoogle ScholarDigital LibraryDigital Library
  173. Top500. 2014. Top500 Supercomputers. www.top500.org.Google ScholarGoogle Scholar
  174. Kuen Hung Tsoi and Wayne Luk. 2010. Axel: A heterogeneous cluster with FPGAs and GPUs. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, New York, NY, 115--124. Google ScholarGoogle ScholarDigital LibraryDigital Library
  175. Fernando Tsuda and Ricardo Nakamura. 2011. A technique for collision detection and 3D interaction based on parallel GPU and CPU processing. In Proceedings of the 2011 Brazilian Symposium on Games and Digital Entertainment (SBGAMES’11). 36--42. Google ScholarGoogle ScholarDigital LibraryDigital Library
  176. Abhishek Udupa, R. Govindarajan, and Matthew J. Thazhuthaveetil. 2009. Synergistic execution of stream programs on multicores with accelerators. ACM Sigplan Notices 44, 7 (2009), 99--108. Google ScholarGoogle ScholarDigital LibraryDigital Library
  177. Yash Ukidave, Amir Kavyan Ziabari, Perhaad Mistry, Gunar Schirner, and David Kaeli. 2013. Quantifying the energy efficiency of FFT on heterogeneous platforms. In Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’13). 235--244.Google ScholarGoogle ScholarCross RefCross Ref
  178. Ronald Veldema, Thorsten Blass, and Michael Philippsen. 2011. Enabling multiple accelerator acceleration for Java/OpenMP. In Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism (HotPar). Google ScholarGoogle ScholarDigital LibraryDigital Library
  179. Sundaresan Venkatasubramanian and Richard W. Vuduc. 2009. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd International Conference on Supercomputing (ICS’09). ACM, New York, NY, 244--255. Google ScholarGoogle ScholarDigital LibraryDigital Library
  180. Uri Verner, Assaf Schuster, and Mark Silberstein. 2011. Processing data streams with hard real-time constraints on heterogeneous systems. In Proceedings of the international conference on Supercomputing (ICS’11). ACM, New York, NY, 120--129. Google ScholarGoogle ScholarDigital LibraryDigital Library
  181. Jeffrey S. Vetter and Sparsh Mittal. 2015. Opportunities for nonvolatile memory systems in extreme-scale high performance computing. Computing in Science and Engineering 17, 2 (2015), 73--82.Google ScholarGoogle ScholarDigital LibraryDigital Library
  182. Christof Vömel, Stanimire Tomov, and Jack Dongarra. 2012. Divide and conquer on hybrid GPU-accelerated multicore systems. SIAM Journal on Scientific Computing 34, 2 (2012), C70--C82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  183. Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. 2010. On the limits of GPU acceleration. In Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism. Google ScholarGoogle ScholarDigital LibraryDigital Library
  184. Guibin Wang and Wei Song. 2011. Communication-aware task partition and voltage scaling for energy minimization on heterogeneous parallel systems. In Proceedings of the 12th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT’11). 327--333. Google ScholarGoogle ScholarDigital LibraryDigital Library
  185. Yueqing Wang, Yong Dou, Song Guo, Yuanwu Lei, and Dan Zou. 2014. CPU--GPU hybrid parallel strategy for cosmological simulations. Concurrency and Computation: Practice and Experience 26, 3 (March 2014), 748--765. Google ScholarGoogle ScholarDigital LibraryDigital Library
  186. Yu Wang, Haixiao Du, Mingrui Xia, Ling Ren, Mo Xu, Teng Xie, Gaolang Gong, Ningyi Xu, Huazhong Yang, and Yong He. 2013a. A hybrid CPU-GPU accelerated framework for fast mapping of high-resolution human brain connectome. PloS ONE 8, 5 (2013), e62789.Google ScholarGoogle ScholarCross RefCross Ref
  187. Zhenning Wang, Long Zheng, Quan Chen, and Minyi Guo. 2013b. CAP: Co-scheduling based on asymptotic profiling in CPU+ GPU hybrid systems. In International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, New York, NY, 107--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  188. Mei Wen, Huayou Su, Wenjie Wei, Nan Wu, Xing Cai, and Chunyuan Zhang. 2012. Using 1000+ GPUs and 10000+ CPUs for sedimentary basin simulations. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER). 27--35. Google ScholarGoogle ScholarDigital LibraryDigital Library
  189. Dieter Wendel, Ronald Kalla, Joshua Friedrich, James Kahle, Jens Leenstra, Cedric Lichtenau, Balaram Sinharoy, William Starke, and Victor Zyuban. 2010. The POWER7 processor SoC. In Proceedings of the International Conference on IC Design and Technology (ICICDT’10). 71--73.Google ScholarGoogle Scholar
  190. Qiang Wu, Canqun Yang, Feng Wang, and Jingling Xue. 2012. A fast parallel implementation of molecular dynamics with the Morse potential on a heterogeneous petascale supercomputer. In International Parallel and Distributed Processing Symposium Workshops & PhD Forum. 140--149. Google ScholarGoogle ScholarDigital LibraryDigital Library
  191. Tin-Yu Wu, Wei-Tsong Lee, Chien-Yu Duan, and Tain-Wen Suen. 2013. Enhancing cloud-based servers by GPU/CPU virtualization management. In Advances in Intelligent Systems and Applications-Volume 2. Springer, 185--194.Google ScholarGoogle Scholar
  192. Shucai Xiao, Heshan Lin, and Wu-chun Feng. 2011. Accelerating protein sequence search in a heterogeneous computing system. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium. 1212--1222. Google ScholarGoogle ScholarDigital LibraryDigital Library
  193. Ming Xu, Feiguo Chen, Xinhua Liu, Wei Ge, and Jinghai Li. 2012. Discrete particle simulation of gas-solid two-phase flows with multi-scale CPU-GPU hybrid computation. Chemical Engineering Journal 207-208 (Oct. 2012), 746--757.Google ScholarGoogle Scholar
  194. Canqun Yang, Feng Wang, Yunfei Du, Juan Chen, Jie Liu, Huizhan Yi, and Kai Lu. 2010. Adaptive optimization for petascale heterogeneous CPU/GPU computing. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER). 19--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  195. Chao Yang, Wei Xue, Haohuan Fu, Lin Gan, Linfeng Li, Yangtong Xu, Yutong Lu, Jiachang Sun, Guangwen Yang, and Weimin Zheng. 2013. A peta-scalable CPU-GPU algorithm for global atmospheric simulations. In Proceedings of the 18th ACM/SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’13). ACM, New York, NY, 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  196. Ping Yao, Hong An, Mu Xu, Gu Liu, Xiaoqiang Li, Yaobin Wang, and Wenting Han. 2010. CuHMMer: A load-balanced CPU-GPU cooperative bioinformatics application. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS’10). IEEE, 24--30.Google ScholarGoogle Scholar
  197. Marcelo Yuffe, Ernest Knoll, Moty Mehalel, Joseph Shor, and Tsvika Kurts. 2011. A fully integrated multi-CPU, GPU and memory controller 32nm processor. In Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC’11). 264--266.Google ScholarGoogle ScholarCross RefCross Ref
  198. Ziming Zhong, Vladimir Rychkov, and Alexey Lastovetsky. 2012. Data partitioning on heterogeneous multicore and multi-GPU systems using functional performance models of data-parallel applications. In Proceedings of the International Conference on Cluster Computing. 191--199. Google ScholarGoogle ScholarDigital LibraryDigital Library
  199. V. Zyuban, S. A. Taylor, B. Christensen, A. R. Hall, C. J. Gonzalez, J. Friedrich, F. Clougherty, J. Tetzloff, and R. Rao. 2013. IBM POWER7+ design for higher frequency at fixed power. IBM Journal of Research and Development 57, 6 (2013), 1:1--1:18. Google ScholarGoogle ScholarDigital LibraryDigital Library
