ABSTRACT
Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to respond to fluctuations in device performance that occur at run time. We present a novel dynamic approach to work partitioning that requires no offline training and responds automatically to performance variability to provide consistently good performance. Using six diverse OpenCL™ applications, we demonstrate the effectiveness of our approach in scenarios both with and without run-time performance variability, as well as in more extreme scenarios in which one device is non-functional.
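The abstract describes an online, measurement-driven partitioning scheme: work is handed out in chunks, and each device's share adapts to its observed speed rather than to an offline profile. Below is a minimal illustrative sketch of that general idea, not the paper's exact algorithm. All names (`Device`, `process_chunk`, the target chunk time, the simulated device speeds) are assumptions made for the example; a real implementation would launch OpenCL kernels instead of the simulated work loop.

```cpp
// Sketch of dynamic, self-adapting work partitioning across heterogeneous devices.
// Each worker grabs chunks from a shared counter and resizes its chunk based on
// the throughput it actually measures, so no offline training is needed and a
// device that slows down (or stops responding quickly) naturally receives less work.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct Device {
    const char* name;
    double work_per_ms;   // simulated processing speed; may fluctuate at run time
};

// Simulate executing `count` work items on a device; returns elapsed milliseconds.
// In a real system this would enqueue an OpenCL kernel over the given index range
// and time its completion.
static double process_chunk(const Device& d, std::size_t count) {
    double ms = count / d.work_per_ms;
    std::this_thread::sleep_for(std::chrono::duration<double, std::milli>(ms));
    return ms;
}

int main() {
    const std::size_t total_items = 1'000'000;
    const std::size_t min_chunk   = 10'000;
    const double      target_ms   = 20.0;   // desired time per chunk (assumed tuning knob)

    std::vector<Device> devices = { {"CPU", 50.0}, {"GPU", 400.0} };
    std::atomic<std::size_t> next_item{0};

    std::vector<std::thread> workers;
    for (const Device& dev : devices) {
        workers.emplace_back([&, dev]() {
            std::size_t chunk = min_chunk;   // start small, then adapt
            std::size_t done  = 0;
            for (;;) {
                std::size_t start = next_item.fetch_add(chunk);
                if (start >= total_items) break;
                std::size_t count = std::min(chunk, total_items - start);
                double ms = process_chunk(dev, count);
                done += count;
                // Resize the next chunk from the *observed* rate, so partitioning
                // tracks run-time performance variability automatically.
                double rate = count / std::max(ms, 1e-6);   // items per ms
                chunk = std::max<std::size_t>(
                    min_chunk, static_cast<std::size_t>(rate * target_ms));
            }
            std::printf("%s processed %zu items\n", dev.name, done);
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```

Because every device pulls work on demand, a device that is busy, throttled, or non-functional simply claims fewer (or no) chunks, which mirrors the failure scenarios the abstract mentions.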