DOI: 10.1145/2925426.2926271

CuMAS: Data Transfer Aware Multi-Application Scheduling for Shared GPUs

Published: 01 June 2016

Abstract

Recent generations of GPUs and their corresponding APIs provide means for sharing compute resources among multiple applications with greater efficiency than ever. This advance has enabled GPUs to act as shared computation resources in multi-user environments such as supercomputers and cloud platforms. Recent research has focused on maximizing the utilization of GPU computing resources by executing multiple GPU applications simultaneously (i.e., concurrent kernels) via temporal or spatial partitioning. However, these efforts have not considered maximizing the utilization of the PCI-e bus, which is equally important, as applications spend a considerable amount of time on data transfers.
In this paper, we present a complete execution framework, CuMAS, to enable "data-transfer aware" sharing of GPUs across multiple CUDA applications. We develop a novel host-side CUDA task scheduler and a corresponding runtime to capture multiple CUDA calls and re-order them for improved overall system utilization. Unlike preceding studies, the CuMAS scheduler treats the PCI-e up-link bus, the PCI-e down-link bus, and the GPU itself as separate resources. It schedules the corresponding phases of CUDA applications so that total resource utilization is maximized. We demonstrate that the data-transfer aware nature of the CuMAS framework improves the throughput of simultaneously executed CUDA applications by up to 44% on an NVIDIA K40c GPU, using applications from the CUDA SDK and the Rodinia benchmark suite.
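The scheduling problem the abstract describes resembles a three-stage flow shop: each application occupies the PCI-e up-link (host-to-device copy), then the GPU (kernel execution), then the down-link (device-to-host copy), and the order in which applications are served determines how well the three resources overlap. A minimal sketch of that effect is below; the per-application times and the brute-force search are illustrative assumptions, not CuMAS's actual scheduling algorithm:

```python
from itertools import permutations

def makespan(order, tasks):
    """Three-stage flow-shop makespan: each app uses the PCI-e up-link
    (H2D copy), then the GPU (kernel), then the down-link (D2H copy);
    each resource serves one app at a time, in the given order."""
    up = gpu = down = 0.0
    for i in order:
        h2d, kern, d2h = tasks[i]
        up = up + h2d                 # up-link busy until this copy-in ends
        gpu = max(gpu, up) + kern     # kernel starts only after its copy-in
        down = max(down, gpu) + d2h   # copy-out starts only after the kernel
    return down

# Hypothetical per-app (H2D, kernel, D2H) times in milliseconds.
tasks = [(8, 2, 1), (1, 6, 1), (2, 2, 6), (1, 1, 1)]

fifo = makespan(range(len(tasks)), tasks)
best = min(makespan(p, tasks) for p in permutations(range(len(tasks))))
print(f"FIFO arrival order: {fifo:.0f} ms, best reordering: {best:.0f} ms")
```

With these made-up times, serving applications in arrival order stalls the pipeline behind one large copy-in, while a reordering that front-loads short transfers keeps all three resources busy. The point of a data-transfer aware scheduler is to find such orderings online, from captured CUDA calls, rather than by exhaustive search.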



Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426

Publisher

Association for Computing Machinery

New York, NY, United States



Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • NSF

Conference

ICS '16

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Cited By

  • An Analysis of Collocation on GPUs for Deep Learning Training. Proceedings of the 4th Workshop on Machine Learning and Systems, 81-90 (Apr 2024). DOI: 10.1145/3642970.3655827
  • Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 243-256 (Mar 2024). DOI: 10.1145/3627535.3638502
  • D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs. IEEE Transactions on Cloud Computing, 12(4), 1344-1358 (Oct 2024). DOI: 10.1109/TCC.2024.3476210
  • Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels. ACM SIGMETRICS Performance Evaluation Review, 48(3), 81-88 (Mar 2021). DOI: 10.1145/3453953.3453972
  • Efficient execution of graph algorithms on CPU with SIMD extensions. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, 262-276 (Feb 2021). DOI: 10.1109/CGO51591.2021.9370326
  • Characterizing concurrency mechanisms for NVIDIA GPUs under deep learning workloads. Performance Evaluation, 151 (Nov 2021). DOI: 10.1016/j.peva.2021.102234
  • FusionCL: a machine-learning based approach for OpenCL kernel fusion to increase system performance. Computing (Jun 2021). DOI: 10.1007/s00607-021-00958-2
  • MEPHESTO. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, 413-425 (Sep 2020). DOI: 10.1145/3410463.3414671
  • Efficient Job Offloading in Heterogeneous Systems Through Hardware-Assisted Packet-Based Dispatching and User-Level Runtime Infrastructure. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(5), 1017-1030 (May 2020). DOI: 10.1109/TCAD.2019.2907912
  • A Deep Q-Learning Approach for GPU Task Scheduling. 2020 IEEE High Performance Extreme Computing Conference (HPEC), 1-7 (Sep 2020). DOI: 10.1109/HPEC43674.2020.9286238
