DOI: 10.1145/2925426.2926271

CuMAS: Data Transfer Aware Multi-Application Scheduling for Shared GPUs

Published: 01 June 2016

Abstract

Recent generations of GPUs and their corresponding APIs provide means for sharing compute resources among multiple applications with greater efficiency than ever. This advance has enabled GPUs to act as shared computation resources in multi-user environments such as supercomputers and cloud platforms. Recent research has focused on maximizing the utilization of GPU computing resources by executing multiple GPU applications simultaneously (i.e., concurrent kernels) via temporal or spatial partitioning. However, these efforts have not considered maximizing the utilization of the PCI-e bus, which is equally important, as applications spend a considerable amount of time on data transfers.
In this paper, we present a complete execution framework, CuMAS, to enable "data-transfer aware" sharing of GPUs across multiple CUDA applications. We develop a novel host-side CUDA task scheduler and a corresponding runtime to capture multiple CUDA calls and re-order them for improved overall system utilization. Unlike preceding studies, the CuMAS scheduler treats the PCI-e up-link bus, the PCI-e down-link bus, and the GPU itself as separate resources. It schedules the corresponding phases of CUDA applications so that total resource utilization is maximized. We demonstrate that the data-transfer aware nature of the CuMAS framework improves the throughput of simultaneously executed CUDA applications by up to 44% on an NVIDIA K40c GPU, using applications from the CUDA SDK and the Rodinia benchmark suite.
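The scheduling problem the abstract describes resembles a three-stage flow shop: each application occupies the PCI-e up-link (host-to-device copy), then the GPU (kernel execution), then the down-link (device-to-host copy), and the order in which applications are served determines how well the three resources overlap. A minimal sketch of that effect is below; the per-application times and the brute-force search are illustrative assumptions, not CuMAS's actual scheduling algorithm:

```python
from itertools import permutations

def makespan(order, tasks):
    """Three-stage flow-shop makespan: each app uses the PCI-e up-link
    (H2D copy), then the GPU (kernel), then the down-link (D2H copy);
    each resource serves one app at a time, in the given order."""
    up = gpu = down = 0.0
    for i in order:
        h2d, kern, d2h = tasks[i]
        up = up + h2d                 # up-link busy until this copy-in ends
        gpu = max(gpu, up) + kern     # kernel starts only after its copy-in
        down = max(down, gpu) + d2h   # copy-out starts only after the kernel
    return down

# Hypothetical per-app (H2D, kernel, D2H) times in milliseconds.
tasks = [(8, 2, 1), (1, 6, 1), (2, 2, 6), (1, 1, 1)]

fifo = makespan(range(len(tasks)), tasks)
best = min(makespan(p, tasks) for p in permutations(range(len(tasks))))
print(f"FIFO arrival order: {fifo:.0f} ms, best reordering: {best:.0f} ms")
```

With these made-up times, serving applications in arrival order stalls the pipeline behind one large copy-in, while a reordering that front-loads short transfers keeps all three resources busy. The point of a data-transfer aware scheduler is to find such orderings online, from captured CUDA calls, rather than by exhaustive search.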



Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426

Publisher

Association for Computing Machinery

New York, NY, United States



Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • NSF

Conference

ICS '16

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Cited By

  • An Analysis of Collocation on GPUs for Deep Learning Training. Proceedings of the 4th Workshop on Machine Learning and Systems, 81-90 (Apr 2024). DOI: 10.1145/3642970.3655827
  • Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 243-256 (Mar 2024). DOI: 10.1145/3627535.3638502
  • D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs. IEEE Transactions on Cloud Computing, 12(4), 1344-1358 (Oct 2024). DOI: 10.1109/TCC.2024.3476210
  • Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels. ACM SIGMETRICS Performance Evaluation Review, 48(3), 81-88 (Mar 2021). DOI: 10.1145/3453953.3453972
  • Efficient execution of graph algorithms on CPU with SIMD extensions. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, 262-276 (Feb 2021). DOI: 10.1109/CGO51591.2021.9370326
  • Characterizing concurrency mechanisms for NVIDIA GPUs under deep learning workloads. Performance Evaluation, 151 (Nov 2021). DOI: 10.1016/j.peva.2021.102234
  • FusionCL: a machine-learning based approach for OpenCL kernel fusion to increase system performance. Computing (Jun 2021). DOI: 10.1007/s00607-021-00958-2
  • MEPHESTO. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, 413-425 (Sep 2020). DOI: 10.1145/3410463.3414671
  • Efficient Job Offloading in Heterogeneous Systems Through Hardware-Assisted Packet-Based Dispatching and User-Level Runtime Infrastructure. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(5), 1017-1030 (May 2020). DOI: 10.1109/TCAD.2019.2907912
  • A Deep Q-Learning Approach for GPU Task Scheduling. 2020 IEEE High Performance Extreme Computing Conference (HPEC), 1-7 (Sep 2020). DOI: 10.1109/HPEC43674.2020.9286238
