ABSTRACT
Graphics processing units (GPUs) have become an attractive platform for general-purpose computing (GPGPU) in various domains. Making the GPU a time-multiplexed resource is key to consolidating GPGPU applications (apps) on multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation. Such highly functional GPGPU apps, referred to as GPU eaters, can easily monopolize a shared GPU and starve collocated GPGPU apps. This paper presents GLoop, a software runtime that enables us to consolidate GPGPU apps, including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while being proportionally scheduled on a shared GPU in an isolated manner. We implemented a prototype of GLoop and ported eight GPU eaters to it. The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps according to its scheduling policy and isolates resources among them.
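The core idea of the abstract — an event-driven runtime that proportionally schedules consolidated apps so that no single "eater" can monopolize the shared resource — can be sketched in miniature. The following Python code is a hypothetical illustration only (it is not GLoop's actual API, which runs on the GPU): an event loop drains per-app callback queues in weighted rounds, so a demanding app with a low weight cannot starve a collocated app. The class and method names are invented for this sketch.

```python
# Hypothetical sketch of proportional, event-driven scheduling
# (illustrative only; GLoop's real runtime and API differ).
from collections import deque

class WeightedEventLoop:
    def __init__(self):
        self.queues = {}   # app name -> deque of pending callbacks
        self.weights = {}  # app name -> callbacks allowed per round

    def register(self, app, weight):
        self.queues[app] = deque()
        self.weights[app] = weight

    def post(self, app, callback):
        self.queues[app].append(callback)

    def run(self):
        trace = []  # order in which apps' callbacks actually ran
        while any(self.queues.values()):
            for app, q in self.queues.items():
                # Each round, an app may run at most `weight` callbacks,
                # so a "GPU eater" cannot monopolize the shared loop.
                for _ in range(self.weights[app]):
                    if not q:
                        break
                    q.popleft()()
                    trace.append(app)
        return trace

loop = WeightedEventLoop()
loop.register("eater", 1)   # demanding app, low share
loop.register("light", 2)   # lighter app, twice the share
for _ in range(4):
    loop.post("eater", lambda: None)
for _ in range(4):
    loop.post("light", lambda: None)
trace = loop.run()
print(trace[:3])  # first round: eater once, light twice
```

The weights here play the role of the scheduling policy mentioned in the abstract: changing them changes each app's proportional share of the loop without modifying the apps themselves.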
GLoop: an event-driven runtime for consolidating GPGPU applications