ABSTRACT
Graphics processing units (GPUs) have become an attractive platform for general-purpose computing (GPGPU) in various domains. Making the GPU a time-multiplexed resource is key to consolidating GPGPU applications (apps) on multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation. Such highly functional GPGPU apps, referred to as GPU eaters, can easily monopolize a shared GPU and starve collocated GPGPU apps. This paper presents GLoop, a software runtime that enables us to consolidate GPGPU apps, including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while being proportionally scheduled on a shared GPU in an isolated manner. We implemented a prototype of GLoop and ported eight GPU eaters to it. The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps according to its scheduling policy and isolates resources among them.
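The core idea of the abstract — an event-driven runtime that proportionally schedules consolidated apps so that no single "eater" can monopolize the shared resource — can be sketched in miniature. The following Python code is a hypothetical illustration only (it is not GLoop's actual API, which runs on the GPU): an event loop drains per-app callback queues in weighted rounds, so a demanding app with a low weight cannot starve a collocated app. The class and method names are invented for this sketch.

```python
# Hypothetical sketch of proportional, event-driven scheduling
# (illustrative only; GLoop's real runtime and API differ).
from collections import deque

class WeightedEventLoop:
    def __init__(self):
        self.queues = {}   # app name -> deque of pending callbacks
        self.weights = {}  # app name -> callbacks allowed per round

    def register(self, app, weight):
        self.queues[app] = deque()
        self.weights[app] = weight

    def post(self, app, callback):
        self.queues[app].append(callback)

    def run(self):
        trace = []  # order in which apps' callbacks actually ran
        while any(self.queues.values()):
            for app, q in self.queues.items():
                # Each round, an app may run at most `weight` callbacks,
                # so a "GPU eater" cannot monopolize the shared loop.
                for _ in range(self.weights[app]):
                    if not q:
                        break
                    q.popleft()()
                    trace.append(app)
        return trace

loop = WeightedEventLoop()
loop.register("eater", 1)   # demanding app, low share
loop.register("light", 2)   # lighter app, twice the share
for _ in range(4):
    loop.post("eater", lambda: None)
for _ in range(4):
    loop.post("light", lambda: None)
trace = loop.run()
print(trace[:3])  # first round: eater once, light twice
```

The weights here play the role of the scheduling policy mentioned in the abstract: changing them changes each app's proportional share of the loop without modifying the apps themselves.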
GLoop: an event-driven runtime for consolidating GPGPU applications