GLoop: an event-driven runtime for consolidating GPGPU applications

Published: 24 September 2017

ABSTRACT

Graphics processing units (GPUs) have become an attractive platform for general-purpose computing (GPGPU) in various domains. Making GPUs a time-multiplexed resource is key to consolidating GPGPU applications (apps) on multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation. Such highly functional GPGPU apps, referred to as GPU eaters, can easily monopolize a shared GPU and starve collocated GPGPU apps. This paper presents GLoop, a software runtime that enables the consolidation of GPGPU apps, including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while being proportionally scheduled on a shared GPU in an isolated manner. We implemented a prototype of GLoop and ported eight GPU eaters to it. The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps according to its scheduling policy and isolates resources among them.
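The event-driven model described in the abstract can be pictured as follows. Note that this is only an illustrative sketch: GLoop itself is a GPU-side runtime with a CUDA-based API, and all names below (`App`, `post`, `run`, the `share`/`vtime` fields) are hypothetical, chosen to show how splitting long-running apps into short event handlers lets a runtime interleave them proportionally.

```python
from collections import deque

class App:
    """A consolidated app, modeled as a queue of short event handlers."""
    def __init__(self, name, share):
        self.name = name
        self.share = share      # scheduling weight (larger = more GPU time)
        self.vtime = 0.0        # weighted virtual time consumed so far
        self.events = deque()   # pending event handlers

    def post(self, handler):
        """Register a handler; real GLoop apps yield by posting the next event."""
        self.events.append(handler)

def run(apps, quantum=1.0):
    """Proportional-share event loop: always dispatch one handler from the
    ready app with the smallest virtual time, then charge it inversely to
    its share (a stride-scheduling flavor). Returns the dispatch order."""
    trace = []
    while any(a.events for a in apps):
        ready = [a for a in apps if a.events]
        app = min(ready, key=lambda a: a.vtime)
        handler = app.events.popleft()
        handler(app)                      # each handler runs to completion
        app.vtime += quantum / app.share  # cheap apps advance slowly
        trace.append(app.name)
    return trace
```

Because each handler is short and runs to completion, the loop regains control frequently, which is what prevents a monopolizing "GPU eater" from starving collocated apps: an app with share 2 simply gets dispatched about twice as often as one with share 1.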


Published in

SoCC '17: Proceedings of the 2017 Symposium on Cloud Computing
September 2017, 672 pages
ISBN: 9781450350280
DOI: 10.1145/3127479
Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 169 of 722 submissions (23%)
