DOI: 10.1145/3361525.3361538

FfDL: A Flexible Multi-tenant Deep Learning Platform

Published: 09 December 2019

ABSTRACT

Deep learning (DL) is becoming increasingly popular in many application domains and has made new applications involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, and similar areas both feasible and accurate. As a result, large-scale "on-premise" and "cloud-hosted" deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage, and execute DL training jobs at scale.

This paper describes the design and implementation of, and our experiences with, FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility, and efficiency. We examine FfDL qualitatively, through a retrospective look at the lessons learned from building, operating, and supporting it, and quantitatively, through a detailed empirical evaluation covering the overheads introduced by the platform for various DL models, the load and performance observed in a real case study within our organization, the frequency of various faults observed (including faults we did not anticipate), and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.
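To make concrete what "accepting, scheduling, and executing DL training jobs" looks like from a user's point of view, the sketch below submits a containerized training job to a multi-tenant platform of this kind over REST. The endpoint URL, job-spec fields, and header names are illustrative assumptions, not FfDL's documented API; the open-source FfDL repository defines its own manifest format and CLI.

```python
# Minimal sketch: submitting a DL training job to a hypothetical
# multi-tenant deep learning platform over REST. The URL, spec fields,
# and headers are assumptions for illustration, not FfDL's actual API.
import requests

job_spec = {
    "name": "resnet50-cifar10",
    "framework": {"name": "tensorflow", "version": "1.13"},
    "command": "python train.py --epochs 90",
    "resources": {"gpus": 2, "cpus": 8, "memory": "32Gi"},
    # Training data and results typically live in object storage,
    # mounted or streamed into the training containers.
    "data_store": {"type": "s3", "bucket": "training-data"},
    "result_store": {"type": "s3", "bucket": "training-results"},
}

resp = requests.post(
    "https://dl-platform.example.com/v1/models",      # hypothetical endpoint
    json=job_spec,
    headers={"Authorization": "Bearer <api-token>"},  # per-tenant credentials
    timeout=30,
)
resp.raise_for_status()
print("Submitted training job:", resp.json().get("model_id"))
```

Once a job is accepted, a platform like this queues it, schedules it onto GPU nodes, monitors its health, and persists logs and trained models back to durable storage.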



Reviews

Brijendra Singh

Cloud computing is becoming the preferred choice in many domains because of its availability, flexibility, and scalability, among other reasons. However, it also brings challenges: adapting cloud computing to specific domains like deep learning (DL) at large scale, and maintaining and executing learning jobs in the cloud environment. This precisely written paper covers a flexible DL platform used at IBM, which has also been open sourced to the community. The authors clearly outline the challenges of installation, configuration, and fault tolerance for DL infrastructure. At the same time, they highlight key shortcomings in managing such workloads and the need for a middleware platform specialized "to support the distributed training of DL models in the cloud." The paper captures the architectural design and components of FfDL, "a cloud-hosted and multi-tenant dependable distributed DL platform used to train DL models at IBM." A study of the performance overhead of bare-metal hardware versus a cloud-hosted environment, together with a failure analysis, reveals details of the various components of the running platform. The paper should interest those involved in setting up and enhancing DL infrastructure, and it sets a clear foundation for further research on making such specialized, cloud-hosted platforms more robust, reliable, and performant.
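The review singles out fault tolerance for DL infrastructure as a central concern. A minimal sketch of the usual mitigation, periodic checkpointing so that a restarted job resumes instead of starting over, is shown below; it is a generic illustration under assumed paths and a stubbed training step, not code from the paper or from FfDL.

```python
# Minimal checkpoint/resume sketch for a long-running training loop.
# The checkpoint path and the stubbed train_one_epoch() are assumptions;
# only the pattern (persist progress, resume after restart) is the point.
import json
import os

CKPT = "checkpoint.json"  # in a platform deployment this would sit on durable, mounted storage


def train_one_epoch(epoch):
    # Stand-in for real framework code (e.g., one TensorFlow/PyTorch epoch).
    return 0.5 + 0.005 * epoch  # pretend validation accuracy


def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0, "best_acc": 0.0}


def save_checkpoint(state):
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)  # atomic rename avoids truncated checkpoint files


state = load_checkpoint()
for epoch in range(state["epoch"], 90):
    acc = train_one_epoch(epoch)
    state = {"epoch": epoch + 1, "best_acc": max(acc, state["best_acc"])}
    save_checkpoint(state)  # if the container dies here, the next run resumes at epoch + 1
```

If the training container or its node fails, the platform can reschedule the job and the loop picks up from the last saved epoch rather than from epoch 0.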

Published in

Middleware '19: Proceedings of the 20th International Middleware Conference
December 2019, 342 pages
ISBN: 9781450370097
DOI: 10.1145/3361525

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 9 December 2019


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall acceptance rate: 203 of 948 submissions, 21%
