research-article

FfDL: A Flexible Multi-tenant Deep Learning Platform

Authors:
K. R. Jayaram

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Vinod Muthusamy

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Parijat Dube

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Vatche Ishakian

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Chen Wang

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Benjamin Herta

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Scott Boag

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Diana Arroyo

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Asser Tantawi

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Archit Verma

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Falk Pollok

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

,
Rania Khalaf

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA

IBM Research, Yorktown Heights, NY and Cambridge, MA, USA
View Profile

Middleware '19: Proceedings of the 20th International Middleware ConferenceDecember 2019Pages 82–95https://doi.org/10.1145/3361525.3361538

Published:09 December 2019Publication History

Middleware '19: Proceedings of the 20th International Middleware Conference

Pages 82–95

ABSTRACT

Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale "on-premise" and "cloud-hosted" deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale.

This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various DL models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including faults that we did not anticipate, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.

References

Vinay Amatya, Abhinav Vishnu, Charles Siegel, and Jeff Daily. 2017. What does fault tolerant Deep Learning need from MPI?. In EuroMPI/USA.Google Scholar
Amazon Web Services. 2017. Amazon Sagemaker. https://aws.amazon.com/sagemaker/.Google Scholar
Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, Chiu Yuen Koo, Lukasz Lew, Clemens Mewald, Akshay Naresh Modi, Neoklis Polyzotis, Sukriti Ramesh, Sudip Roy, Steven Euijong Whang, Martin Wicke, Jarek Wilkiewicz, Xin Zhang, and Martin Zinkevich. 2017. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, New York, NY, USA, 1387--1395. https://doi.org/10.1145/3097983.3098021Google ScholarDigital Library
Mahdi Nazari Cheraghlou, Ahmad Khademzadeh, and Majid Haghparast. 2016. A survey of fault tolerance architecture in cloud computing. J. Network and Computer Applications 61 (2016), 81--92.Google ScholarDigital Library
CoreOS Inc. 2018. The ETCD Key Value Store. https://coreos.com/etcd/.Google Scholar
Docker. 2019. Docker: Enterprise Application Container Platform. https://www.docker.com/.Google Scholar
Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Keyes, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad van der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. 2011. The International Exascale Software Project roadmap. The International Journal of High Performance Computing Applications 25, 1 (2011), 3--60. https://doi.org/10.1177/1094342010391989 arXiv:https://doi.org/10.1177/1094342010391989Google ScholarDigital Library
Jeffrey Dunn. 2016. Introducing FBLearner Flow: Facebook's AI backbone. https://code.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/.Google Scholar
Ifeanyi P. Egwutuoha, David Levy, Bran Selic, and Shiping Chen. 2013. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. The Journal of Super computing 65, 3 (01 Sep 2013), 1302--1326. https://doi.org/10.1007/s11227-013-0884-0Google Scholar
Apache Software Foundation. 2019. Apache Mesos. http://mesos.apache.org.Google Scholar
James Fox, Yiming Zou, and Judy Qiu. 2016. Software Frameworks for Deep Learning at Scale.Google Scholar
Google. 2019. Kubernetes: Production Grade Container Orchestration.Google Scholar
Google Inc. 2018. Google Cloud Machine Learning Engine. https://cloud.google.com/ml-engine/.Google Scholar
Haryadi S. Gunawi, Thanh Do, Joseph M. Hellerstein, Ion Stoica, Dhruba Borthakur, and Jesse Robbins. 2011. Failure as a Service (FaaS): A Cloud Service for Large-Scale, Online Failure Drills. Technical Report UCB/EECS-2011-87. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-87.htmlGoogle Scholar
Moin Hasan and Major Singh Goraya. 2018. Fault tolerance in cloud computing environment: A systematic survey. Computers in Industry 99 (2018), 156--172. https://doi.org/10.1016/j.compind.2018.03.027Google ScholarCross Ref
Jeremy Hermann and Mike Del Baso. 2017. Meet Michelangelo: Uber's Machine Learning Platform. https://eng.uber.com/michelangelo/.Google Scholar
IBM Corporation. 2018. IBM Watson Machine Learning. https://developer.ibm.com/clouddataservices/docs/ibm-watson-machine-learning/.Google Scholar
IBM Inc. 2018. The IBM Cloud. https://www.ibm.com/cloud/bare-metal-servers.Google Scholar
Google Inc. 2018. Tensorflow: An open-source machine learning framework for everyone. https://www.tensorflow.org/.Google Scholar
Google Inc. 2018. TensorFlow CNN Benchmarks. https://www.nytimes.com/2018/04/06/opinion/sunday/germs-microbes-processed-foods.html.Google Scholar
NVIDIA Inc. 2018. NVIDIA DGX-1: Essential Instrument Of AI Research. https://www.nvidia.com/en-us/data-center/dgx-1/.Google Scholar
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).Google Scholar
Justin C. Johnson. 2018. CNN Benchmarks. https://github.com/jcjohnson/cnn-benchmarks.Google Scholar
Shaoqing Ren Kaiming He, Xiangyu Zhang and Jian Sun. 2015. Deep Residual Networks. https://github.com/KaimingHe/deep-residual-networks.Google Scholar
Israel Koren and C. Mani Krishna. 2007. Fault-Tolerant Systems (1st ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.Google Scholar
Tim Kraska, Ameet Talwalkar, and John Duchi. 2013. Mlbase: A distributed machine-learning system. In In CIDR.Google Scholar
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep Learning. In Nature, Vol. 521.436--444.Google ScholarCross Ref
Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14). USENIX Association, Berkeley, CA, USA, 583--598. http://dl.acm.org/citation.cfm?id=2685048.2685095Google ScholarDigital Library
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow. 5, 8 (April 2012), 716--727. https://doi.org/10.14778/2212351.2212354Google ScholarDigital Library
Microsoft Azure. 2018. Microsoft Azure Machine Learning. https://azure.microsoft.com/en-us/overview/machine-learning/.Google Scholar
Mongo Inc. 2018. MongoDB. https://www.mongodb.com/.Google Scholar
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 439--455. https://doi.org/10.1145/2517349.2522738Google ScholarDigital Library
NVIDIA Inc. 2014. TESLA K80. https://www.nvidia.com/en-us/data-center/teslak80/.Google Scholar
NVIDIA Inc. 2018. TESLA P100: Infinite Compute Power for the Modern Data Center. http://www.nvidia.com/object/tesla-p100.html.Google Scholar
Xinghao Pan, Shivaram Venkataraman, Zizheng Tai, and Joseph Gonzalez. 2017. Hemingway: Modeling Distributed Optimization Algorithms. arXiv preprint arXiv: 1702.05865 (2017).Google Scholar
Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch, and Gregory R. Ganger. 2018. 3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty. In Proceedings of the Thirteenth EuroSys Conference (EuroSys '18). ACM, New York, NY, USA, Article 2, 17 pages. https://doi.org/10.1145/3190508.3190515Google Scholar
Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. 2018. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters. In Proceedings of the Thirteenth EuroSys Conference (EuroSys '18). ACM. https://doi.org/10.1145/3190508.3190517Google ScholarDigital Library
Altino M. Sampaio and Jorge G. Barbosa. 2017. A Comparative Cost Study of Fault-Tolerant Techniques for Availability on the Cloud. In Ambient Intelligence - Software and Applications - 8th International Symposium on Ambient Intelligence, ISAmI 2017, Porto, Portugal, June 21-23, 2017. 263--268. https://doi.org/10.1007/978-3-319-61118-1_32Google Scholar
Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799 (2018). arXiv:1802.05799 http://arxiv.org/abs/1802.05799Google Scholar
Yogesh Sharma, Bahman Javadi, Weisheng Si, and Daniel Sun. 2016. Reliability and Energy Efficiency in Cloud Computing Systems. J. Netw. Comput. Appl. 74, C (Oct. 2016), 66--85. https://doi.org/10.1016/j.jnca.2016.08.010Google ScholarDigital Library
K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. http://www.robots.ox.ac.uk/vgg/research/very_deep/.Google Scholar
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. CoRR abs/1512.00567 (2015). arXiv:1512.00567 http://arxiv.org/abs/1512.00567Google Scholar
Asser N. Tantawi. 2015. On biasing towards optimized application placement in the cloud. In Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2015 IEEE 23rd International Symposium on.Google Scholar
Asser N. Tantawi. 2016. Solution biasing for optimized cloud workload placement. In Autonomic Computing (ICAC), 2016 IEEE 13th International Conference on.Google ScholarCross Ref
Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. 2016. Ernest: Efficient Performance Prediction for Large-scale Advanced Analytics. In 13th Usenix Conference on Networked Systems Design and Implementation (NSDI'16). USENIX Association, 363--378.Google Scholar
Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. 2008. Proactive Process-level Live Migration in HPC Environments. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08). IEEE Press, Piscataway, NJ, USA, Article 43, 12 pages. http://dl.acm.org/citation.cfm?id=1413370.1413414Google Scholar
Markus Weimer, Yingda Chen, Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Yunseong Lee, Tony Majestro, Dahlia Malkhi, Sergiy Matusevych, et al. 2015. REEF: Retainable Evaluator Execution Framework. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 1343--1355.Google ScholarDigital Library
Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. 2018. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 595--610. https://www.usenix.org/conference/osdi18/presentation/xiaoGoogle ScholarDigital Library
Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. 2015. Petuum: A New Platform for Distributed Machine Learning on Big Data.. In KDD, Longbing Cao, Chengqi Zhang, Thorsten Joachims, Geoffrey I. Webb, Dragos D. Margineantu, and Graham Williams (Eds.). ACM, 1335--1344. http://dblp.uni-trier.de/db/conf/kdd/kdd2015.html#XingHDKWLZXKY15Google Scholar
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). USENIX, San Jose, CA, 15--28. https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zahariaGoogle ScholarDigital Library
Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J. Freedman. 2017. SLAQ: Quality-Driven Scheduling for Distributed Machine Learning. In ACM Symposium on Cloud Computing. 390--404. https://doi.org/10.1145/3127479.3127490Google Scholar
K. Zhang, S. Alqahtani, and M. Demirbas. 2017. A Comparison of Distributed Machine Learning Platforms. In 2017 26th International Conference on Computer Communication and Networks (ICCCN). 1--9. https://doi.org/10.1109/ICCCN.2017.8038464Google Scholar

Index Terms

FfDL: A Flexible Multi-tenant Deep Learning Platform

Recommendations

Evaluation of gang scheduling performance and cost in a cloud computing system

Cloud Computing refers to the notion of outsourcing on-site available services, computational facilities, or data storage to an off-site, location-transparent centralized facility or "Cloud." Gang Scheduling is an efficient job scheduling algorithm for ...
Read More
Performance and cost evaluation of Gang Scheduling in a Cloud Computing system with job migrations and starvation handling
ISCC '11: Proceedings of the 2011 IEEE Symposium on Computers and Communications

Cloud Computing is an emerging technology in the area of parallel and distributed computing. Clouds consist of a collection of virtualized resources, which include both computational and storage facilities that can be provisioned on demand, depending on ...
Read More
Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques
IPDPS '00: Proceedings of the 14th International Symposium on Parallel and Distributed Processing

Two different approaches have been commonly used to address problems associated with space sharing scheduling strategies: (a) augmenting space sharing with backfilling, which performs out of order job scheduling; and (b) augmenting space sharing with ...
Read More

Reviews

Reviewer: Brijendra Singh

Cloud computing is becoming the preferred choice in many domains due to availability, flexibility, and scalability, as well as many other reasons. However, this also brings challenges, that is, adapting cloud computing to specific domains like deep learning (DL), at large scale, for maintaining and executing learning jobs in the cloud environment. Precisely written, this paper covers a flexible DL platform used at IBM. This platform has been open sourced to the community as well. The authors clearly outline the challenges faced during installation, configuration, and fault tolerance for DL infrastructure. At the same time, they highlight key shortcomings for managing workloads and the need for a middleware platform, specialized "to support the distributed training of DL models in the cloud." This paper captures the architectural design and components of FfDL, which is "a cloud-hosted and multi-tenant dependable distributed DL platform used to train DL models at IBM." A study of the performance overhead on running bare metal hardware versus a cloud-hosted environment and failure analysis reveals details of various components in the running platform. This study should interest those involved in DL infrastructure setup and enhancement. It clearly sets the foundation for further study and research to make specialized domains more robust, reliable, and distributed with the highest performance hosted in a cloud environment.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in

Middleware '19: Proceedings of the 20th International Middleware Conference
December 2019
342 pages
ISBN:9781450370097
DOI:10.1145/3361525

Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 December 2019
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
cluster scheduling
deep learning platform
fault tolerance
gang scheduling
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
Overall Acceptance Rate203of948submissions,21%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 13
  Total Citations
  View Citations
- 329
  Total Downloads
- Downloads (Last 12 months)42
- Downloads (Last 6 weeks)12
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

FfDL: A Flexible Multi-tenant Deep Learning Platform

Middleware '19: Proceedings of the 20th International Middleware Conference

ABSTRACT

References

Cited By

Index Terms

Recommendations

Evaluation of gang scheduling performance and cost in a cloud computing system

Performance and cost evaluation of Gang Scheduling in a Cloud Computing system with job migrations and starvation handling

Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

FfDL: A Flexible Multi-tenant Deep Learning Platform

Middleware '19: Proceedings of the 20th International Middleware Conference

ABSTRACT

References

Cited By

Index Terms

Recommendations

Evaluation of gang scheduling performance and cost in a cloud computing system

Performance and cost evaluation of Gang Scheduling in a Cloud Computing system with job migrations and starvation handling

Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media