ABSTRACT
Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale "on-premise" and "cloud-hosted" deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale.
This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various DL models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including faults that we did not anticipate, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.
- Vinay Amatya, Abhinav Vishnu, Charles Siegel, and Jeff Daily. 2017. What does fault tolerant Deep Learning need from MPI?. In EuroMPI/USA.Google Scholar
- Amazon Web Services. 2017. Amazon Sagemaker. https://aws.amazon.com/sagemaker/.Google Scholar
- Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, Chiu Yuen Koo, Lukasz Lew, Clemens Mewald, Akshay Naresh Modi, Neoklis Polyzotis, Sukriti Ramesh, Sudip Roy, Steven Euijong Whang, Martin Wicke, Jarek Wilkiewicz, Xin Zhang, and Martin Zinkevich. 2017. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, New York, NY, USA, 1387--1395. https://doi.org/10.1145/3097983.3098021Google ScholarDigital Library
- Mahdi Nazari Cheraghlou, Ahmad Khademzadeh, and Majid Haghparast. 2016. A survey of fault tolerance architecture in cloud computing. J. Network and Computer Applications 61 (2016), 81--92.Google ScholarDigital Library
- CoreOS Inc. 2018. The ETCD Key Value Store. https://coreos.com/etcd/.Google Scholar
- Docker. 2019. Docker: Enterprise Application Container Platform. https://www.docker.com/.Google Scholar
- Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Keyes, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad van der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. 2011. The International Exascale Software Project roadmap. The International Journal of High Performance Computing Applications 25, 1 (2011), 3--60. https://doi.org/10.1177/1094342010391989 arXiv:https://doi.org/10.1177/1094342010391989Google ScholarDigital Library
- Jeffrey Dunn. 2016. Introducing FBLearner Flow: Facebook's AI backbone. https://code.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/.Google Scholar
- Ifeanyi P. Egwutuoha, David Levy, Bran Selic, and Shiping Chen. 2013. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. The Journal of Super computing 65, 3 (01 Sep 2013), 1302--1326. https://doi.org/10.1007/s11227-013-0884-0Google Scholar
- Apache Software Foundation. 2019. Apache Mesos. http://mesos.apache.org.Google Scholar
- James Fox, Yiming Zou, and Judy Qiu. 2016. Software Frameworks for Deep Learning at Scale.Google Scholar
- Google. 2019. Kubernetes: Production Grade Container Orchestration.Google Scholar
- Google Inc. 2018. Google Cloud Machine Learning Engine. https://cloud.google.com/ml-engine/.Google Scholar
- Haryadi S. Gunawi, Thanh Do, Joseph M. Hellerstein, Ion Stoica, Dhruba Borthakur, and Jesse Robbins. 2011. Failure as a Service (FaaS): A Cloud Service for Large-Scale, Online Failure Drills. Technical Report UCB/EECS-2011-87. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-87.htmlGoogle Scholar
- Moin Hasan and Major Singh Goraya. 2018. Fault tolerance in cloud computing environment: A systematic survey. Computers in Industry 99 (2018), 156--172. https://doi.org/10.1016/j.compind.2018.03.027Google ScholarCross Ref
- Jeremy Hermann and Mike Del Baso. 2017. Meet Michelangelo: Uber's Machine Learning Platform. https://eng.uber.com/michelangelo/.Google Scholar
- IBM Corporation. 2018. IBM Watson Machine Learning. https://developer.ibm.com/clouddataservices/docs/ibm-watson-machine-learning/.Google Scholar
- IBM Inc. 2018. The IBM Cloud. https://www.ibm.com/cloud/bare-metal-servers.Google Scholar
- Google Inc. 2018. Tensorflow: An open-source machine learning framework for everyone. https://www.tensorflow.org/.Google Scholar
- Google Inc. 2018. TensorFlow CNN Benchmarks. https://www.nytimes.com/2018/04/06/opinion/sunday/germs-microbes-processed-foods.html.Google Scholar
- NVIDIA Inc. 2018. NVIDIA DGX-1: Essential Instrument Of AI Research. https://www.nvidia.com/en-us/data-center/dgx-1/.Google Scholar
- Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).Google Scholar
- Justin C. Johnson. 2018. CNN Benchmarks. https://github.com/jcjohnson/cnn-benchmarks.Google Scholar
- Shaoqing Ren Kaiming He, Xiangyu Zhang and Jian Sun. 2015. Deep Residual Networks. https://github.com/KaimingHe/deep-residual-networks.Google Scholar
- Israel Koren and C. Mani Krishna. 2007. Fault-Tolerant Systems (1st ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.Google Scholar
- Tim Kraska, Ameet Talwalkar, and John Duchi. 2013. Mlbase: A distributed machine-learning system. In In CIDR.Google Scholar
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep Learning. In Nature, Vol. 521.436--444.Google ScholarCross Ref
- Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14). USENIX Association, Berkeley, CA, USA, 583--598. http://dl.acm.org/citation.cfm?id=2685048.2685095Google ScholarDigital Library
- Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow. 5, 8 (April 2012), 716--727. https://doi.org/10.14778/2212351.2212354Google ScholarDigital Library
- Microsoft Azure. 2018. Microsoft Azure Machine Learning. https://azure.microsoft.com/en-us/overview/machine-learning/.Google Scholar
- Mongo Inc. 2018. MongoDB. https://www.mongodb.com/.Google Scholar
- Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 439--455. https://doi.org/10.1145/2517349.2522738Google ScholarDigital Library
- NVIDIA Inc. 2014. TESLA K80. https://www.nvidia.com/en-us/data-center/teslak80/.Google Scholar
- NVIDIA Inc. 2018. TESLA P100: Infinite Compute Power for the Modern Data Center. http://www.nvidia.com/object/tesla-p100.html.Google Scholar
- Xinghao Pan, Shivaram Venkataraman, Zizheng Tai, and Joseph Gonzalez. 2017. Hemingway: Modeling Distributed Optimization Algorithms. arXiv preprint arXiv: 1702.05865 (2017).Google Scholar
- Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch, and Gregory R. Ganger. 2018. 3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty. In Proceedings of the Thirteenth EuroSys Conference (EuroSys '18). ACM, New York, NY, USA, Article 2, 17 pages. https://doi.org/10.1145/3190508.3190515Google Scholar
- Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. 2018. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters. In Proceedings of the Thirteenth EuroSys Conference (EuroSys '18). ACM. https://doi.org/10.1145/3190508.3190517Google ScholarDigital Library
- Altino M. Sampaio and Jorge G. Barbosa. 2017. A Comparative Cost Study of Fault-Tolerant Techniques for Availability on the Cloud. In Ambient Intelligence - Software and Applications - 8th International Symposium on Ambient Intelligence, ISAmI 2017, Porto, Portugal, June 21-23, 2017. 263--268. https://doi.org/10.1007/978-3-319-61118-1_32Google Scholar
- Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799 (2018). arXiv:1802.05799 http://arxiv.org/abs/1802.05799Google Scholar
- Yogesh Sharma, Bahman Javadi, Weisheng Si, and Daniel Sun. 2016. Reliability and Energy Efficiency in Cloud Computing Systems. J. Netw. Comput. Appl. 74, C (Oct. 2016), 66--85. https://doi.org/10.1016/j.jnca.2016.08.010Google ScholarDigital Library
- K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. http://www.robots.ox.ac.uk/vgg/research/very_deep/.Google Scholar
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. CoRR abs/1512.00567 (2015). arXiv:1512.00567 http://arxiv.org/abs/1512.00567Google Scholar
- Asser N. Tantawi. 2015. On biasing towards optimized application placement in the cloud. In Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2015 IEEE 23rd International Symposium on.Google Scholar
- Asser N. Tantawi. 2016. Solution biasing for optimized cloud workload placement. In Autonomic Computing (ICAC), 2016 IEEE 13th International Conference on.Google ScholarCross Ref
- Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. 2016. Ernest: Efficient Performance Prediction for Large-scale Advanced Analytics. In 13th Usenix Conference on Networked Systems Design and Implementation (NSDI'16). USENIX Association, 363--378.Google Scholar
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. 2008. Proactive Process-level Live Migration in HPC Environments. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08). IEEE Press, Piscataway, NJ, USA, Article 43, 12 pages. http://dl.acm.org/citation.cfm?id=1413370.1413414Google Scholar
- Markus Weimer, Yingda Chen, Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Yunseong Lee, Tony Majestro, Dahlia Malkhi, Sergiy Matusevych, et al. 2015. REEF: Retainable Evaluator Execution Framework. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 1343--1355.Google ScholarDigital Library
- Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. 2018. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 595--610. https://www.usenix.org/conference/osdi18/presentation/xiaoGoogle ScholarDigital Library
- Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. 2015. Petuum: A New Platform for Distributed Machine Learning on Big Data.. In KDD, Longbing Cao, Chengqi Zhang, Thorsten Joachims, Geoffrey I. Webb, Dragos D. Margineantu, and Graham Williams (Eds.). ACM, 1335--1344. http://dblp.uni-trier.de/db/conf/kdd/kdd2015.html#XingHDKWLZXKY15Google Scholar
- Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). USENIX, San Jose, CA, 15--28. https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zahariaGoogle ScholarDigital Library
- Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J. Freedman. 2017. SLAQ: Quality-Driven Scheduling for Distributed Machine Learning. In ACM Symposium on Cloud Computing. 390--404. https://doi.org/10.1145/3127479.3127490Google Scholar
- K. Zhang, S. Alqahtani, and M. Demirbas. 2017. A Comparison of Distributed Machine Learning Platforms. In 2017 26th International Conference on Computer Communication and Networks (ICCCN). 1--9. https://doi.org/10.1109/ICCCN.2017.8038464Google Scholar
Index Terms
- FfDL: A Flexible Multi-tenant Deep Learning Platform
Recommendations
Evaluation of gang scheduling performance and cost in a cloud computing system
Cloud Computing refers to the notion of outsourcing on-site available services, computational facilities, or data storage to an off-site, location-transparent centralized facility or "Cloud." Gang Scheduling is an efficient job scheduling algorithm for ...
Performance and cost evaluation of Gang Scheduling in a Cloud Computing system with job migrations and starvation handling
ISCC '11: Proceedings of the 2011 IEEE Symposium on Computers and CommunicationsCloud Computing is an emerging technology in the area of parallel and distributed computing. Clouds consist of a collection of virtualized resources, which include both computational and storage facilities that can be provisioned on demand, depending on ...
Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques
IPDPS '00: Proceedings of the 14th International Symposium on Parallel and Distributed ProcessingTwo different approaches have been commonly used to address problems associated with space sharing scheduling strategies: (a) augmenting space sharing with backfilling, which performs out of order job scheduling; and (b) augmenting space sharing with ...
Comments