ABSTRACT
Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components: a learner for generating models based on training data, modules for analyzing and validating both data and models, and infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt.
We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions.
We present a case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrives. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.
Index Terms
- TFX: A TensorFlow-Based Production-Scale Machine Learning Platform