ABSTRACT
Alongside the development of systems for scalable machine learning and collaborative data science, there is a growing trend toward publicly shared data science projects hosted on general-purpose or dedicated hosting services, such as GitHub and DataHub. The artifacts of these hosted projects are rich and include not only text files but also versioned datasets, trained models, project documents, etc. Given the fast pace and expectations of data science activities, model discovery, i.e., finding relevant data science projects to reuse, is an important task in the context of data management for end-to-end machine learning. In this paper, we study this task and present ongoing work on ModelHub Discovery, a system for finding relevant models in hosted data science projects. Instead of prescribing a structured data model for data science projects, we take an information retrieval approach and decompose the discovery task into three major steps: project query and matching, model comparison and ranking, and processing and building ensembles with the returned models. We describe the motivation and desiderata, propose techniques, and present opportunities and challenges for model discovery in hosted data science projects.