On Model Discovery For Hosted Data Science Projects

Authors:
Hui Miao, Ang Li, Larry S. Davis, Amol Deshpande
Department of Computer Science, University of Maryland

DEEM'17: Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning, May 2017, Article No.: 6, Pages 1–4, https://doi.org/10.1145/3076246.3076252

Published: 14 May 2017

ABSTRACT

Alongside the development of systems for scalable machine learning and collaborative data science, there is a growing trend toward publicly shared data science projects hosted on general-purpose or dedicated hosting services, such as GitHub and DataHub. The artifacts of hosted projects are rich and include not only text files but also versioned datasets, trained models, and project documents. Given the fast pace and high expectations of data science work, model discovery, i.e., finding relevant data science projects to reuse, is an important task in the context of data management for end-to-end machine learning. In this paper, we study the task and present ongoing work on ModelHub Discovery, a system for finding relevant models in hosted data science projects. Instead of prescribing a structured data model for data science projects, we take an information retrieval approach and decompose the discovery task into three major steps: project query and matching, model comparison and ranking, and processing and building ensembles with returned models. We describe the motivation and desiderata, propose techniques, and present opportunities and challenges for model discovery for hosted data science projects.
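The three steps named in the abstract can be illustrated with a toy sketch. It is not the ModelHub Discovery implementation, which the abstract does not detail. The techniques used here (TF-IDF cosine matching, accuracy-based ranking, majority voting) and every name (`match_projects`, `rank_models`, `build_ensemble`, the project dictionaries and their fields) are assumptions chosen purely for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_projects(query, projects):
    """Step 1 (assumed technique): TF-IDF match of a keyword query
    against each project's free-text description."""
    docs = [tokenize(p["description"]) for p in projects]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, v), p) for v, p in zip(vecs, projects)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored if score > 0]

def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)

def rank_models(projects, examples):
    """Step 2 (assumed technique): rank matched models by accuracy
    on the user's own labeled examples."""
    return sorted(projects, key=lambda p: accuracy(p["model"], examples),
                  reverse=True)

def build_ensemble(projects, k=3):
    """Step 3 (assumed technique): majority vote over the top-k models."""
    members = [p["model"] for p in projects[:k]]
    def predict(x):
        return Counter(m(x) for m in members).most_common(1)[0][0]
    return predict

# Hypothetical hosted projects; each "model" is a plain callable.
projects = [
    {"name": "parity-rule", "description": "parity classifier for digits",
     "model": lambda x: x % 2},
    {"name": "small-tree", "description": "decision tree classifier for small digits",
     "model": lambda x: x % 2 if x < 8 else 0},
    {"name": "always-one", "description": "constant baseline classifier",
     "model": lambda x: 1},
    {"name": "sales-forecast", "description": "weekly retail sales regression",
     "model": lambda x: 0},
]
# Toy task: predict the parity of a digit.
examples = [(n, n % 2) for n in range(10)]

matched = match_projects("digits classifier", projects)
ranked = rank_models(matched, examples)
vote = build_ensemble(ranked, k=3)
print([p["name"] for p in ranked], accuracy(vote, examples))
```

In this toy setup the unrelated project is filtered out at the matching step, and the ensemble is built only from models that survived both matching and ranking, which mirrors the order of the three steps in the abstract.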

Published in

DEEM'17: Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning
May 2017
36 pages
ISBN: 9781450350266
DOI: 10.1145/3076246

Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 23 of 37 submissions, 62%