DOI: 10.1145/3076246.3076252

On Model Discovery For Hosted Data Science Projects

Published: 14 May 2017

ABSTRACT

Alongside the development of systems for scalable machine learning and collaborative data science, there is an increasing trend toward publicly shared data science projects, hosted on general or dedicated hosting services such as GitHub and DataHub. The artifacts of hosted projects are rich and include not only text files but also versioned datasets, trained models, and project documents. Given the fast pace and expectations of data science activities, model discovery, i.e., finding relevant data science projects to reuse, is an important task in data management for end-to-end machine learning. In this paper, we study this task and present ongoing work on ModelHub Discovery, a system for finding relevant models in hosted data science projects. Instead of prescribing a structured data model for data science projects, we take an information retrieval approach and decompose the discovery task into three major steps: project query and matching, model comparison and ranking, and processing and building ensembles with the returned models. We describe the motivation and desiderata, propose techniques, and present opportunities and challenges for model discovery for hosted data science projects.
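The abstract names the three steps but not the techniques behind them, so the following is only a minimal sketch of how such a pipeline could be wired together. Everything in it is an assumption made for illustration and not ModelHub Discovery's method: the project dictionaries, the function names, Jaccard token overlap for matching, accuracy on a few user-supplied labeled samples for ranking, and a plain majority vote for the ensemble.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately crude stand-in)."""
    return set(text.lower().split())


def match_projects(query, projects, min_score=0.1):
    """Step 1 (assumed): keep projects whose description overlaps the query.

    Overlap is measured with Jaccard similarity over token sets.
    """
    q = tokenize(query)
    scored = []
    for project in projects:
        d = tokenize(project["description"])
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        if score >= min_score:
            scored.append((score, project))
    # Sort on the score only, so dictionaries are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [project for _, project in scored]


def rank_models(projects, samples):
    """Step 2 (assumed): rank every model by accuracy on labeled samples."""
    ranked = []
    for project in projects:
        for name, predict in project["models"].items():
            correct = sum(predict(x) == y for x, y in samples)
            ranked.append((correct / len(samples), name, predict))
    ranked.sort(key=lambda row: row[0], reverse=True)
    return ranked


def majority_vote(ranked, k=3):
    """Step 3 (assumed): majority-vote ensemble over the top-k models."""
    top = [predict for _, _, predict in ranked[:k]]

    def ensemble(x):
        return Counter(model(x) for model in top).most_common(1)[0][0]

    return ensemble


# Hypothetical hosted projects; the "models" are toy sign classifiers.
projects = [
    {"name": "p1", "description": "sign classifier for integers",
     "models": {"threshold_zero": lambda x: x > 0,
                "always_true": lambda x: True}},
    {"name": "p2", "description": "image segmentation network",
     "models": {"unrelated": lambda x: False}},
    {"name": "p3", "description": "integer sign detection baseline",
     "models": {"threshold_one": lambda x: x > 1}},
]
samples = [(-2, False), (-1, False), (1, True), (3, True)]

matched = match_projects("integer sign classifier", projects)
ranked = rank_models(matched, samples)
ensemble = majority_vote(ranked, k=3)
```

In this toy run the unrelated project is filtered out in the first step, the remaining three models are ordered by their accuracy on the four samples, and the ensemble returns whichever label most of them agree on. A real system would need richer matching signals than description text and a ranking that accounts for the versioned datasets and other artifacts the abstract mentions.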


  • Published in

    DEEM'17: Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning
    May 2017
    36 pages
    ISBN:9781450350266
    DOI:10.1145/3076246

    Copyright © 2017 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • short-paper
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 23 of 37 submissions, 62%
