DOI: 10.1145/2882903.2899409
research-article
Public Access

ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning

Published: 26 June 2016

ABSTRACT

Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean, a progressive framework for training Machine Learning models with data cleaning. Our framework updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection. We designed a visual interface to wrap around this framework and demonstrate ActiveClean for a video classification problem and a topic modeling problem.
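
The abstract's idea of updating a model as small batches are cleaned, with importance weighting, can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' algorithm or API: it uses a one-feature linear model with squared loss, a hypothetical `clean_fn` oracle standing in for the analyst, and hand-picked sampling weights standing in for a dirty data detector. All function names and parameters are illustrative.

```python
import random

def predict(theta, x):
    """Linear model prediction: dot product of weights and features."""
    return sum(t * xi for t, xi in zip(theta, x))

def clean_and_update(theta, records, clean_fn, weights, batch_size, lr, rng):
    """Sample a batch, clean it, and take one importance-weighted gradient step.

    clean_fn is a hypothetical stand-in for the analyst's cleaning action.
    weights are unnormalized sampling preferences (higher = more likely dirty).
    """
    n = len(records)
    total = sum(weights)
    probs = [w / total for w in weights]
    batch = rng.choices(range(n), weights=probs, k=batch_size)
    grad = [0.0] * len(theta)
    for i in batch:
        x, y = clean_fn(records[i])
        err = predict(theta, x) - y
        # Dividing by n * p_i keeps the batch gradient an unbiased estimate
        # of the mean gradient over the fully cleaned data.
        iw = 1.0 / (n * probs[i])
        for j, xj in enumerate(x):
            grad[j] += iw * err * xj / batch_size
    return [t - lr * g for t, g in zip(theta, grad)]

def run_demo(seed=0):
    """Toy data: true relation y = 2x; every fifth label is corrupted to 0."""
    rng = random.Random(seed)
    records, weights = [], []
    for i in range(50):
        x = (i % 10 + 1) / 10.0
        dirty = (i % 5 == 0)
        records.append(([x], 0.0 if dirty else 2.0 * x))
        # Assumption: a detector flags suspicious records; sample them 3x as often.
        weights.append(3.0 if dirty else 1.0)
    # Assumption: cleaning restores the true label (an oracle, for illustration).
    clean_fn = lambda rec: (rec[0], 2.0 * rec[0][0])
    theta = [0.0]
    for _ in range(200):
        theta = clean_and_update(theta, records, clean_fn, weights, 10, 0.5, rng)
    return theta
```

Calling `run_demo()` returns the learned weight after 200 small cleaning batches. Because records flagged as likely dirty are sampled more often, the importance weight `1 / (n * p_i)` is what keeps the gradient estimate unbiased despite the skewed sampling.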

    • Published in

      SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
      June 2016
      2300 pages
      ISBN: 9781450335317
      DOI: 10.1145/2882903

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
