ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning

Authors:

Sanjay Krishnan,

Michael J. Franklin,

Eugene Wu

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 2117 - 2120

https://doi.org/10.1145/2882903.2899409

Published: 26 June 2016

Abstract

Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean, a progressive framework for training Machine Learning models with data cleaning. Our framework updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection. We designed a visual interface to wrap around this framework and demonstrate ActiveClean for a video classification problem and a topic modeling problem.
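
The abstract describes the iterative update only at a high level. As a rough illustration, and not the authors' implementation, the sketch below shows one way a model could be updated as small batches of dirty records are cleaned, using importance weighting so that non-uniform sampling does not bias the gradient estimate. Everything specific here is an assumption made for the example: the one-parameter least-squares model, the helper names (`grad`, `weighted_gradient`, `update`, `clean_record`, `run`), the sampling score proportional to `x`, and the synthetic data with true slope 2.

```python
import random

def grad(theta, x, y):
    """Gradient of the squared loss 0.5 * (theta * x - y) ** 2 with respect to theta."""
    return (theta * x - y) * x

def weighted_gradient(theta, batch, probs, n_dirty):
    """Importance-weighted estimate of the mean gradient over all n_dirty dirty records.

    batch holds cleaned (x, y) pairs drawn with replacement from the dirty set;
    probs[j] is the probability with which batch[j] was drawn.
    """
    total = sum(grad(theta, x, y) / p for (x, y), p in zip(batch, probs))
    return total / (len(batch) * n_dirty)

def update(theta, clean, batch, probs, n_dirty, lr):
    """One gradient step mixing the exact clean-data gradient with the sampled estimate."""
    n = len(clean) + n_dirty
    g_clean = sum(grad(theta, x, y) for x, y in clean) / len(clean) if clean else 0.0
    g_dirty = weighted_gradient(theta, batch, probs, n_dirty)
    return theta - lr * ((len(clean) / n) * g_clean + (n_dirty / n) * g_dirty)

def clean_record(record):
    """Hypothetical stand-in for the analyst: restores the true label y = 2x."""
    x, _corrupted_y = record
    return (x, 2.0 * x)

def run(seed=0, batch_size=2, lr=0.3):
    rng = random.Random(seed)
    xs = [(i + 1) / 10 for i in range(20)]
    # Assumed synthetic data: every fourth record has a corrupted label.
    clean = [(x, 2.0 * x) for i, x in enumerate(xs) if i % 4 != 3]
    dirty = [(x, 2.0 * x + 5.0) for i, x in enumerate(xs) if i % 4 == 3]
    theta = 0.0
    while dirty:
        scores = [x for x, _ in dirty]  # assumed sampling score, proportional to x
        total = sum(scores)
        idx = rng.choices(range(len(dirty)), weights=scores, k=batch_size)
        batch = [clean_record(dirty[i]) for i in idx]
        probs = [scores[i] / total for i in idx]
        theta = update(theta, clean, batch, probs, len(dirty), lr)
        # Move the cleaned records (deduplicated) from the dirty set to the clean set.
        picked = set(idx)
        clean.extend(clean_record(dirty[i]) for i in sorted(picked))
        dirty = [r for i, r in enumerate(dirty) if i not in picked]
    return theta

if __name__ == "__main__":
    print(run())
```

Dividing each sampled gradient by its sampling probability is what keeps the estimate unbiased when some records are more likely to be picked than others. The corrupted labels never enter a gradient step, so the model only ever sees data that has been cleaned.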

Index Terms

Information systems → Information systems applications → Data mining → Data cleaning

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016, 2300 pages

ISBN: 9781450335317

DOI: 10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
