skip to main content
10.1145/2723372.2737786acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Query-Oriented Data Cleaning with Oracles

Published:27 May 2015Publication History

ABSTRACT

As key decisions are often made based on information contained in a database, it is important for the database to be as complete and correct as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, data cleaning tools provide only best-effort results and usually cannot eradicate all errors that may exist in a database. Even more importantly, existing data cleaning tools do not typically address the problem of determining what information is missing from a database.

To overcome the limitations of existing data cleaning techniques, we present QOCO, a novel query-oriented system for cleaning data with oracles. Under this framework, incorrect (resp. missing) tuples are removed from (added to) the result of a query through edits that are applied to the underlying database, where the edits are derived by interacting with domain experts which we model as oracle crowds. We show that the problem of determining minimal interactions with oracle crowds to derive database edits for removing (adding) incorrect (missing) tuples to the result of a query is NP-hard in general and present heuristic algorithms that interact with oracle crowds. Finally, we implement our algorithms in our prototype system QOCO and show that it is effective and efficient through a comprehensive suite of experiments.

References

  1. Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. Proceedings of the VLDB Endowment, 7(11), 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart. Crowd mining. In SIGMOD, pages 241--252, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. T. Arora, R. Ramakrishnan, W. G. Roth, P. Seshadri, and D. Srivastava. Explaining program execution in deductive systems. In DOOD, pages 101--119, 1993.Google ScholarGoogle ScholarCross RefCross Ref
  4. I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. N. Bidoit, M. Herschel, and K. Tzompanaki. Query-based why-not provenance with nedexplain. In EDBT, pages 145--156, 2014.Google ScholarGoogle Scholar
  6. M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. KDD, pages 39--48, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. P. Buneman, J. Cheney, W. C. Tan, and S. Vansummeren. Curated databases. In ACM PODS, pages 1--12, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316--330, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. Buneman, S. Khanna, and W. C. Tan. On propagation of deletions and annotations through views. In PODS, pages 150--158, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. A. Chapman and H. V. Jagadish. Why not? In SIGMOD, pages 523--534, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. Charika, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. PODS, pages 268--279, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379--474, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Cia world fact book. https://www.cia.gov/library/publications/the-world-factbook/.Google ScholarGoogle Scholar
  14. Y. Cui and J. Widom. Run-time translation of view tuple deletions using data lineage. Technical Report 2001--24, Stanford InfoLab, 2001.Google ScholarGoogle Scholar
  15. Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM TODS, 25(2):179--227, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. T. Dasu and T. Johnson. Exploratory data mining and data cleaning. In Wiley, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. P. Domingos. Multi-relational record linkage. In In Proceedings of the KDD-2004 Workshop on Multi-Relational Data Mining. Citeseer, 2004.Google ScholarGoogle Scholar
  18. J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. JACM, 19(2):284--264, 1972. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. J. Evermanna and J. Fangb. Evaluating ontologies: Towards a cognitive measure of quality. Information Systems, 35(4):391--403, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Cerfix: A system for cleaning data with certain fixes. In PVLDB, pages 1375--1378, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. U. M. Fayyad. Mining databases: Towards algorithms for knowledge discovery. IEEE Data Eng. Bull., 21(1):39--48, 1998.Google ScholarGoogle Scholar
  23. U. Feige, M. Langberg, and K. Nissim. On the hardness of approximating NP witnesses. In K. Jansen and S. Khuller, editors, Approximation Algorithms for Combinatorial Optimization, volume 1913 of LNCS, pages 120--131. Springer, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Fifa official site. http://www.fifa.com/.Google ScholarGoogle Scholar
  25. Fifa trivia quizzes and games. http://www.sporcle.com/games/g/fifaworldcup.Google ScholarGoogle Scholar
  26. Fifa world cup trivia. http://en.trivia.fifa.com/.Google ScholarGoogle Scholar
  27. M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: Answering queries with crowdsourcing. In SIGMOD, pages 61--72, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. V. Ganti and A. D. Sarma. Data Cleaning: A Practical Perspective. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. S. Ghosh, N. Sharma, F. Benevenuto, N. Ganguly, and K. Gummadi. Cognos: crowdsourcing search for topic experts in microblogs. SIGIR, pages 575--590, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In ACM PODS, pages 31--40, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Z. He and E. Lo. Answering why-not questions on top-k queries. In ICDE, pages 750--761, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. M. Herschel and M. A. Hernández. Explaining missing answers to SPJUA queries. PVLDB, 3(1):185--196, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. M. Herschel, M. A. Hernández, and W. C. Tan. Artemis: A system for analyzing missing answers. PVLDB, 2(2):1550--1553, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Huang, T. Chen, A. Doan, and J. F. Naughton. On the provenance of non-answers to queries over extracted data. PVLDB, 1(1):736--747, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. N. Q. V. Hung, N. T. Tam, Z. Mikló, K. Aberer, A. Gal, and M. Weidlich. Pay-as-you-go reconciliation in schema matching networks. In ICDE, pages 220--231, 2014.Google ScholarGoogle Scholar
  36. P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. HCOMP, pages 64--67, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. J. Jiao, J. Yan, H. Zhao, and W. Fan. Expertrank: An expert user ranking algorithm in online communities. NISS, pages 674--679, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. B. Kanagal, J. Li, and A. Deshpande. Sensitivity analysis and explanations for robust query evaluation in probabilistic databases. In ACM SIGMOD, pages 841--852. ACM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. NIPS, pages 1953--1961, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. R. M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85--103, Plenum Press, New York, 1972.Google ScholarGoogle ScholarCross RefCross Ref
  41. B. Kimelfeld, J. Vondrák, and R. Williams. Maximizing conjunctive views in deletion propagation. In ACM PODS, pages 187--198, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. A. Marcus, D. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. PVLDB, 6(2):109--120, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. R. Mccann, W. Shen, and A. Doan. Matching schemas in online communities: A web 2.0 approach. In ICDE, pages 110--119, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. The complexity of causality and responsibility for query answers and non-answers. PVLDB, 4(1):34--45, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. R. J. Miller, L. M. Haas, and M. A. Hernández. Schema mapping as query discovery. In VLDB, pages 77--88, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. T. Milo and S. Zohar. Using schema matching to simplify heterogeneous data translation. In VLDB, volume 98, pages 24--27. Citeseer, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. A. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowdscreen: Algorithms for filtering data with humans. In SIGMOD, pages 361--372, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. A. G. Parameswaran, S. Boyd, H. Garcia-Molina, A. Gupta, N. Polyzotis, and J. Widom. Optimal crowd-powered rating and filtering algorithms. VLDB, 7(9):685--696, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. H. Park, R. Pang, A. Parameswaran, H. Garcia-Molina, N. Polyzotis, and J. Widom. An overview of the deco system: data model and query language; query processing and optimization. In SIGMOD, pages 22--27, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. H. Park and J. Widom. Crowdfill: collecting structured data from the crowd. In SIGMOD, pages 577--588, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. M. J. Raddick, G. Bracey, P. L. Gay, C. J. Lintott, P. Murray, K. Schawinski, A. S. Szalay, and J. Vandenberg. Galaxy zoo: exploring the motivations of citizen science volunteers. Astronomy Education Review, 9(1), 2010.Google ScholarGoogle ScholarCross RefCross Ref
  52. E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google ScholarGoogle Scholar
  53. V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB, pages 381--390, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: whom to trust when everyone lies a bit. ICML, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. C. Sapia, G. Höfling, M. Müller, C. Hausdorf, H. Stoyan, and U. Grimmer. On supporting the data warehouse design by data mining techniques. In Proc. GI-Workshop Data Mining and Data Warehousing, page 63. Citeseer, 1999.Google ScholarGoogle Scholar
  56. S. Sarawagi. Explaining differences in multidimensional aggregates. In VLDB, pages 42--53, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge unifying wordnet and wikipedia. In WWW, pages 697--706, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Q. T. Tran and C.-Y. Chan. How to conquer why-not questions. In SIGMOD Conference, pages 15--26, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Uniprot. http://www.uniprot.org/.Google ScholarGoogle Scholar
  61. J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(10):1483--1494, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD, pages 469--480, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Open football. https://github.com/openfootball/.Google ScholarGoogle Scholar
  64. World cup history. http://www.worldcup-history.com/.Google ScholarGoogle Scholar
  65. S. E. Whang, H. Garcia-Molina, and P. Lofgren. Question selection for crowd entity resolution. PVLDB, 6(6):349--360, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. J. Xu, D. V. Kalashnikov, and S. Mehrotra. Query aware determinization of uncertain objects. IEEE Trans. Knowl. Data Eng., 27(1):207--221, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  67. C. J. Zhang, Z. Zhao, L. Chen, H. V. Jagadish, and C. C. Cao. Crowdmatcher: crowd-assisted schema matching. In SIGMOD, pages 721--724, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Query-Oriented Data Cleaning with Oracles

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
      May 2015
      2110 pages
      ISBN:9781450327589
      DOI:10.1145/2723372

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 May 2015

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader