skip to main content
10.1145/1247480.1247521acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Benchmarking declarative approximate selection predicates

Published:11 June 2007Publication History

ABSTRACT

Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Over the last few years several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc) and have been fully expressed using declarative SQL statements. In this paper we propose new similarity predicates along with their declarative realization, based on notions of probabilistic information retrieval. In particular we show how language models and hidden Markov models can be utilized as similarity predicates for data quality and present their full declarative instantiation. We also show how other scoring methods from information retrieval, can be utilized in a similar setting. We then present full declarative specifications of previously proposed similarity predicates in the literature, grouping them into classes according to their primary characteristics. Finally, we present a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations. We quantify both their runtime performance as well as their accuracy for several types of common quality problems encountered in operational databases.

References

  1. R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB '02. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB '06. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD '03. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE '06. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD '98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb '03.Google ScholarGoogle Scholar
  8. J. B. Copas and F. J. Hilton. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society, 1990.Google ScholarGoogle Scholar
  9. I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 1969.Google ScholarGoogle Scholar
  10. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB '01. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB '01. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins for web data integration. In WWW 2003: 90--101. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. A. Jaro. Advances in record linkage methodology as applied to matching the 1985 census of tampa. Journal of the American Statistical Association, 1984.Google ScholarGoogle Scholar
  16. N. Koudas, A. Marathe, and D. Srivastava. Flexible string matching against large databases in practice. In VLDB '04. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. C. D. Manning and H. Schütze. Foundations of statistical natural language processing, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In SIGIR '99. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR '98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, 1989.Google ScholarGoogle ScholarCross RefCross Ref
  22. S. Robertson. Understanding inverse document frequency: on theoretical arguments. Journal of Documentation, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  23. S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne. Okapi at trec-4. In TREC '95.Google ScholarGoogle Scholar
  24. G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 1988.Google ScholarGoogle Scholar
  25. G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD '04. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. W. E. Winkler. The state of record linkage and current research problems. US Bureau of the Census, 1999.Google ScholarGoogle Scholar

Index Terms

  1. Benchmarking declarative approximate selection predicates

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
      June 2007
      1210 pages
      ISBN:9781595936868
      DOI:10.1145/1247480
      • General Chairs:
      • Lizhu Zhou,
      • Tok Wang Ling,
      • Program Chair:
      • Beng Chin Ooi

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 11 June 2007

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader