Article

Benchmarking declarative approximate selection predicates

Authors:
Amit Chandel

University of Toronto, Toronto, ON, Canada

University of Toronto, Toronto, ON, Canada
View Profile

,
Oktie Hassanzadeh

University of Toronto, Toronto, ON, Canada

University of Toronto, Toronto, ON, Canada
View Profile

,
Nick Koudas

University of Toronto, Toronto, ON, Canada

University of Toronto, Toronto, ON, Canada
View Profile

,
Mohammad Sadoghi

University of Toronto, Toronto, ON, Canada

University of Toronto, Toronto, ON, Canada
View Profile

,
Divesh Srivastava

AT&T Labs-Research, Florham Park, NJ

AT&T Labs-Research, Florham Park, NJ
View Profile

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of dataJune 2007Pages 353–364https://doi.org/10.1145/1247480.1247521

Published:11 June 2007Publication History

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data

Pages 353–364

ABSTRACT

Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Over the last few years several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc) and have been fully expressed using declarative SQL statements. In this paper we propose new similarity predicates along with their declarative realization, based on notions of probabilistic information retrieval. In particular we show how language models and hidden Markov models can be utilized as similarity predicates for data quality and present their full declarative instantiation. We also show how other scoring methods from information retrieval, can be utilized in a similar setting. We then present full declarative specifications of previously proposed similarity predicates in the literature, grouping them into classes according to their primary characteristics. Finally, we present a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations. We quantify both their runtime performance as well as their accuracy for several types of common quality problems encountered in operational databases.

References

R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB '02. Google ScholarDigital Library
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB '06. Google ScholarDigital Library
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 2000. Google ScholarDigital Library
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD '03. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE '06. Google ScholarDigital Library
W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD '98. Google ScholarDigital Library
W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb '03.Google Scholar
J. B. Copas and F. J. Hilton. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society, 1990.Google Scholar
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 1969.Google Scholar
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB '01. Google ScholarDigital Library
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB '01. Google ScholarDigital Library
L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins for web data integration. In WWW 2003: 90--101. Google ScholarDigital Library
D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997. Google ScholarDigital Library
M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 1998. Google ScholarDigital Library
M. A. Jaro. Advances in record linkage methodology as applied to matching the 1985 census of tampa. Journal of the American Statistical Association, 1984.Google Scholar
N. Koudas, A. Marathe, and D. Srivastava. Flexible string matching against large databases in practice. In VLDB '04. Google ScholarDigital Library
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06. Google ScholarDigital Library
C. D. Manning and H. Schütze. Foundations of statistical natural language processing, 1999. Google ScholarDigital Library
D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In SIGIR '99. Google ScholarDigital Library
J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR '98. Google ScholarDigital Library
L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, 1989.Google ScholarCross Ref
S. Robertson. Understanding inverse document frequency: on theoretical arguments. Journal of Documentation, 2004.Google ScholarCross Ref
S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne. Okapi at trec-4. In TREC '95.Google Scholar
G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 1988.Google Scholar
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1986. Google ScholarDigital Library
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD '04. Google ScholarDigital Library
W. E. Winkler. The state of record linkage and current research problems. US Bureau of the Census, 1999.Google Scholar

Index Terms

Benchmarking declarative approximate selection predicates
1. Information systems
  1. Information retrieval

Recommendations

Comparing NoSQL MongoDB to an SQL DB
ACMSE '13: Proceedings of the 51st ACM Southeast Conference

NoSQL database solutions are becoming more and more prevalent in a world currently dominated by SQL relational databases. NoSQL databases were designed to provide database solutions for large volumes of data that is not structured. However, the ...
Read More
A tactic language for declarative proofs
ITP'10: Proceedings of the First international conference on Interactive Theorem Proving

Influenced by the success of the Mizar system many declarative proof languages have been developed in the theorem prover community, as declarative proofs are more readable, easier to modify and to maintain than their procedural counterparts. However, ...
Read More
Making standard ML a practical database programming language
ICFP '11: Proceedings of the 16th ACM SIGPLAN international conference on Functional programming

Integrating a database query language into a programming language is becoming increasingly important in recently emerging high-level cloud computing and other applications, where efficient and sophisticated data manipulation is required during ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
June 2007
1210 pages
ISBN:9781595936868
DOI:10.1145/1247480
General Chairs:
Lizhu Zhou
Tsinghua University, China
,
Tok Wang Ling
National University of Singapore, Singapore
,
Program Chair:
Beng Chin Ooi
National University of Singapore, Singapore
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 June 2007
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
SQL
accuracy
data cleaning
declarative data quality
performance
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 62
  Total Citations
  View Citations
- 53
  Total Downloads
- Downloads (Last 12 months)4
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Benchmarking declarative approximate selection predicates

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Comparing NoSQL MongoDB to an SQL DB

A tactic language for declarative proofs

Making standard ML a practical database programming language

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Benchmarking declarative approximate selection predicates

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Comparing NoSQL MongoDB to an SQL DB

A tactic language for declarative proofs

Making standard ML a practical database programming language

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media