Article

Leveraging aggregate constraints for deduplication

Authors:

Surajit Chaudhuri,

Anish Das Sarma,

Venkatesh Ganti,

Raghav KaushikAuthors Info & Claims

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data

Pages 437 - 448

https://doi.org/10.1145/1247480.1247530

Published: 11 June 2007 Publication History

Abstract

We show that aggregate constraints (as opposed to pairwise constraints) that often arise when integrating multiple sources of data, can be leveraged to enhance the quality of deduplication. However, despite its appeal, we show that the problem is challenging, both semantically and computationally. We define a restricted search space for deduplication that is intuitive in our context and we solve the problem optimally for the restricted space. Our experiments on real data show that incorporating aggregate constraints significantly enhances the accuracy of deduplication.

References

[1]

The K-Means Clustering Algorithm. http://mathworld.wolfram.com/K-MeansClusteringAlgorithm.html.

[2]

Association for computing machinery. http://www.acm.org.

[3]

R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proceedings of the 27th International Conference on Very Large Databases, 2002.

Digital Library

[4]

J. A. Aslam, K. Pelehov, and D. Rus. A practical clustering algorithm for static and dynamic information organization. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1999.

Digital Library

[5]

N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Mach. Learn., 56(1-3):89--113, 2002.

Digital Library

[6]

I. Bhattacharya and L. Getoor. Collective Entity Resolution In Relational Data. In Data Engineering Bulletin, 2006.

[7]

M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML '04: Proceedings of the twenty-first international conference on Machine learning, page 11, New York, NY, USA, 2004. ACM Press.

Digital Library

[8]

D. Bitton and D. DeWitt. Duplicate record elimination in large data files. ACM Transactions on database systems(TODS), 8(2), 1983.

Digital Library

[9]

P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A costbased model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005.

Digital Library

[10]

S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the ACM SIGMOD, June 2003.

Digital Library

[11]

S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In In proceedings of the 21st international conference on data engineering (ICDE), Tokyo, Japan, April 2005.

Digital Library

[12]

J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. In Information and Computation, 2005.

Digital Library

[13]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. McGraw Hill, 2001.

Digital Library

[14]

I. Davidson, K. Wagstaff, and S. Basu. Measuring constraint-set utility for partitional clustering algorithms. In PKDD, 2006.

Digital Library

[15]

Dblp. http://www.informatik.uni-trier.de/ ley/db/index.html.

[16]

X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In SIGMOD, 2005.

Digital Library

[17]

A. Elmagarmid, P. G. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. In Information Systems Working Papers, 2006.

[18]

I. P. Felligi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183--1210, 1969.

[19]

M. R. Garey and D. S. Johnson. Computers and Intractability. W. H. Freeman and Company, 1979.

Digital Library

[20]

G. Greco, S. Greco, and E. Zumpano. A logical framework for querying and repairing inconsistent databases. IEEE Trans. Knowl. Data Engineering, 15(6), 2003.

Digital Library

[21]

M. M. Halldorsson. Approximations of weighted independent set and hereditary subset problems. Journal of Graph Algorithms and Applications, 2000.

Digital Library

[22]

M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD, pages 127--138, San Jose, CA, May 1995.

Digital Library

[23]

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

Digital Library

[24]

J. Wijsen. Condensed representation of database repairs for consistent query answering. In ICDT, 2003.

Digital Library

[25]

I. Knowledge Partners. Business rules applied. http://www.kpiusa.com.

[26]

N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD, 2006.

Digital Library

[27]

Lavastorm. Making the case for automated revenue assurance solutions. http://www.lavastormtech.com.

[28]

A. McCallum, K. Nigam, and L. Unger. Efficient clustering of high-dimensional data sets with applications to reference matching. In Proceedings of the ACM SIGKDD international conference on knowledge discovery in databases, pages 169--178, San Francisco, CA, 2000.

Digital Library

[29]

A. Monge and C. Elkan. An efficient domain independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Arizona, May 1997.

[30]

Trillium Inc. www.trilliumsoft.com/trilliumsoft.nsf.

[31]

A. K. H. Tung, R. T. Ng, L. V. S. Lakshmanan, and J. Han. Constraint-based clustering in large databases. In ICDT, 2001.

Digital Library

Cited By

Lee WTkachenko MLauw H(2023)Robust Bidirectional Poly-MatchingIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.326648035:10(10762-10774)Online publication date: 1-Oct-2023
https://doi.org/10.1109/TKDE.2023.3266480
Abelló ACheney J(2023)Eris: efficiently measuring discord in multidimensional sourcesThe VLDB Journal10.1007/s00778-023-00810-333:2(399-423)Online publication date: 20-Sep-2023
https://doi.org/10.1007/s00778-023-00810-3
Li BTian WLi RXiao WFu ZYe XDuan RLi YXu Z(2022)Cross-domain Resemblance Detection based on Meta-learning for Cloud Storage2022 IEEE International Performance, Computing, and Communications Conference (IPCCC)10.1109/IPCCC55026.2022.9894349(374-379)Online publication date: 11-Nov-2022
https://doi.org/10.1109/IPCCC55026.2022.9894349
Show More Cited By

Index Terms

Leveraging aggregate constraints for deduplication
1. Information systems
  1. Data management systems
    1. Database management system engines

Recommendations

A study of practical deduplication

We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of ...
Storage Deduplication by Virtual Large-Scale Disks
NBIS '12: Proceedings of the 2012 15th International Conference on Network-Based Information Systems

Recently, the demand of low cost large scale storages increases. We developed VLSD (Virtual Large Scale Disks) toolkit for constructing virtual disk based distributed storages, which aggregate free spaces of individual disks. VLSD realizes low-cost ...
Flash-Based Storage Deduplication Techniques: A Survey

Exponential growth of the amount of data stored worldwide together with high level of data redundancy motivates the active development of data deduplication techniques. The overall increasing popularity of solid-state drives (SSDs) as primary storage ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data

June 2007

1210 pages

ISBN:9781595936868

DOI:10.1145/1247480

General Chairs:
Lizhu Zhou
Tsinghua University, China
,
Tok Wang Ling
National University of Singapore, Singapore
,
Program Chair:
Beng Chin Ooi
National University of Singapore, Singapore

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2007

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

SIGMOD/PODS07

Sponsor:

SIGMOD/PODS07: International Conference on Management of Data

June 11 - 14, 2007

Beijing, China

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

72
Total Citations
View Citations
52
Total Downloads

Downloads (Last 12 months)4
Downloads (Last 6 weeks)0

Reflects downloads up to 20 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Lee WTkachenko MLauw H(2023)Robust Bidirectional Poly-MatchingIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.326648035:10(10762-10774)Online publication date: 1-Oct-2023
https://doi.org/10.1109/TKDE.2023.3266480
Abelló ACheney J(2023)Eris: efficiently measuring discord in multidimensional sourcesThe VLDB Journal10.1007/s00778-023-00810-333:2(399-423)Online publication date: 20-Sep-2023
https://doi.org/10.1007/s00778-023-00810-3
Li BTian WLi RXiao WFu ZYe XDuan RLi YXu Z(2022)Cross-domain Resemblance Detection based on Meta-learning for Cloud Storage2022 IEEE International Performance, Computing, and Communications Conference (IPCCC)10.1109/IPCCC55026.2022.9894349(374-379)Online publication date: 11-Nov-2022
https://doi.org/10.1109/IPCCC55026.2022.9894349
Lee WTkachenko MLauw H(2021)Robust BiPoly-Matching for Multi-Granular Entities2021 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM51629.2021.00143(1192-1197)Online publication date: Dec-2021
https://doi.org/10.1109/ICDM51629.2021.00143
Orr LBalazinska MSuciu DMaier DPottinger RDoan ATan WAlawini ANgo H(2020)Sample Debiasing in the Themis Open World Database SystemProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3380606(257-268)Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3380606
Zhao CHe Y(2019)Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer LearningThe World Wide Web Conference10.1145/3308558.3313578(2413-2424)Online publication date: 13-May-2019
https://dl.acm.org/doi/10.1145/3308558.3313578
Kaushik R(2018)Deduplication in Data CleaningEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_596(1035-1040)Online publication date: 7-Dec-2018
https://doi.org/10.1007/978-1-4614-8265-9_596
Wang YSong SChen LYu JCheng H(2017)Discovering Conditional Matching RulesACM Transactions on Knowledge Discovery from Data10.1145/307064711:4(1-38)Online publication date: 29-Jun-2017
https://dl.acm.org/doi/10.1145/3070647
Kaushik R(2017)Deduplication in Data CleaningEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_596-2(1-6)Online publication date: 26-Sep-2017
https://doi.org/10.1007/978-1-4899-7993-3_596-2
Dong XSrivastava D(2017)Entity ResolutionEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_2547-2(1-5)Online publication date: 3-Jan-2017
https://doi.org/10.1007/978-1-4899-7993-3_2547-2
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten