skip to main content
10.1145/1247480.1247530acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Leveraging aggregate constraints for deduplication

Published: 11 June 2007 Publication History

Abstract

We show that aggregate constraints (as opposed to pairwise constraints) that often arise when integrating multiple sources of data, can be leveraged to enhance the quality of deduplication. However, despite its appeal, we show that the problem is challenging, both semantically and computationally. We define a restricted search space for deduplication that is intuitive in our context and we solve the problem optimally for the restricted space. Our experiments on real data show that incorporating aggregate constraints significantly enhances the accuracy of deduplication.

References

[1]
The K-Means Clustering Algorithm. http://mathworld.wolfram.com/K-MeansClusteringAlgorithm.html.
[2]
Association for computing machinery. http://www.acm.org.
[3]
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proceedings of the 27th International Conference on Very Large Databases, 2002.
[4]
J. A. Aslam, K. Pelehov, and D. Rus. A practical clustering algorithm for static and dynamic information organization. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1999.
[5]
N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Mach. Learn., 56(1-3):89--113, 2002.
[6]
I. Bhattacharya and L. Getoor. Collective Entity Resolution In Relational Data. In Data Engineering Bulletin, 2006.
[7]
M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML '04: Proceedings of the twenty-first international conference on Machine learning, page 11, New York, NY, USA, 2004. ACM Press.
[8]
D. Bitton and D. DeWitt. Duplicate record elimination in large data files. ACM Transactions on database systems(TODS), 8(2), 1983.
[9]
P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A costbased model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005.
[10]
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the ACM SIGMOD, June 2003.
[11]
S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In In proceedings of the 21st international conference on data engineering (ICDE), Tokyo, Japan, April 2005.
[12]
J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. In Information and Computation, 2005.
[13]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. McGraw Hill, 2001.
[14]
I. Davidson, K. Wagstaff, and S. Basu. Measuring constraint-set utility for partitional clustering algorithms. In PKDD, 2006.
[15]
Dblp. http://www.informatik.uni-trier.de/ ley/db/index.html.
[16]
X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In SIGMOD, 2005.
[17]
A. Elmagarmid, P. G. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. In Information Systems Working Papers, 2006.
[18]
I. P. Felligi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183--1210, 1969.
[19]
M. R. Garey and D. S. Johnson. Computers and Intractability. W. H. Freeman and Company, 1979.
[20]
G. Greco, S. Greco, and E. Zumpano. A logical framework for querying and repairing inconsistent databases. IEEE Trans. Knowl. Data Engineering, 15(6), 2003.
[21]
M. M. Halldorsson. Approximations of weighted independent set and hereditary subset problems. Journal of Graph Algorithms and Applications, 2000.
[22]
M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD, pages 127--138, San Jose, CA, May 1995.
[23]
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[24]
J. Wijsen. Condensed representation of database repairs for consistent query answering. In ICDT, 2003.
[25]
I. Knowledge Partners. Business rules applied. http://www.kpiusa.com.
[26]
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD, 2006.
[27]
Lavastorm. Making the case for automated revenue assurance solutions. http://www.lavastormtech.com.
[28]
A. McCallum, K. Nigam, and L. Unger. Efficient clustering of high-dimensional data sets with applications to reference matching. In Proceedings of the ACM SIGKDD international conference on knowledge discovery in databases, pages 169--178, San Francisco, CA, 2000.
[29]
A. Monge and C. Elkan. An efficient domain independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Arizona, May 1997.
[30]
Trillium Inc. www.trilliumsoft.com/trilliumsoft.nsf.
[31]
A. K. H. Tung, R. T. Ng, L. V. S. Lakshmanan, and J. Han. Constraint-based clustering in large databases. In ICDT, 2001.

Cited By

View all
  • (2023)Robust Bidirectional Poly-MatchingIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.326648035:10(10762-10774)Online publication date: 1-Oct-2023
  • (2023)Eris: efficiently measuring discord in multidimensional sourcesThe VLDB Journal10.1007/s00778-023-00810-333:2(399-423)Online publication date: 20-Sep-2023
  • (2022)Cross-domain Resemblance Detection based on Meta-learning for Cloud Storage2022 IEEE International Performance, Computing, and Communications Conference (IPCCC)10.1109/IPCCC55026.2022.9894349(374-379)Online publication date: 11-Nov-2022
  • Show More Cited By

Index Terms

  1. Leveraging aggregate constraints for deduplication

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
    June 2007
    1210 pages
    ISBN:9781595936868
    DOI:10.1145/1247480
    • General Chairs:
    • Lizhu Zhou,
    • Tok Wang Ling,
    • Program Chair:
    • Beng Chin Ooi
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 June 2007

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. constraint satisfaction
    2. deduplication
    3. entity resolution

    Qualifiers

    • Article

    Conference

    SIGMOD/PODS07
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)4
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 20 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Robust Bidirectional Poly-MatchingIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.326648035:10(10762-10774)Online publication date: 1-Oct-2023
    • (2023)Eris: efficiently measuring discord in multidimensional sourcesThe VLDB Journal10.1007/s00778-023-00810-333:2(399-423)Online publication date: 20-Sep-2023
    • (2022)Cross-domain Resemblance Detection based on Meta-learning for Cloud Storage2022 IEEE International Performance, Computing, and Communications Conference (IPCCC)10.1109/IPCCC55026.2022.9894349(374-379)Online publication date: 11-Nov-2022
    • (2021)Robust BiPoly-Matching for Multi-Granular Entities2021 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM51629.2021.00143(1192-1197)Online publication date: Dec-2021
    • (2020)Sample Debiasing in the Themis Open World Database SystemProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3380606(257-268)Online publication date: 11-Jun-2020
    • (2019)Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer LearningThe World Wide Web Conference10.1145/3308558.3313578(2413-2424)Online publication date: 13-May-2019
    • (2018)Deduplication in Data CleaningEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_596(1035-1040)Online publication date: 7-Dec-2018
    • (2017)Discovering Conditional Matching RulesACM Transactions on Knowledge Discovery from Data10.1145/307064711:4(1-38)Online publication date: 29-Jun-2017
    • (2017)Deduplication in Data CleaningEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_596-2(1-6)Online publication date: 26-Sep-2017
    • (2017)Entity ResolutionEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_2547-2(1-5)Online publication date: 3-Jan-2017
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media