skip to main content
10.1145/2396761.2398403acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

An automatic blocking mechanism for large-scale de-duplication tasks

Published: 29 October 2012 Publication History

Abstract

De-duplication - identification of distinct records referring to the same real-world entity - is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, blocking has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off recall of identified duplicates for efficiency. Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records. However, with the presence of large, diverse sets of structured data on the web and the need to organize it effectively on content portals, de-duplication systems need to scale in a new dimension to handle a large number of schemas, tasks and data sets, while handling ever larger problem sizes. In addition, when working in a map-reduce framework it is important that canopy formation be implemented as a hash function, making the canopy design problem more challenging. We present CBLOCK, a system that addresses these challenges.
CBLOCK learns hash functions automatically from attribute domains and a labeled dataset consisting of duplicates. Subsequently, CBLOCK expresses blocking functions using a hierarchical tree structure composed of atomic hash functions. The application may guide the automated blocking process based on architectural constraints, such as by specifying a maximum size of each block (based on memory requirements), impose disjointness of blocks (in a grid environment), or specify a particular objective function trading off recall for efficiency. As a post-processing step to automatically generated blocks, CBLOCK rolls-up smaller blocks to increase recall. We present experimental results on two large-scale de-duplication datasets from a commercial search engine - consisting of over 140K movies and 40K restaurants respectively - and demonstrate the utility of CBLOCK.

References

[1]
DBPedia. http://dbpedia.org/.
[2]
IMDb: The Internet Movie Database. http://www.imdb.com.
[3]
MAHOUT. http://cwiki.apache.org/MAHOUT/canopy-clustering.html.
[4]
R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In KDD, 2003.
[5]
O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. VLDB Journal, 18(1), 2009.
[6]
O. Benjelloun, H. Molina, H. Gong, H. Kawai, T. E. Larson, D. Menestrina, and S. Thavisomboon. D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution. Technical Report, Stanford University, 2007.
[7]
M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale up record linkage and clustering. In ICDM, 2006.
[8]
P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. TKDE, 2011.
[9]
P. Christen and T. C. Febrl. Freely extensible biomedical record linkage manual. http://datamining.anu.edu.au/linkage.html, 2, 2003.
[10]
G. B. Dantzig. Discrete-variable extremum problems. Operations Research, 5(2), 1957.
[11]
A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE TKDE, 2007.
[12]
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64(2283), 1969.
[13]
M. R. Garey and D. S. Johnson. Computers and Intractability. W. H. Freeman and Company, 1979.
[14]
M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995.
[15]
Y. E. Ioannidis. The history of histograms (abridged). In VLDB, 2003.
[16]
M. A. Jaro. Advances in record linkage methodology as applied to matching the 1985 census of tampa. JASA, 84, 1989.
[17]
L. Jin, C. Li, and S. Mehrotra. Efficient record linkage in large data sets. In DASFAA, 2003.
[18]
R. P. Kelley. Advances in record linkage methodology: a method for determining the best blocking strategy. Record Linkage Techniques, 1985.
[19]
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. Tutorial at SIGMOD, 2006.
[20]
A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD, 2000.
[21]
M. Michelson and C. A. Knoblock. Learning blocking schemes for record linkage. In AAAI, 2006.
[22]
A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, 2011.
[23]
S. Rendle and L. Schmidt-Thieme. Scaling record linkage to non-uniform distributed class sizes. In PAKDD, 2008.
[24]
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002.
[25]
H. sik Kim and D. Lee. Parallel linkage. In CIKM, 2007.
[26]
R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD, 2010.
[27]
S. E. F. W. Cohen, P. Ravikumar. A comparison of string distance metrics for name-matching tasks. In Proc. of IJCAI, 2003.
[28]
S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In SIGMOD, 2009.
[29]
W. Winkler. Overview of record linkage and current research directions. Statistical Research Division, U.S. Bureau of the Census, Tech. Report, 2006.

Cited By

View all
  • (2024)Evaluating Blocking Biases in Entity Matching2024 IEEE International Conference on Big Data (BigData)10.1109/BigData62323.2024.10825531(64-73)Online publication date: 15-Dec-2024
  • (2022)JointMatcher: Numerically-aware entity matching using pre-trained language models with attention concentrationKnowledge-Based Systems10.1016/j.knosys.2022.109033251(109033)Online publication date: Sep-2022
  • (2021)Deep learning for blocking in entity matchingProceedings of the VLDB Endowment10.14778/3476249.347629414:11(2459-2472)Online publication date: 27-Oct-2021
  • Show More Cited By

Index Terms

  1. An automatic blocking mechanism for large-scale de-duplication tasks

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
    October 2012
    2840 pages
    ISBN:9781450311564
    DOI:10.1145/2396761
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 October 2012

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. blocking
    2. canopy formation
    3. de-duplication

    Qualifiers

    • Research-article

    Conference

    CIKM'12
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Upcoming Conference

    CIKM '25

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)6
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 08 Mar 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Evaluating Blocking Biases in Entity Matching2024 IEEE International Conference on Big Data (BigData)10.1109/BigData62323.2024.10825531(64-73)Online publication date: 15-Dec-2024
    • (2022)JointMatcher: Numerically-aware entity matching using pre-trained language models with attention concentrationKnowledge-Based Systems10.1016/j.knosys.2022.109033251(109033)Online publication date: Sep-2022
    • (2021)Deep learning for blocking in entity matchingProceedings of the VLDB Endowment10.14778/3476249.347629414:11(2459-2472)Online publication date: 27-Oct-2021
    • (2021)BUBBLE : A Quality-Aware Human-in-the-loop Entity Matching Framework2021 IEEE International Conference on Big Data (Big Data)10.1109/BigData52589.2021.9672002(3557-3565)Online publication date: 15-Dec-2021
    • (2019)Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing2019 IEEE 35th International Conference on Data Engineering (ICDE)10.1109/ICDE.2019.00181(1718-1721)Online publication date: Apr-2019
    • (2018)Mobile Access Record Resolution on Large-Scale Identifier-Linkage GraphsProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining10.1145/3219819.3219916(886-894)Online publication date: 19-Jul-2018
    • (2017)Human-in-the-loop data integrationProceedings of the VLDB Endowment10.14778/3137765.313783310:12(2006-2017)Online publication date: 1-Aug-2017
    • (2017)FalconProceedings of the 2017 ACM International Conference on Management of Data10.1145/3035918.3035960(1431-1446)Online publication date: 9-May-2017
    • (2016)MagellanProceedings of the VLDB Endowment10.14778/2994509.29945359:12(1197-1208)Online publication date: 1-Aug-2016
    • (2015)A Clustering-Based Framework to Control Block Sizes for Entity ResolutionProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining10.1145/2783258.2783396(279-288)Online publication date: 10-Aug-2015
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media