research-article

An automatic blocking mechanism for large-scale de-duplication tasks

Authors:

Anish Das Sarma,

Ashwin Machanavajjhala,

Philip BohannonAuthors Info & Claims

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

Pages 1055 - 1064

https://doi.org/10.1145/2396761.2398403

Published: 29 October 2012 Publication History

Abstract

De-duplication - identification of distinct records referring to the same real-world entity - is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, blocking has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off recall of identified duplicates for efficiency. Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records. However, with the presence of large, diverse sets of structured data on the web and the need to organize it effectively on content portals, de-duplication systems need to scale in a new dimension to handle a large number of schemas, tasks and data sets, while handling ever larger problem sizes. In addition, when working in a map-reduce framework it is important that canopy formation be implemented as a hash function, making the canopy design problem more challenging. We present CBLOCK, a system that addresses these challenges.

CBLOCK learns hash functions automatically from attribute domains and a labeled dataset consisting of duplicates. Subsequently, CBLOCK expresses blocking functions using a hierarchical tree structure composed of atomic hash functions. The application may guide the automated blocking process based on architectural constraints, such as by specifying a maximum size of each block (based on memory requirements), impose disjointness of blocks (in a grid environment), or specify a particular objective function trading off recall for efficiency. As a post-processing step to automatically generated blocks, CBLOCK rolls-up smaller blocks to increase recall. We present experimental results on two large-scale de-duplication datasets from a commercial search engine - consisting of over 140K movies and 40K restaurants respectively - and demonstrate the utility of CBLOCK.

References

[1]

DBPedia. http://dbpedia.org/.

[2]

IMDb: The Internet Movie Database. http://www.imdb.com.

[3]

MAHOUT. http://cwiki.apache.org/MAHOUT/canopy-clustering.html.

[4]

R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In KDD, 2003.

[5]

O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. VLDB Journal, 18(1), 2009.

Digital Library

[6]

O. Benjelloun, H. Molina, H. Gong, H. Kawai, T. E. Larson, D. Menestrina, and S. Thavisomboon. D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution. Technical Report, Stanford University, 2007.

Digital Library

[7]

M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale up record linkage and clustering. In ICDM, 2006.

Digital Library

[8]

P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. TKDE, 2011.

Digital Library

[9]

P. Christen and T. C. Febrl. Freely extensible biomedical record linkage manual. http://datamining.anu.edu.au/linkage.html, 2, 2003.

[10]

G. B. Dantzig. Discrete-variable extremum problems. Operations Research, 5(2), 1957.

[11]

A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE TKDE, 2007.

Digital Library

[12]

I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64(2283), 1969.

[13]

M. R. Garey and D. S. Johnson. Computers and Intractability. W. H. Freeman and Company, 1979.

Digital Library

[14]

M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995.

Digital Library

[15]

Y. E. Ioannidis. The history of histograms (abridged). In VLDB, 2003.

Digital Library

[16]

M. A. Jaro. Advances in record linkage methodology as applied to matching the 1985 census of tampa. JASA, 84, 1989.

[17]

L. Jin, C. Li, and S. Mehrotra. Efficient record linkage in large data sets. In DASFAA, 2003.

Digital Library

[18]

R. P. Kelley. Advances in record linkage methodology: a method for determining the best blocking strategy. Record Linkage Techniques, 1985.

[19]

N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. Tutorial at SIGMOD, 2006.

Digital Library

[20]

A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD, 2000.

Digital Library

[21]

M. Michelson and C. A. Knoblock. Learning blocking schemes for record linkage. In AAAI, 2006.

Digital Library

[22]

A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, 2011.

Digital Library

[23]

S. Rendle and L. Schmidt-Thieme. Scaling record linkage to non-uniform distributed class sizes. In PAKDD, 2008.

Digital Library

[24]

S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002.

Digital Library

[25]

H. sik Kim and D. Lee. Parallel linkage. In CIKM, 2007.

Digital Library

[26]

R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD, 2010.

Digital Library

[27]

S. E. F. W. Cohen, P. Ravikumar. A comparison of string distance metrics for name-matching tasks. In Proc. of IJCAI, 2003.

[28]

S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In SIGMOD, 2009.

Digital Library

[29]

W. Winkler. Overview of record linkage and current research directions. Statistical Research Division, U.S. Bureau of the Census, Tech. Report, 2006.

Cited By

Moslemi MBalamurugan HMilani M(2024)Evaluating Blocking Biases in Entity Matching2024 IEEE International Conference on Big Data (BigData)10.1109/BigData62323.2024.10825531(64-73)Online publication date: 15-Dec-2024
https://doi.org/10.1109/BigData62323.2024.10825531
Ye CJiang SZhang HWu YShi JWang HDai G(2022)JointMatcher: Numerically-aware entity matching using pre-trained language models with attention concentrationKnowledge-Based Systems10.1016/j.knosys.2022.109033251(109033)Online publication date: Sep-2022
https://doi.org/10.1016/j.knosys.2022.109033
Thirumuruganathan SLi HTang NOuzzani MGovind YPaulsen DFung GDoan A(2021)Deep learning for blocking in entity matchingProceedings of the VLDB Endowment10.14778/3476249.347629414:11(2459-2472)Online publication date: 27-Oct-2021
https://dl.acm.org/doi/10.14778/3476249.3476294
Show More Cited By

Index Terms

An automatic blocking mechanism for large-scale de-duplication tasks
1. Information systems

Recommendations

Entity resolution with iterative blocking
SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. ...
Locality sensitive blocking (LSB): A robust blocking technique for data deduplication

Data deduplication is process of discovering multiple representations of same entity in an information system. Blocking has been a benchmark technique for avoiding the pair-wise record comparisons in data deduplication. Standard blocking (SB) aims at ...
High-Value Token-Blocking: Efficient Blocking Method for Record Linkage
Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

October 2012

2840 pages

ISBN:9781450311564

DOI:10.1145/2396761

General Chair:
Xuewen Chen
Wayne State University, USA
,
Program Chairs:
Guy Lebanon
Georgia Institute of Technology
,
Haixun Wang
Microsoft Research Asia
,
Mohammed J. Zaki
Rensselaer Polytechnic Institute

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 October 2012

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

CIKM'12

Sponsor:

CIKM'12: 21st ACM International Conference on Information and Knowledge Management

October 29 - November 2, 2012

Hawaii, Maui, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Sponsor:
sigir
sigir

The 34th ACM International Conference on Information and Knowledge Management

November 10 - 14, 2025

Seoul , Republic of Korea

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

17
Total Citations
View Citations
423
Total Downloads

Downloads (Last 12 months)6
Downloads (Last 6 weeks)0

Reflects downloads up to 08 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Moslemi MBalamurugan HMilani M(2024)Evaluating Blocking Biases in Entity Matching2024 IEEE International Conference on Big Data (BigData)10.1109/BigData62323.2024.10825531(64-73)Online publication date: 15-Dec-2024
https://doi.org/10.1109/BigData62323.2024.10825531
Ye CJiang SZhang HWu YShi JWang HDai G(2022)JointMatcher: Numerically-aware entity matching using pre-trained language models with attention concentrationKnowledge-Based Systems10.1016/j.knosys.2022.109033251(109033)Online publication date: Sep-2022
https://doi.org/10.1016/j.knosys.2022.109033
Thirumuruganathan SLi HTang NOuzzani MGovind YPaulsen DFung GDoan A(2021)Deep learning for blocking in entity matchingProceedings of the VLDB Endowment10.14778/3476249.347629414:11(2459-2472)Online publication date: 27-Oct-2021
https://dl.acm.org/doi/10.14778/3476249.3476294
Osawa NIto HFukushima YHarada TMorishima A(2021)BUBBLE : A Quality-Aware Human-in-the-loop Entity Matching Framework2021 IEEE International Conference on Big Data (Big Data)10.1109/BigData52589.2021.9672002(3557-3565)Online publication date: 15-Dec-2021
https://doi.org/10.1109/BigData52589.2021.9672002
Verroios VGarcia-Molina H(2019)Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing2019 IEEE 35th International Conference on Data Engineering (ICDE)10.1109/ICDE.2019.00181(1718-1721)Online publication date: Apr-2019
https://doi.org/10.1109/ICDE.2019.00181
Xin SYang HXian WEster MBu JWang ZWang CGuo YFarooq F(2018)Mobile Access Record Resolution on Large-Scale Identifier-Linkage GraphsProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining10.1145/3219819.3219916(886-894)Online publication date: 19-Jul-2018
https://dl.acm.org/doi/10.1145/3219819.3219916
Li G(2017)Human-in-the-loop data integrationProceedings of the VLDB Endowment10.14778/3137765.313783310:12(2006-2017)Online publication date: 1-Aug-2017
https://dl.acm.org/doi/10.14778/3137765.3137833
Das SG.C. PDoan ANaughton JKrishnan GDeep RArcaute ERaghavendra VPark YChirkova RYang JSuciu D(2017)FalconProceedings of the 2017 ACM International Conference on Management of Data10.1145/3035918.3035960(1431-1446)Online publication date: 9-May-2017
https://dl.acm.org/doi/10.1145/3035918.3035960
Konda PDas SSuganthan G. C. PDoan AArdalan ABallard JLi HPanahi FZhang HNaughton JPrasad SKrishnan GDeep RRaghavendra V(2016)MagellanProceedings of the VLDB Endowment10.14778/2994509.29945359:12(1197-1208)Online publication date: 1-Aug-2016
https://dl.acm.org/doi/10.14778/2994509.2994535
Fisher JChristen PWang QRahm ECao LZhang CJoachims TWebb GMargineantu DWilliams G(2015)A Clustering-Based Framework to Control Block Sizes for Entity ResolutionProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining10.1145/2783258.2783396(279-288)Online publication date: 10-Aug-2015
https://dl.acm.org/doi/10.1145/2783258.2783396
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten