research-article

Exploiting context analysis for combining multiple entity resolution systems

Authors:
Zhaoqi Chen (Microsoft Corporation, Redmond, USA),
Dmitri V. Kalashnikov (University of California, Irvine, Irvine, CA, USA),
Sharad Mehrotra (University of California, Irvine, Irvine, CA, USA)

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, June 2009, Pages 207–218, https://doi.org/10.1145/1559845.1559869

Published: 29 June 2009

ABSTRACT

Entity Resolution (ER) is an important real-world problem that has attracted significant research interest over the past few years. It deals with determining which object descriptions in a dataset co-refer. Due to its practical significance for data mining and data analysis tasks, many different ER approaches have been developed to address the ER challenge. This paper proposes a new ER Ensemble framework. The task of ER Ensemble is to combine the results of multiple base-level ER systems into a single solution with the goal of increasing the quality of ER. The framework proposed in this paper leverages the observation that often no single ER method consistently outperforms the other ER techniques in terms of quality. Instead, different ER solutions perform better in different contexts. The framework employs two novel combining approaches, which are based on supervised learning. The two approaches learn a mapping of the clustering decisions of the base-level ER systems, together with the local context, into a combined clustering decision. The paper empirically studies the framework by applying it to different domains. The experiments demonstrate that the proposed framework achieves significantly higher disambiguation quality compared to the current state-of-the-art solutions.
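The combining idea in the abstract can be illustrated with a small sketch. It is not the paper's method: the abstract does not specify the features, the learners, or how the final clusters are formed, so everything below is an assumption made for illustration. Here each base-level system casts a 0/1 merge vote for a pair of records, the "local context" is reduced to a hypothetical bucket label such as `"sparse"` or `"dense"`, the learner simply estimates each system's accuracy per context from labeled pairs, and the final clusters are taken as the transitive closure of the predicted merges. The names `train`, `combine`, and `cluster` are hypothetical.

```python
from collections import defaultdict


def train(examples):
    """Estimate per-context accuracy of each base system.

    examples: iterable of (decisions, context, label), where decisions is a
    tuple of 0/1 merge votes (one per base system), context is a hashable
    bucket label (an assumed stand-in for "local context"), and label is the
    ground-truth 0/1 merge decision for the pair.
    """
    # counts[context][system] = [correct, total], with Laplace smoothing
    counts = defaultdict(lambda: defaultdict(lambda: [1, 2]))
    for decisions, context, label in examples:
        for i, vote in enumerate(decisions):
            stats = counts[context][i]
            stats[0] += int(vote == label)
            stats[1] += 1
    return {
        ctx: {i: correct / total for i, (correct, total) in per_sys.items()}
        for ctx, per_sys in counts.items()
    }


def combine(model, decisions, context):
    """Context-weighted vote: return 1 (merge) or 0 (do not merge)."""
    accuracy = model.get(context, {})
    score = 0.0
    for i, vote in enumerate(decisions):
        # Weight is accuracy above chance; unseen systems/contexts get 0.
        weight = accuracy.get(i, 0.5) - 0.5
        score += weight if vote == 1 else -weight
    return 1 if score > 0 else 0


def cluster(n, merged_pairs):
    """Group records 0..n-1 by transitive closure of merged pairs (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in merged_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for x in range(n):
        groups[find(x)].append(x)
    return sorted(groups.values())


# Toy training data: system 0 is reliable in "sparse" contexts,
# system 1 is reliable in "dense" contexts.
examples = [
    ((1, 0), "sparse", 1), ((0, 1), "sparse", 0), ((1, 0), "sparse", 1),
    ((0, 1), "dense", 1), ((1, 0), "dense", 0), ((0, 1), "dense", 1),
]
model = train(examples)
```

With this toy model, the same pair of votes `(1, 0)` is resolved differently depending on the context bucket, which is the behavior the abstract motivates: the combiner trusts whichever base system has been more accurate in that context. A real system would replace the bucket label with richer context features and the accuracy table with a trained classifier.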

Index Terms

Exploiting context analysis for combining multiple entity resolution systems
1. Information systems
  1. Data management systems

Published in
SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
June 2009
1168 pages
ISBN:9781605585512
DOI:10.1145/1559845
Editors: Carsten Binnig, Benoit Dageville
General Chairs: Uğur Çetintemel (Brown University, USA), Stan Zdonik (Brown University, USA)
Program Chair: Donald Kossmann (ETH Zurich, Switzerland)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
context analysis
entity resolution
er ensemble
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 62
- Total Downloads: 818
- Downloads (Last 12 months): 7
- Downloads (Last 6 weeks): 3