ABSTRACT
Entity Resolution (ER) is an important real-world problem that has attracted significant research interest over the past few years. It deals with determining which object descriptions co-refer in a dataset. Due to its practical significance for data mining and data analysis tasks, many different ER approaches have been developed to address the ER challenge. This paper proposes a new ER Ensemble framework. The task of ER Ensemble is to combine the results of multiple base-level ER systems into a single solution with the goal of increasing the quality of ER. The framework proposed in this paper leverages the observation that no single ER method consistently outperforms all other ER techniques in terms of quality. Instead, different ER solutions perform better in different contexts. The framework employs two novel combining approaches, which are based on supervised learning. The two approaches learn a mapping of the clustering decisions of the base-level ER systems, together with the local context, into a combined clustering decision. The paper empirically studies the framework by applying it to different domains. The experiments demonstrate that the proposed framework achieves significantly higher disambiguation quality compared to the current state-of-the-art solutions.
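The idea of learning a mapping from base-level decisions plus local context to a combined decision can be illustrated with a small sketch. This is not the paper's method: the abstract does not specify the two combining approaches, so the code below swaps in a simple decision-table learner as a stand-in. The function names (`train_combiner`, `predict`, `cluster`), the context labels ("sparse", "dense"), and the toy data are all hypothetical, and the majority-vote fallback and transitive-closure clustering are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_combiner(examples):
    """Learn a decision table from labeled pairs (an illustrative stand-in learner).

    Each example is (decisions, context, label): the 0/1 merge decisions of the
    base ER systems for a record pair, a context label, and the true 0/1 answer.
    """
    counts = defaultdict(Counter)
    for decisions, context, label in examples:
        counts[(tuple(decisions), context)][label] += 1
    # For every observed (decisions, context) combination, keep the majority label.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def predict(model, decisions, context):
    """Combined merge decision; falls back to a strict majority vote if unseen."""
    key = (tuple(decisions), context)
    if key in model:
        return model[key]
    return int(sum(decisions) * 2 > len(decisions))

def cluster(n, pairs, model):
    """Turn combined pairwise decisions into clusters via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), (decisions, context) in pairs.items():
        if predict(model, decisions, context):
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for x in range(n):
        groups[find(x)].append(x)
    return sorted(groups.values())

# Hypothetical training data: system 0 is reliable in "sparse" contexts,
# system 1 is reliable in "dense" contexts.
train = [
    ((1, 0), "sparse", 1), ((0, 1), "sparse", 0),
    ((1, 0), "dense", 0), ((0, 1), "dense", 1),
    ((1, 1), "sparse", 1), ((0, 0), "dense", 0),
]
model = train_combiner(train)

# Hypothetical record pairs to resolve: (i, j) -> (base decisions, context).
pairs = {
    (0, 1): ((1, 0), "sparse"),
    (1, 2): ((1, 0), "dense"),
    (2, 3): ((0, 1), "dense"),
}
clusters = cluster(4, pairs, model)
print(clusters)
```

In this toy setting the same base-level decision pattern `(1, 0)` leads to a merge in one context and a non-merge in another, which is the kind of context dependence the abstract describes. A real combiner would presumably use richer context features and a stronger learner.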
Index Terms
- Exploiting context analysis for combining multiple entity resolution systems