DOI: 10.1145/1559845.1559869
research-article

Exploiting context analysis for combining multiple entity resolution systems

Published: 29 June 2009

ABSTRACT

Entity Resolution (ER) is an important real-world problem that has attracted significant research interest over the past few years. It deals with determining which object descriptions co-refer in a dataset. Due to its practical significance for data mining and data analysis tasks, many different ER approaches have been developed to address the ER challenge. This paper proposes a new ER Ensemble framework. The task of ER Ensemble is to combine the results of multiple base-level ER systems into a single solution with the goal of increasing the quality of ER. The framework proposed in this paper leverages the observation that often no single ER method consistently outperforms the other ER techniques in terms of quality. Instead, different ER solutions perform better in different contexts. The framework employs two novel combining approaches, which are based on supervised learning. The two approaches learn a mapping of the clustering decisions of the base-level ER systems, together with the local context, into a combined clustering decision. The paper empirically studies the framework by applying it to different domains. The experiments demonstrate that the proposed framework achieves significantly higher disambiguation quality compared to the current state-of-the-art solutions.
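
The combining idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the features, the learners, or how pairwise decisions become clusters. The sketch assumes that each base-level system casts a 0/1 merge vote per pair of descriptions, that the local context is reduced to a discrete bucket label, that the learned mapping is a per-context accuracy-weighted vote, and that final clusters are obtained by transitive closure. The names `train`, `predict`, and `cluster` are hypothetical.

```python
from collections import defaultdict


def train(examples):
    """Estimate how reliable each base system is within each context.

    examples: iterable of (context, decisions, label), where
      context   - hashable bucket label (an assumed simplification),
      decisions - tuple of 0/1 merge votes, one per base-level system,
      label     - ground-truth 0/1 (the pair co-refers or not).
    Returns {(context, system_index): smoothed accuracy}.
    """
    stats = defaultdict(lambda: [1, 2])  # Laplace prior: 1 hit in 2 trials
    for context, decisions, label in examples:
        for i, vote in enumerate(decisions):
            stats[(context, i)][1] += 1
            if vote == label:
                stats[(context, i)][0] += 1
    return {key: hits / trials for key, (hits, trials) in stats.items()}


def predict(model, context, decisions):
    """Combine base votes, weighting each system by its accuracy in this context."""
    score = 0.0
    for i, vote in enumerate(decisions):
        weight = model.get((context, i), 0.5) - 0.5  # unseen pairs get weight 0
        score += weight * (1 if vote else -1)
    return score > 0


def cluster(n, merged_pairs):
    """Turn pairwise merge decisions over items 0..n-1 into clusters (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in merged_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy training data: system 0 is right in context "A", system 1 in context "B".
training = 3 * [
    ("A", (1, 0), 1), ("A", (0, 1), 0),
    ("B", (1, 0), 0), ("B", (0, 1), 1),
]
model = train(training)
```

With this toy data, the same pair of votes `(1, 0)` is resolved as a merge in context `"A"` and as a non-merge in context `"B"`, which is the behavior a plain majority vote cannot express.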


    • Published in

      SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
      June 2009
      1168 pages
      ISBN: 9781605585512
      DOI: 10.1145/1559845

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
