DOI: 10.1145/1816123.1816130

Effective self-training author name disambiguation in scholarly digital libraries

Published: 21 June 2010

Abstract

Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples to distinguish ambiguous author names are among the most effective solutions to the problem, but they require skilled human annotators to label citations manually, in a laborious and continuous process, in order to provide enough training examples. Two issues therefore need to be addressed in such systems: (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only a few examples are available. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need for any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.
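The first step described above, grouping citation records by the similarity of their coauthor names, can be illustrated with a minimal sketch. This is not the SAND implementation: it assumes the simplest possible similarity rule (two records are linked if they share at least one coauthor name string), and the record layout with a `coauthors` key and the function name `cluster_by_coauthors` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def cluster_by_coauthors(records):
    """Group citation records that share at least one coauthor name.

    `records` is a list of dicts with a hypothetical "coauthors" key
    holding a list of normalized coauthor name strings.
    Returns a list of clusters, each a list of record indices.
    Assumption: exact string equality stands in for name similarity.
    """
    # Union-find over record indices; each record starts in its own set.
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the set representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Link every record to the first record seen with the same coauthor.
    first_seen = {}
    for idx, rec in enumerate(records):
        for name in rec["coauthors"]:
            if name in first_seen:
                union(idx, first_seen[name])
            else:
                first_seen[name] = idx

    # Collect record indices by set representative.
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

Under these assumptions, each resulting cluster could serve as a set of automatically labeled examples for one author, and a record with no coauthors stays in a cluster of its own. A realistic system would need a softer name-similarity measure than exact matching.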

      Published In

      JCDL '10: Proceedings of the 10th annual joint conference on Digital libraries
      June 2010
      424 pages
      ISBN:9781450300858
      DOI:10.1145/1816123
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Sponsors

      In-Cooperation

      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. bibliographic citations
      2. name disambiguation

      Qualifiers

      • Research-article

      Conference

      JCDL10: Joint Conference on Digital Libraries
      June 21 - 25, 2010
      Gold Coast, Queensland, Australia

      Acceptance Rates

      Overall Acceptance Rate 415 of 1,482 submissions, 28%

      Cited By

      • (2023) Deep author name disambiguation using DBLP data. International Journal on Digital Libraries 25:3 (431-441). DOI: 10.1007/s00799-023-00361-6. Online publication date: 4-May-2023
      • (2022) Author Classification on Bibliographic Data Using Capsule Networks Architecture. 2022 9th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI) (101-105). DOI: 10.23919/EECSI56542.2022.9946586. Online publication date: 6-Oct-2022
      • (2022) Whois? Deep Author Name Disambiguation Using Bibliographic Data. Linking Theory and Practice of Digital Libraries (201-215). DOI: 10.1007/978-3-031-16802-4_16. Online publication date: 15-Sep-2022
      • (2021) Importance of Name Disambiguation in Scientific Databases. International Journal of Scientific Research in Computer Science, Engineering and Information Technology (509-514). DOI: 10.32628/CSEIT217358. Online publication date: 1-May-2021
      • (2021) Multilayer heuristics based clustering framework (MHCF) for author name disambiguation. Scientometrics 126:9 (7637-7678). DOI: 10.1007/s11192-021-04087-7. Online publication date: 1-Sep-2021
      • (2020) Automatic Disambiguation of Author Names in Bibliographic Repositories. Synthesis Lectures on Information Concepts, Retrieval, and Services 12:1 (1-146). DOI: 10.2200/S01011ED1V01Y202005ICR070. Online publication date: 28-May-2020
      • (2020) An Effective Approach for Automatic Author Name Disambiguation Based on Multiple Strategies. Proceedings of the 3rd International Conference on Computer Science and Software Engineering (169-175). DOI: 10.1145/3403746.3403923. Online publication date: 22-May-2020
      • (2018) Improving the accuracy of the author name disambiguation by using clustering ensemble. Signal and Data Processing 14:4 (117-128). DOI: 10.29252/jsdp.14.4.117. Online publication date: 1-Mar-2018
      • (2018) Effective Unsupervised Author Disambiguation with Relative Frequencies. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (203-212). DOI: 10.1145/3197026.3197036. Online publication date: 23-May-2018
      • (2018) (Automated) literature analysis. Proceedings of the International Workshop on Software Engineering for Science (20-27). DOI: 10.1145/3194747.3194748. Online publication date: 2-Jun-2018
