research-article

Effective self-training author name disambiguation in scholarly digital libraries

Authors: Anderson A. Ferreira, Adriano Veloso, Marcos André Gonçalves, Alberto H.F. Laender

JCDL '10: Proceedings of the 10th annual joint conference on Digital libraries

Pages 39 - 48

https://doi.org/10.1145/1816123.1816130

Published: 21 June 2010

Abstract

Name ambiguity in bibliographic citation records is a hard problem that degrades the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples to distinguish ambiguous author names are among the most effective solutions, but they require skilled human annotators to label citations manually, a laborious and continuous process, in order to provide enough training examples. Two issues are therefore pressing for such systems: (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only a few examples are available. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that addresses both issues. The first step eliminates the need for any manual labeling effort by automatically acquiring examples with a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title, and publication venue), demonstrate that the proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.
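The two steps described in the abstract can be illustrated with a minimal sketch. It is not the SAND implementation: the record fields (`id`, `coauthors`), the function names, and the `min_overlap` threshold are all hypothetical. Step one is approximated by grouping records that share at least one coauthor name (transitively, via union-find), and the supervised classifier of step two is replaced by a plain token-overlap vote that returns `None` when no cluster matches well enough, standing in for the detection of an unseen author.

```python
from collections import defaultdict


def cluster_by_coauthors(records):
    """Step 1 (sketch): group record ids that share a coauthor name.

    Each record is a dict with hypothetical keys "id" and "coauthors".
    Sharing is transitive: if A shares a coauthor with B and B with C,
    all three land in the same cluster.
    """
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # normalized coauthor name -> first record id using it
    for r in records:
        for name in r["coauthors"]:
            key = name.strip().lower()
            if key in first_seen:
                union(first_seen[key], r["id"])
            else:
                first_seen[key] = r["id"]

    clusters = defaultdict(list)
    for r in records:
        clusters[find(r["id"])].append(r["id"])
    return sorted(sorted(ids) for ids in clusters.values())


def assign_or_flag_new(tokens, cluster_tokens, min_overlap=2):
    """Step 2 (sketch): pick the cluster whose vocabulary overlaps most
    with the record's title/venue tokens, or return None (treated here
    as "unseen author") when the best overlap is below min_overlap.
    """
    best_label, best_score = None, 0
    for label, vocab in cluster_tokens.items():
        score = len(set(tokens) & set(vocab))
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= min_overlap else None
```

In this sketch the clusters produced by the first function would supply the automatically acquired training examples, and their title and venue words would populate `cluster_tokens` for the second. A record with no coauthors stays in its own singleton cluster.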



Index Terms

Effective self-training author name disambiguation in scholarly digital libraries
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval


Published In


JCDL '10: Proceedings of the 10th annual joint conference on Digital libraries

June 2010

424 pages

ISBN: 9781450300858

DOI: 10.1145/1816123

General Chair: Jane Hunter (The University of Queensland, Australia)

Program Chairs: Carl Lagoze (Cornell University, USA), Lee Giles (Pennsylvania State University, USA), Yuan-Fang Li (The University of Queensland, Australia)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States



Conference

JCDL10: Joint Conference on Digital Libraries

June 21-25, 2010

Gold Coast, Queensland, Australia

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%
