Learning to deduplicate
Source: International Conference on Digital Libraries
Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries
Chapel Hill, NC, USA
SESSION: Named entities 1
Pages: 41 - 50  
Year of Publication: 2006
ISBN: 1-59593-354-9
Authors
Moisés G. de Carvalho  Federal University of Minas Gerais, Belo Horizonte, Brazil
Marcos André Gonçalves  Federal University of Minas Gerais, Belo Horizonte, Brazil
Alberto H. F. Laender  Federal University of Minas Gerais, Belo Horizonte, Brazil
Altigran S. da Silva  Federal University of Amazonas, Manaus, Brazil
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 100,   Citation Count: 1
DOI: http://doi.acm.org/10.1145/1141753.1141760

ABSTRACT

Identifying record replicas in Digital Libraries and other types of digital repositories is fundamental to improving the quality of their content and services, as well as to enabling eventual sharing efforts. Several deduplication strategies are available, but most of them rely on manually chosen settings to combine the evidence used to identify records as replicas. In this paper, we present the results of experiments carried out with a novel Machine Learning approach that we have proposed for the deduplication problem. This approach, based on Genetic Programming (GP), automatically generates similarity functions to identify record replicas in a given repository. The generated similarity functions combine and weight the best evidence available among the record fields in order to tell when two distinct records represent the same real-world entity. The results of the experiments show that our approach outperforms the baseline method by Fellegi and Sunter by more than 12% when identifying replicas in a data set containing researchers' personal data, and by more than 7% in a data set with article citation data.
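
To make the idea of a generated similarity function concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a candidate function is an expression tree over per-field similarity scores and that a pair of records is flagged as a replica when the tree's value reaches a threshold. The operator set, the use of `difflib` for string similarity, the field names, and the threshold are all illustrative assumptions; the abstract does not specify any of them, and the GP search that would evolve such trees is omitted.

```python
from difflib import SequenceMatcher

# Assumed operator set for tree nodes (hypothetical; not taken from the paper).
OPS = {
    "+": lambda x, y: x + y,
    "*": lambda x, y: x * y,
    "max": max,
}


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values (illustrative choice: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def evidence(rec_a, rec_b, fields):
    """Compute one similarity score per field for a pair of records."""
    return {f: field_similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in fields}


def evaluate(tree, ev):
    """Evaluate an expression tree against the evidence dictionary.

    A tree is either a tuple (op, left, right), a field name (string leaf),
    or a numeric constant acting as a weight.
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, ev), evaluate(right, ev))
    if isinstance(tree, str):
        return ev[tree]
    return float(tree)


def is_replica(tree, rec_a, rec_b, fields, threshold):
    """Flag the pair as a replica when the combined score reaches the threshold."""
    return evaluate(tree, evidence(rec_a, rec_b, fields)) >= threshold


# A hypothetical individual: the name evidence is weighted twice as much as the email evidence.
candidate = ("+", ("*", 2.0, "name"), "email")

# Hypothetical records, invented for illustration only.
r1 = {"name": "John A. Smith", "email": "jsmith@example.org"}
r2 = {"name": "J. A. Smith", "email": "jsmith@example.org"}

score = evaluate(candidate, evidence(r1, r2, ["name", "email"]))
print(f"combined score: {score:.3f}")
print("replica:", is_replica(candidate, r1, r2, ["name", "email"], threshold=2.0))
```

In a GP setting, many such trees would be generated, scored against labelled replica pairs, and recombined over generations; the sketch shows only how a single candidate could be represented and applied.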


REFERENCES


 
10. Freely Extensible Biomedical Record Linkage. http://sourceforge.net/projects/febrl.
11. Fellegi, I. P., and Sunter, A. B. A theory for record linkage. Journal of the American Statistical Association 64, 328 (1969), 1183--1210.
12. Guha, S., Koudas, N., Marathe, A., and Srivastava, D. Merging the results of approximate match operations. In Proc. of VLDB (2004), pp. 636--647.

