Structure-based inference of XML similarity for fuzzy duplicate detection
Source
Conference on Information and Knowledge Management
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management
Lisbon, Portugal
SESSION: Record linkage and approximate matching (DB)
Pages 293-302  
Year of Publication: 2007
ISBN:978-1-59593-803-9
Authors
Luís Leitão  Instituto Superior Técnico, Lisbon, Portugal
Pável Calado  Instituto Superior Técnico, Lisbon, Portugal
Melanie Weis  Hasso Plattner Institut, Potsdam, Germany
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1321440.1321483
ABSTRACT

Fuzzy duplicate detection aims at identifying multiple representations of real-world objects stored in a data source, and is a task of critical practical relevance in data cleaning, data mining, and data integration. It has a long history for relational data stored in a single table (or in multiple tables with equal schema). Algorithms for fuzzy duplicate detection in more complex structures, e.g., hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms use similarity measures that consider the duplicate status of an object's direct neighbors, e.g., children in hierarchical data, to improve duplicate detection effectiveness. In this paper, we propose a novel method for fuzzy duplicate detection in hierarchical and semi-structured XML data. Unlike previous approaches, it considers not only the duplicate status of children, but also the probability of descendants being duplicates. Probabilities are computed efficiently using a Bayesian network. Experiments show that the proposed algorithm maintains high precision and recall, even when dealing with data containing a large number of errors and missing information. Our proposal also outperforms a state-of-the-art duplicate detection system on three different XML databases.
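
The idea of letting evidence from descendants flow up to their ancestors can be illustrated with a small recursive sketch. The code below is not the Bayesian network model proposed in the paper; it is a simplified stand-in that averages string similarities and child scores. The function names (text_similarity, dup_score), the use of difflib for string comparison, the averaging rule, and the neutral fallback of 0.5 are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Return a similarity in [0, 1] for two strings, or None if either is empty."""
    a, b = (a or "").strip(), (b or "").strip()
    if not a or not b:
        return None  # missing information contributes no evidence
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dup_score(x, y):
    """Hypothetical score in [0, 1] that elements x and y describe the same object."""
    if x.tag != y.tag:
        return 0.0
    evidence = []
    own = text_similarity(x.text, y.text)
    if own is not None:
        evidence.append(own)
    # Compare children tag by tag; tags present on only one side are skipped.
    for tag in {c.tag for c in x} & {c.tag for c in y}:
        xs, ys = x.findall(tag), y.findall(tag)
        # Each child of x is paired with its best-scoring counterpart in y,
        # so scores computed for deeper descendants propagate upward.
        best = [max(dup_score(cx, cy) for cy in ys) for cx in xs]
        evidence.append(sum(best) / len(best))
    # With no evidence at all, fall back to a neutral value (arbitrary choice).
    return sum(evidence) / len(evidence) if evidence else 0.5


if __name__ == "__main__":
    a = ET.fromstring("<movie><title>Alien</title><cast><actor>S. Weaver</actor></cast></movie>")
    b = ET.fromstring("<movie><title>Aliens</title><cast><actor>Sigourney Weaver</actor></cast></movie>")
    print(round(dup_score(a, b), 3))
```

A pair of elements would then be flagged as a candidate duplicate when the score exceeds some threshold chosen for the data set; that threshold is likewise an assumption and not a value taken from the paper.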

