ABSTRACT
Fuzzy duplicate detection aims to identify multiple representations of the same real-world object stored in a data source, and is a task of critical practical relevance in data cleaning, data mining, and data integration. It has a long history for relational data stored in a single table (or in multiple tables with the same schema). Algorithms for fuzzy duplicate detection in more complex structures, e.g., hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms use similarity measures that consider the duplicate status of an object's direct neighbors, e.g., children in hierarchical data, to improve duplicate detection effectiveness. In this paper, we propose a novel method for fuzzy duplicate detection in hierarchical and semi-structured XML data. Unlike previous approaches, it considers not only the duplicate status of children, but also the probability of descendants being duplicates. Probabilities are computed efficiently using a Bayesian network. Experiments show that the proposed algorithm maintains high precision and recall, even on data containing many errors and much missing information. It also outperforms a state-of-the-art duplicate detection system on three different XML databases.