Indeterministic Handling of Uncertain Decisions in Deduplication

Authors:

Maurice van Keulen,

Norbert Ritter

Journal of Data and Information Quality (JDIQ), Volume 4, Issue 2

Article No.: 9, Pages 1 - 25

https://doi.org/10.1145/2435221.2435225

Published: 01 March 2013

Abstract

In current research and practice, deduplication is usually treated as a deterministic process in which database tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not clear-cut which tuples represent the same real-world entity. Deterministic approaches may ignore many realistic possibilities, which in turn can lead to false decisions. In this article, we present an indeterministic approach to deduplication that uses a probabilistic target model and includes techniques for the proper probabilistic interpretation of similarity matching results. Thus, instead of deciding for one of the most likely situations, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Moreover, the deduplication process becomes almost fully automatic, and human effort can be largely reduced. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
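The general idea of keeping uncertain duplicate decisions open, rather than forcing a yes/no answer, can be illustrated with a minimal Python sketch. It is not the article's method: the function names `classify` and `possible_worlds`, the thresholds `t_low` and `t_high`, the direct use of a similarity score as a matching probability, and the independence of pair decisions over disjoint tuple pairs are all assumptions made only for this illustration.

```python
from itertools import product


def classify(sim, t_low=0.4, t_high=0.8):
    """Three-way split of a similarity score (thresholds are assumed values)."""
    if sim >= t_high:
        return "match"
    if sim < t_low:
        return "non-match"
    return "uncertain"


def possible_worlds(pairs, t_low=0.4, t_high=0.8):
    """Enumerate possible worlds for a dict {(a, b): similarity}.

    Assumptions (hypothetical, for illustration only): the tuple pairs are
    disjoint, pair decisions are independent, and the similarity of an
    uncertain pair is read directly as its probability of being a duplicate.
    Returns a list of (frozenset of merged pairs, probability).
    """
    certain = frozenset(
        p for p, s in pairs.items() if classify(s, t_low, t_high) == "match"
    )
    uncertain = [
        (p, s) for p, s in pairs.items() if classify(s, t_low, t_high) == "uncertain"
    ]
    worlds = []
    # Each uncertain pair is either a duplicate (True) or not (False).
    for choices in product([True, False], repeat=len(uncertain)):
        prob = 1.0
        merged = set(certain)
        for (pair, sim), is_dup in zip(uncertain, choices):
            prob *= sim if is_dup else 1.0 - sim
            if is_dup:
                merged.add(pair)
        worlds.append((frozenset(merged), prob))
    return worlds


if __name__ == "__main__":
    example = {("t1", "t2"): 0.9, ("t3", "t4"): 0.6, ("t5", "t6"): 0.1}
    for merged, prob in possible_worlds(example):
        print(sorted(merged), round(prob, 3))
```

In this sketch only the pairs whose score falls between the two thresholds are handled indeterministically, so the number of possible worlds grows with the number of uncertain pairs rather than with all pairs. A realistic setting would also have to respect transitivity between overlapping pairs, which this sketch sidesteps by assuming disjoint pairs.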

Supplementary Material

PDF File (a9-panse_appendix.pdf)

The proof is given in an electronic appendix, available online in the ACM Digital Library.



Index Terms

1. Information systems
  1. Information systems applications

Published In

Journal of Data and Information Quality Volume 4, Issue 2

Special Issue on Entity Resolution

March 2013

88 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/2435221

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2013

Accepted: 01 February 2012

Revised: 01 December 2011

Received: 01 December 2010

Published in JDIQ Volume 4, Issue 2

Qualifiers

Research-article
Research
Refereed

Cited By

Azeroual, O., Nikiforova, A., and Sha, K. 2023. Overlooked Aspects of Data Governance: Workflow Framework For Enterprise Data Deduplication. In 2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS). 65--73. Online publication date: 19-Jun-2023. https://doi.org/10.1109/ICCNS58795.2023.10193478

van Keulen, M. 2022. Probabilistic Data Integration. In Encyclopedia of Big Data Technologies. 1--8. Online publication date: 15-Jun-2022. https://doi.org/10.1007/978-3-319-63962-8_18-2

Keulen, M. 2019. Probabilistic Data Integration. In Encyclopedia of Big Data Technologies. 1308--1315. Online publication date: 20-Feb-2019. https://doi.org/10.1007/978-3-319-77525-8_18

Ceravolo, P., Azzini, A., Angelini, M., Catarci, T., Cudré-Mauroux, P., Damiani, E., Mazak, A., Van Keulen, M., Jarrar, M., Santucci, G., Sattler, K., Scannapieco, M., Wimmer, M., Wrembel, R., and Zaraket, F. 2018. Big Data Semantics. Journal on Data Semantics 7, 2, 65--85. Online publication date: 23-May-2018. https://doi.org/10.1007/s13740-018-0086-2

Keulen, M. 2018. Probabilistic Data Integration. In Encyclopedia of Big Data Technologies. 1--9. Online publication date: 12-Feb-2018. https://doi.org/10.1007/978-3-319-63962-8_18-1

van Keulen, M., Kaminski, B., Matheja, C., and Katoen, J. 2018. Rule-Based Conditioning of Probabilistic Data. In Scalable Uncertainty Management. 290--305. Online publication date: 11-Sep-2018. https://doi.org/10.1007/978-3-030-00461-3_20

Panse, F. and Ritter, N. 2012. Evaluating indeterministic duplicate detection results. In Proceedings of the 6th international conference on Scalable Uncertainty Management. 433--446. Online publication date: 17-Sep-2012. https://dl.acm.org/doi/10.1007/978-3-642-33362-0_33
