Indeterministic Handling of Uncertain Decisions in Deduplication

Published: 01 March 2013

Abstract

In current research and practice, deduplication is usually treated as a deterministic process in which database tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not clear-cut which tuples represent the same real-world entity. Deterministic approaches may ignore many realistic possibilities, which in turn can lead to false decisions. In this article, we present an indeterministic approach to deduplication that uses a probabilistic target model and includes techniques for the proper probabilistic interpretation of similarity matching results. Thus, instead of deciding on one of the most likely situations, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Moreover, the deduplication process becomes almost fully automatic, and human effort can be largely reduced. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
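
The general idea of handling only the ambiguous decisions indeterministically can be illustrated with a minimal sketch. It is not the method of the article: the two thresholds, the linear mapping from similarity to match probability, the independence of pair decisions, and the names classify_pairs and possible_worlds are all assumptions made for illustration. The sketch also ignores that pairwise decisions must form a consistent partition of the tuples.

```python
from itertools import product


def classify_pairs(similarities, lower=0.3, upper=0.8):
    """Split tuple pairs into certain matches, certain non-matches,
    and uncertain pairs that carry a match probability.

    The thresholds and the linear rescaling are illustrative assumptions,
    not values or formulas taken from the article.
    """
    matches, non_matches, uncertain = [], [], {}
    for pair, sim in similarities.items():
        if sim >= upper:
            matches.append(pair)        # decided deterministically
        elif sim <= lower:
            non_matches.append(pair)    # decided deterministically
        else:
            # Assumption: similarity is rescaled linearly to a probability.
            uncertain[pair] = (sim - lower) / (upper - lower)
    return matches, non_matches, uncertain


def possible_worlds(uncertain):
    """Enumerate every match / non-match combination of the uncertain pairs.

    Each world is (set of pairs treated as duplicates, probability).
    Assumption: pair decisions are independent of each other.
    """
    pairs = list(uncertain)
    worlds = []
    for choice in product([True, False], repeat=len(pairs)):
        prob = 1.0
        for pair, is_match in zip(pairs, choice):
            p = uncertain[pair]
            prob *= p if is_match else 1.0 - p
        duplicates = {pair for pair, is_match in zip(pairs, choice) if is_match}
        worlds.append((duplicates, prob))
    return worlds


if __name__ == "__main__":
    # Hypothetical similarity scores for three tuples t1, t2, t3.
    sims = {("t1", "t2"): 0.9, ("t1", "t3"): 0.55, ("t2", "t3"): 0.1}
    matches, non_matches, uncertain = classify_pairs(sims)
    for duplicates, prob in possible_worlds(uncertain):
        print(sorted(duplicates), round(prob, 3))
```

Narrowing the band between the two thresholds shrinks the set of indeterministically handled decisions, which keeps the number of enumerated worlds (two to the power of the number of uncertain pairs) manageable.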

Supplementary Material

PDF File (a9-panse_appendix.pdf)
The proof is given in an electronic appendix, available online in the ACM Digital Library.


    Published In

    Journal of Data and Information Quality  Volume 4, Issue 2
    Special Issue on Entity Resolution
    March 2013
    88 pages
    ISSN:1936-1955
    EISSN:1936-1963
    DOI:10.1145/2435221
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 March 2013
    Accepted: 01 February 2012
    Revised: 01 December 2011
    Received: 01 December 2010
    Published in JDIQ Volume 4, Issue 2

    Author Tags

    1. Deduplication
    2. Probabilistic Data
    3. Uncertainty

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2023) Overlooked Aspects of Data Governance: Workflow Framework For Enterprise Data Deduplication. 2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS), 65-73. DOI: 10.1109/ICCNS58795.2023.10193478. Online publication date: 19-Jun-2023
    • (2022) Probabilistic Data Integration. Encyclopedia of Big Data Technologies, 1-8. DOI: 10.1007/978-3-319-63962-8_18-2. Online publication date: 15-Jun-2022
    • (2019) Probabilistic Data Integration. Encyclopedia of Big Data Technologies, 1308-1315. DOI: 10.1007/978-3-319-77525-8_18. Online publication date: 20-Feb-2019
    • (2018) Big Data Semantics. Journal on Data Semantics 7:2, 65-85. DOI: 10.1007/s13740-018-0086-2. Online publication date: 23-May-2018
    • (2018) Probabilistic Data Integration. Encyclopedia of Big Data Technologies, 1-9. DOI: 10.1007/978-3-319-63962-8_18-1. Online publication date: 12-Feb-2018
    • (2018) Rule-Based Conditioning of Probabilistic Data. Scalable Uncertainty Management, 290-305. DOI: 10.1007/978-3-030-00461-3_20. Online publication date: 11-Sep-2018
    • (2012) Evaluating indeterministic duplicate detection results. Proceedings of the 6th international conference on Scalable Uncertainty Management, 433-446. DOI: 10.1007/978-3-642-33362-0_33. Online publication date: 17-Sep-2012
