research-article

A Bayesian Approach for Estimating and Replacing Missing Categorical Data

Published: 01 June 2009

Abstract

We propose a new approach for estimating and replacing missing categorical data. With this approach, the posterior probability that a missing attribute value belongs to each category is estimated using the simple Bayes method. Two alternative methods for replacing the missing value are proposed: the first replaces the missing value with the category that has the maximum estimated posterior probability; the second uses a category selected at random with probability proportional to its estimated posterior probability. The approach is evaluated using data quality measures that are important for data warehousing and data mining, and the results of the experimental study demonstrate its effectiveness.
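The abstract does not give implementation details, so the following is only a minimal sketch of the two replacement rules under stated assumptions: the "simple Bayes method" is taken to be a standard naive Bayes estimate computed from complete records, Laplace smoothing (`alpha`) is added for numerical safety, and all function names and the toy data are hypothetical rather than taken from the paper.

```python
import random
from collections import Counter, defaultdict


def posterior(records, target, query, alpha=1.0):
    """Estimate P(target = c | observed values in query) with naive Bayes.

    records: list of dicts with no missing values (complete cases).
    target:  name of the attribute whose value is missing.
    query:   dict of the observed attribute values of the incomplete record.
    alpha:   Laplace smoothing constant (an assumption, not from the paper).
    """
    class_counts = Counter(r[target] for r in records)
    n = len(records)
    # Distinct values seen for each observed attribute (used for smoothing).
    domain = {a: {r[a] for r in records} for a in query}
    # counts[(attribute, category)][value] = number of matching records
    counts = defaultdict(Counter)
    for r in records:
        for a in query:
            counts[(a, r[target])][r[a]] += 1

    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / n  # prior P(c)
        for a, v in query.items():
            # smoothed conditional P(a = v | c)
            p *= (counts[(a, c)][v] + alpha) / (n_c + alpha * len(domain[a]))
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def impute_max(post):
    """Rule 1: pick the category with the largest estimated posterior."""
    return max(post, key=post.get)


def impute_sample(post, rng=random):
    """Rule 2: draw a category with probability proportional to its posterior."""
    cats = list(post)
    return rng.choices(cats, weights=[post[c] for c in cats], k=1)[0]


# Toy complete records (hypothetical data for illustration only).
records = [
    {"color": "red", "size": "small", "label": "A"},
    {"color": "red", "size": "small", "label": "A"},
    {"color": "red", "size": "large", "label": "A"},
    {"color": "blue", "size": "large", "label": "B"},
    {"color": "blue", "size": "small", "label": "B"},
]

# A record whose "color" value is missing.
post = posterior(records, "color", {"size": "large", "label": "B"})
print(post)
print(impute_max(post))
print(impute_sample(post, random.Random(0)))
```

The first rule is deterministic and always chooses the most probable category, while the second draws from the estimated distribution, so repeated imputations of similar records do not all collapse onto the most probable category.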



        • Published in

          Journal of Data and Information Quality, Volume 1, Issue 1
          June 2009
          94 pages
          ISSN: 1936-1955
          EISSN: 1936-1963
          DOI: 10.1145/1515693

          Copyright © 2009 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Qualifiers

          • research-article
          • Research
          • Refereed
