Abstract
We propose a new approach for estimating and replacing missing categorical data. The approach uses the simple Bayes method to estimate the posterior probability that a missing attribute value belongs to each category. Two alternative methods for replacing the missing value are proposed: the first replaces it with the category that has the maximum estimated posterior probability; the second replaces it with a category selected at random, with probability proportional to that category's estimated posterior probability. The effectiveness of the proposed approach is evaluated using data quality measures that are important for data warehousing and data mining. The results of the experimental study demonstrate the effectiveness of the proposed approach.
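The two replacement methods can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (estimate_posterior, impute_max, impute_sample), the toy records, and the Laplace smoothing constant alpha are all assumptions made for illustration, and the sketch assumes the other attributes of each record are observed.

```python
import random
from collections import Counter


def estimate_posterior(rows, target, record, alpha=1.0):
    """Simple (naive) Bayes posterior over the categories of `target`.

    `rows` are dicts of categorical attributes; rows whose `target` is None
    are ignored for estimation. `alpha` is a Laplace smoothing constant,
    an assumption of this sketch rather than part of the source.
    """
    complete = [r for r in rows if r[target] is not None]
    class_counts = Counter(r[target] for r in complete)
    scores = {}
    for category, count in class_counts.items():
        in_class = [r for r in complete if r[target] == category]
        score = count / len(complete)  # prior P(category)
        for attr, value in record.items():
            if attr == target or value is None:
                continue
            n_values = len({r[attr] for r in complete})
            matches = sum(1 for r in in_class if r[attr] == value)
            # Smoothed estimate of P(attr = value | category)
            score *= (matches + alpha) / (len(in_class) + alpha * n_values)
        scores[category] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def impute_max(posterior):
    """Method 1: pick the category with the maximum estimated probability."""
    return max(posterior, key=posterior.get)


def impute_sample(posterior, rng):
    """Method 2: draw a category with probability proportional to the posterior."""
    categories = list(posterior)
    weights = [posterior[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


# Hypothetical toy data: the "play" value of the last record is missing.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
incomplete = {"outlook": "sunny", "windy": "no", "play": None}

posterior = estimate_posterior(rows, "play", incomplete)
print(posterior)
print("max-probability replacement:", impute_max(posterior))
print("sampled replacement:", impute_sample(posterior, random.Random(0)))
```

The maximum-probability rule always returns the same category for a given posterior, whereas the sampling rule can return any category with nonzero probability. Applied across many records, sampling tends to keep the distribution of imputed values closer to the estimated distribution, at the cost of more individual replacements that differ from the most likely category.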
A Bayesian Approach for Estimating and Replacing Missing Categorical Data