The problem of disguised missing data
Source: ACM SIGKDD Explorations Newsletter
Volume 8, Issue 1 (June 2006)
Pages: 83-92
Year of Publication: 2006
ISSN: 1931-0145
Author: Ronald K. Pearson, ProSanos Corporation, Harrisburg, PA
Publisher: ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1147234.1147247

ABSTRACT

Missing data is a well-recognized problem in large datasets, widely discussed in the statistics and data analysis literature. Many programming environments provide explicit codes for missing data, but these are not standardized and are not always used. This lack of standardization is one of the leading causes of the subtle problem of disguised missing data, in which unknown, inapplicable, or otherwise nonspecified responses are encoded as valid data values. Following a brief overview of the problem of explicitly coded missing data, this paper discusses sources, consequences, and detection of disguised missing data, including two real-world examples. As the first of these examples illustrates, the consequences of disguised missing data can be quite serious. The key to its detection lies in, first, recognizing disguised missing data as a possibility and, second, finding a sufficiently informative view of the data to reveal its presence.
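
As an illustration of what an "informative view" might look like, the sketch below tabulates how often each value occurs and flags values that appear far more often than is typical. This is not the method described in the paper; it is a minimal heuristic under stated assumptions. The function name flag_frequent_values, the ratio threshold, and the age data (with 99 standing in for "unknown") are all hypothetical.

```python
from collections import Counter
from statistics import median


def flag_frequent_values(values, ratio=5.0):
    """Return values whose count exceeds `ratio` times the median count.

    Hypothetical heuristic: a placeholder such as 99 or -1 entered in place
    of an unknown response tends to occur far more often than genuine values.
    """
    counts = Counter(values)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(v for v, c in counts.items() if c > ratio * typical)


# Made-up ages in which 99 was recorded whenever the true age was unknown.
ages = [23, 31, 31, 45, 52, 38, 27, 64, 99, 99, 99, 99, 99, 99, 41, 36]
suspects = flag_frequent_values(ages, ratio=3.0)
print(suspects)
```

A flagged value is only a candidate for inspection. A genuinely common value, such as a legal minimum age, would be flagged in the same way, so the result still has to be judged against knowledge of how the data were collected.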

