ABSTRACT
Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently, such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge, are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover the more complex cases of leakage that have recently been documented, such as those where the classical i.i.d. assumption is violated. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive a general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple, specific approach to data management followed by what we call a learn-predict separation, and we present several ways of detecting leakage when the modeler has no control over how the data have been collected.
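The abstract names the avoidance idea without giving details, so the following is only a minimal sketch of one way it could look, not the paper's actual procedure. It assumes each observation carries a date on which it became known, and treats an observation as legitimate only if that date precedes the moment the prediction is needed. The names `Observation`, `legitimate_features`, and `learn_predict_split`, the field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One recorded value, tagged with the date it became known (assumed tag)."""
    name: str
    value: float
    observed_on: date


def legitimate_features(observations, decision_date):
    """Keep only values that were known strictly before the decision date."""
    return {o.name: o.value for o in observations if o.observed_on < decision_date}


def learn_predict_split(examples, cutoff):
    """Separate examples by decision date so the learn set precedes the predict set."""
    learn = [e for e in examples if e["decision_date"] < cutoff]
    predict = [e for e in examples if e["decision_date"] >= cutoff]
    return learn, predict


# Hypothetical data: the purchase amount is recorded after the decision date,
# so using it as a feature would leak information about the target.
observations = [
    Observation("visits_last_month", 3.0, date(2011, 1, 10)),
    Observation("purchase_amount", 120.0, date(2011, 2, 5)),
]
features = legitimate_features(observations, date(2011, 2, 1))

examples = [
    {"id": 1, "decision_date": date(2011, 1, 15)},
    {"id": 2, "decision_date": date(2011, 3, 1)},
]
learn, predict = learn_predict_split(examples, date(2011, 2, 1))
```

Under these assumptions, `features` keeps only `visits_last_month`, and example 1 lands in the learn set while example 2 lands in the predict set. A real project would need to decide what counts as the moment an observation became known, which this sketch simply takes as given.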
Index Terms
- Leakage in data mining: formulation, detection, and avoidance
Recommendations
Leakage in data mining: Formulation, detection, and avoidance
Special Issue on the Best of SIGKDD 2011