DOI: 10.1145/2020408.2020496
research-article

Leakage in data mining: formulation, detection, and avoidance

Published: 21 August 2011

ABSTRACT

Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical i.i.d. assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected.
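The abstract does not spell out the learn-predict separation, so the following is only a minimal sketch of one common reading of the idea, not the paper's own formulation: any statistic used to build features is computed from the learning data alone and then reused unchanged at prediction time. The toy records, the time cutoff, and the helper names `split_by_time` and `center` are all hypothetical and chosen for illustration.

```python
from statistics import mean

# Hypothetical toy records: (timestamp, feature value, target).
records = [
    (1, 10.0, 0),
    (2, 12.0, 0),
    (3, 11.0, 1),
    (4, 30.0, 1),
    (5, 32.0, 1),
]


def split_by_time(rows, cutoff):
    """Rows up to the cutoff are for learning; later rows are only predicted on."""
    learn = [r for r in rows if r[0] <= cutoff]
    predict = [r for r in rows if r[0] > cutoff]
    return learn, predict


def center(rows, reference_rows):
    """Center the feature using a mean computed from reference_rows only."""
    mu = mean(x for _, x, _ in reference_rows)
    return [(t, x - mu, y) for t, x, y in rows]


learn_rows, predict_rows = split_by_time(records, cutoff=3)

# Leaky: the centering statistic is computed over all rows, so values from
# the prediction period influence the features used for learning.
leaky_learn = center(learn_rows, records)

# Separated: the statistic comes from the learning rows alone and is then
# reused unchanged on the prediction rows.
clean_learn = center(learn_rows, learn_rows)
clean_predict = center(predict_rows, learn_rows)
```

In the leaky variant the learning features shift because later observations change the mean; in the separated variant nothing from the prediction period reaches the learning step.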


Published in

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN: 9781450308137
DOI: 10.1145/2020408

Copyright © 2011 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
