Leakage in data mining: Formulation, detection, and avoidance

Published: 18 December 2012

Abstract

Deemed “one of the top ten data mining mistakes”, leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected. We also offer an alternative point of view on leakage that is based on causal graph modeling concepts.
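As an illustration only, the idea of restricting a model to information that is "legitimately available" can be sketched with a timestamp cutoff. The abstract does not spell out the authors' procedure, so everything here is an assumption made for the example: the `Observation` record, the `legitimate` helper, the feature names, and the dates are all hypothetical. The sketch shows just one simple way to keep post-outcome information out of the features.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """A single recorded value, tagged with the date it became known (hypothetical schema)."""
    name: str
    value: float
    observed_on: date


def legitimate(observations, cutoff):
    """Keep only observations that were known strictly before the cutoff date."""
    return [o for o in observations if o.observed_on < cutoff]


# Hypothetical customer history; the target is an event occurring on 2011-06-01.
history = [
    Observation("site_visits", 12.0, date(2011, 5, 20)),
    Observation("support_calls", 1.0, date(2011, 5, 28)),
    # Recorded after the target event, so using it as a feature would leak the outcome.
    Observation("refund_issued", 1.0, date(2011, 6, 3)),
]

cutoff = date(2011, 6, 1)
features = legitimate(history, cutoff)
feature_names = [o.name for o in features]
print(feature_names)  # ['site_visits', 'support_calls']
```

Applying the same cutoff rule when building both the training data and the data scored at prediction time is one way to keep the two stages consistent. Whether this matches the paper's learn-predict separation in detail is not something the abstract confirms.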

    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 4
    Special Issue on the Best of SIGKDD 2011
    December 2012
    141 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/2382577
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 December 2012
    Accepted: 01 May 2012
    Revised: 01 March 2012
    Received: 01 November 2011
    Published in TKDD Volume 6, Issue 4

    Author Tags

    1. Data mining
    2. leakage
    3. predictive modeling

    Qualifiers

    • Research-article
    • Research
    • Refereed
