
Guilt-free data reuse

Published: 24 March 2017

Abstract

Existing approaches to ensuring the validity of inferences drawn from data assume a fixed procedure to be performed, selected before the data are examined. Yet the practice of data analysis is an intrinsically interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused.
In this work, we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. We demonstrate new approaches for addressing the challenges of adaptivity that are based on techniques developed in privacy-preserving data analysis.
As an application of our techniques we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced adaptively by a learning algorithm operating on a training set.
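
To make the holdout-reuse method concrete, the sketch below gives a minimal Python implementation of a Thresholdout-style reusable holdout: each adaptively chosen query is first estimated on the training set, and the holdout set is consulted, with Laplace noise added, only when the two estimates disagree by more than a threshold. The class name, default parameter values, noise placement, and budget policy are illustrative assumptions, not the exact constants analyzed in the paper.

```python
import numpy as np

class ReusableHoldout:
    """Illustrative Thresholdout-style mechanism (parameters are assumptions, not the paper's constants)."""

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01, budget=1000, seed=0):
        self.train = train          # training examples; the analyst may inspect these freely
        self.holdout = holdout      # holdout examples; accessed only through query()
        self.threshold = threshold  # tolerance T between training and holdout estimates
        self.sigma = sigma          # Laplace noise scale
        self.budget = budget        # number of "overfitting" detections allowed
        self.rng = np.random.default_rng(seed)

    def query(self, phi):
        """Estimate the mean of phi, a function mapping one example to a value in [0, 1]."""
        if self.budget <= 0:
            raise RuntimeError("holdout budget exhausted; further answers are not validated")
        train_mean = np.mean([phi(x) for x in self.train])
        holdout_mean = np.mean([phi(x) for x in self.holdout])
        # Noisy comparison: consult the holdout only when the training estimate
        # visibly disagrees with it.
        if abs(train_mean - holdout_mean) <= self.threshold + self.rng.laplace(scale=2 * self.sigma):
            return train_mean
        # The training estimate looks overfit: answer from the holdout with fresh
        # noise and charge the overfitting budget.
        self.budget -= 1
        return holdout_mean + self.rng.laplace(scale=self.sigma)
```

For example, an analyst comparing many adaptively tuned classifiers would call query() with each candidate's per-example 0/1 loss; because the holdout influences the answers only through noisy threshold comparisons, the returned estimates stay close to their true expectations even when later queries depend on earlier answers.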


Published In

Communications of the ACM, Volume 60, Issue 4 (April 2017), 86 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3069398
Editor: Moshe Y. Vardi
Publisher: Association for Computing Machinery, New York, NY, United States
