ABSTRACT
Adaptivity is an important feature of data analysis - the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated a general formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis.
Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, accurately answers a sequence of adaptively chosen ``queries'' about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy?
In this work we make two new contributions towards resolving this question:
We give upper bounds on the number of samples n that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015; NIPS, 2015).
We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries (alternatively, risk minimization queries).
As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that the stability notion guaranteed by differential privacy implies low generalization error. We also show that weaker stability guarantees such as bounded KL divergence and total variation distance lead to correspondingly weaker generalization guarantees.
- {BE02} Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002. Google ScholarDigital Library
- {BH95} Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995.Google ScholarCross Ref
- {BH15} Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. CoRR, abs/1502.04585, 2015.Google Scholar
- {Bon36} Carlo Emilio Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubbl. d. R. Ist. Super. di Sci. Econom. e Commerciali di Firenze., 8, 1936.Google Scholar
- {BSSU15} Raef Bassily, Adam Smith, Thomas Steinke, and Jonathan Ullman. More general queries and less generalization error in adaptive data analysis. CoRR, abs/1503.04843, 2015.Google Scholar
- {BST14} Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.Google ScholarDigital Library
- {BUV14} Mark Bun, Jonathan Ullman, and Salil P. Vadhan. Fingerprinting codes and the price of approximate differential privacy. In STOC, pages 1–10. ACM, May 31 – June 3 2014. Google ScholarDigital Library
- {DFH + 15a} Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In NIPS, 2015.Google Scholar
- {DFH + 15b} Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In STOC, 2015. Google ScholarDigital Library
- {DFH + 15c} Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, June 2015.Google ScholarCross Ref
- {DMNS06} Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284. Springer, March 4-7 2006.Google Scholar
- {DN03} Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In PODS, pages 202–210. ACM, June 9-12 2003. Google ScholarDigital Library
- {DRV10} Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60, 2010.Google ScholarDigital Library
- {Dun61} Olive Jean Dunn. Multiple comparisons among means. JASA, 56:52–64, 1961.Google ScholarCross Ref
- {DW79a} Luc Devroye and Terry J. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202–207, 1979. Google ScholarDigital Library
- {DW79b} Luc Devroye and Terry J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979. Google ScholarDigital Library
- {Dwo06} Cynthia Dwork. Differential privacy. In ICALP, pages 1–12, 2006.Google ScholarDigital Library
- {GL14} Andrew Gelman and Eric Loken. The statistical crisis in science. American Scientist, 102(6):460, 2014.Google ScholarCross Ref
- {HR10} Moritz Hardt and Guy Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, pages 61–70. IEEE, 2010.Google Scholar
- {HU14} Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In FOCS, 2014.Google ScholarDigital Library
- {Ioa05} John P. A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):124, August 2005.Google ScholarCross Ref
- {Kea93} Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. In STOC, pages 392–401. ACM, May 16-18 1993. Google ScholarDigital Library
- {KR99} Michael J. Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999. Google ScholarDigital Library
- {McS} Frank McSherry. Differential privacy for measure concentration. Blog post. February 4, 2014. http://windowsontheory.org/. {MT07} Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103. IEEE, Oct 20–23 2007. Google ScholarDigital Library
- {NS15} Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy. CoRR, abs/1504.05800, 2015.Google Scholar
- {RR10} Aaron Roth and Tim Roughgarden. Interactive privacy via the median mechanism. In STOC, pages 765–774. ACM, June 5–8 2010. Google ScholarDigital Library
- {RZ16} Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In AISTATS, 2016.Google Scholar
- {SSSS10} Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. JMLR, 11:2635–2670, 2010. Google ScholarDigital Library
- {SU15a} Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. CoRR, abs/1501.06095, 2015.Google Scholar
- {SU15b} Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In COLT, pages 1588–1628, 2015.Google Scholar
- {Ull13} Jonathan Ullman. Answering n 2+o(1) counting queries with differential privacy is hard. In STOC, pages 361–370. ACM, June 1-4 2013. Google ScholarDigital Library
- {Ull15} Jonathan Ullman. Private multiplicative weights beyond linear queries. In PODS. ACM, May 31–June 4 2015. Google ScholarDigital Library
- {WLF16} Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. A minimax theory for adaptive data analysis. arXiv:1602.04287 {stat.ML}, 2016.Google Scholar
Index Terms
- Algorithmic stability for adaptive data analysis
Recommendations
Abel lemma-based finite-sum inequality and its application to stability analysis for linear discrete time-delay systems
This paper is concerned with stability of linear discrete time-delay systems. Note that a tighter estimation on a finite-sum term appearing in the forward difference of some Lyapunov functional leads to a less conservative delay-dependent stability ...
Settling the Query Complexity of Non-adaptive Junta Testing
We prove that any non-adaptive algorithm that tests whether an unknown Boolean function f:{0,1}n→ {0,1} is a k-junta or ϵ-far from every k-junta must make Ω˜(k3/2) / ϵ) many queries for a wide range of parameters k and ϵ. Our result dramatically ...
Stability Analysis of Switched Linear Systems with Uncertainty and Delays
Combining the common quadratic Lyapunov functional approach and free-weighting matrix approach, this paper is devoted to the stability analysis of continuous-time switched linear systems (CTSLS) with uncertainty and time-delays. A particular class of ...
Comments