research-article

Public Access

Algorithmic stability for adaptive data analysis

Authors:
Raef Bassily

University of California at San Diego, USA

University of California at San Diego, USA
View Profile

,
Kobbi Nissim

Ben-Gurion University of the Negev, Israel / Harvard University, USA

Ben-Gurion University of the Negev, Israel / Harvard University, USA
View Profile

,
Adam Smith

Pennsylvania State University, USA

Pennsylvania State University, USA
View Profile

,
Thomas Steinke

Harvard University, USA

Harvard University, USA
View Profile

,
Uri Stemmer

Ben-Gurion University of the Negev, Israel

Ben-Gurion University of the Negev, Israel
View Profile

,
Jonathan Ullman

Northeastern University, USA

Northeastern University, USA
View Profile

STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of ComputingJune 2016Pages 1046–1059https://doi.org/10.1145/2897518.2897566

Published:19 June 2016Publication History

STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of Computing

Pages 1046–1059

ABSTRACT

Adaptivity is an important feature of data analysis - the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated a general formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis.

Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, accurately answers a sequence of adaptively chosen ``queries'' about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy?

In this work we make two new contributions towards resolving this question:

We give upper bounds on the number of samples n that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015; NIPS, 2015).
We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries (alternatively, risk minimization queries).

As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that the stability notion guaranteed by differential privacy implies low generalization error. We also show that weaker stability guarantees such as bounded KL divergence and total variation distance lead to correspondingly weaker generalization guarantees.

References

{BE02} Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002. Google ScholarDigital Library
{BH95} Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995.Google ScholarCross Ref
{BH15} Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. CoRR, abs/1502.04585, 2015.Google Scholar
{Bon36} Carlo Emilio Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubbl. d. R. Ist. Super. di Sci. Econom. e Commerciali di Firenze., 8, 1936.Google Scholar
{BSSU15} Raef Bassily, Adam Smith, Thomas Steinke, and Jonathan Ullman. More general queries and less generalization error in adaptive data analysis. CoRR, abs/1503.04843, 2015.Google Scholar
{BST14} Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.Google ScholarDigital Library
{BUV14} Mark Bun, Jonathan Ullman, and Salil P. Vadhan. Fingerprinting codes and the price of approximate differential privacy. In STOC, pages 1–10. ACM, May 31 – June 3 2014. Google ScholarDigital Library
{DFH + 15a} Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In NIPS, 2015.Google Scholar
{DFH + 15b} Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In STOC, 2015. Google ScholarDigital Library
{DFH + 15c} Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, June 2015.Google ScholarCross Ref
{DMNS06} Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284. Springer, March 4-7 2006.Google Scholar
{DN03} Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In PODS, pages 202–210. ACM, June 9-12 2003. Google ScholarDigital Library
{DRV10} Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60, 2010.Google ScholarDigital Library
{Dun61} Olive Jean Dunn. Multiple comparisons among means. JASA, 56:52–64, 1961.Google ScholarCross Ref
{DW79a} Luc Devroye and Terry J. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202–207, 1979. Google ScholarDigital Library
{DW79b} Luc Devroye and Terry J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979. Google ScholarDigital Library
{Dwo06} Cynthia Dwork. Differential privacy. In ICALP, pages 1–12, 2006.Google ScholarDigital Library
{GL14} Andrew Gelman and Eric Loken. The statistical crisis in science. American Scientist, 102(6):460, 2014.Google ScholarCross Ref
{HR10} Moritz Hardt and Guy Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, pages 61–70. IEEE, 2010.Google Scholar
{HU14} Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In FOCS, 2014.Google ScholarDigital Library
{Ioa05} John P. A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):124, August 2005.Google ScholarCross Ref
{Kea93} Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. In STOC, pages 392–401. ACM, May 16-18 1993. Google ScholarDigital Library
{KR99} Michael J. Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999. Google ScholarDigital Library
{McS} Frank McSherry. Differential privacy for measure concentration. Blog post. February 4, 2014. http://windowsontheory.org/. {MT07} Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103. IEEE, Oct 20–23 2007. Google ScholarDigital Library
{NS15} Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy. CoRR, abs/1504.05800, 2015.Google Scholar
{RR10} Aaron Roth and Tim Roughgarden. Interactive privacy via the median mechanism. In STOC, pages 765–774. ACM, June 5–8 2010. Google ScholarDigital Library
{RZ16} Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In AISTATS, 2016.Google Scholar
{SSSS10} Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. JMLR, 11:2635–2670, 2010. Google ScholarDigital Library
{SU15a} Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. CoRR, abs/1501.06095, 2015.Google Scholar
{SU15b} Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In COLT, pages 1588–1628, 2015.Google Scholar
{Ull13} Jonathan Ullman. Answering n 2+o(1) counting queries with differential privacy is hard. In STOC, pages 361–370. ACM, June 1-4 2013. Google ScholarDigital Library
{Ull15} Jonathan Ullman. Private multiplicative weights beyond linear queries. In PODS. ACM, May 31–June 4 2015. Google ScholarDigital Library
{WLF16} Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. A minimax theory for adaptive data analysis. arXiv:1602.04287 {stat.ML}, 2016.Google Scholar

Index Terms

Algorithmic stability for adaptive data analysis

Recommendations

Abel lemma-based finite-sum inequality and its application to stability analysis for linear discrete time-delay systems

This paper is concerned with stability of linear discrete time-delay systems. Note that a tighter estimation on a finite-sum term appearing in the forward difference of some Lyapunov functional leads to a less conservative delay-dependent stability ...
Read More
Settling the Query Complexity of Non-adaptive Junta Testing

We prove that any non-adaptive algorithm that tests whether an unknown Boolean function f:{0,1}ⁿ→ {0,1} is a k-junta or ϵ-far from every k-junta must make Ω^˜(k^3/2) / ϵ) many queries for a wide range of parameters k and ϵ. Our result dramatically ...
Read More
Stability Analysis of Switched Linear Systems with Uncertainty and Delays

Combining the common quadratic Lyapunov functional approach and free-weighting matrix approach, this paper is devoted to the stability analysis of continuous-time switched linear systems (CTSLS) with uncertainty and time-delays. A particular class of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of Computing
June 2016
1141 pages
ISBN:9781450341325
DOI:10.1145/2897518
General Chair:
Daniel Wichs
Northeastern, USA
,
Program Chair:
Yishay Mansour
Tel Aviv
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 June 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Adaptivity
Data Analysis
Differential Privacy
Generalization
Stability
Statistics
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,469of4,586submissions,32%
Upcoming Conference
STOC '24

Sponsor:

sigact

56th Annual ACM Symposium on Theory of Computing (STOC 2024)

June 24 - 28, 2024

Vancouver , BC , Canada
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 61
  Total Citations
  View Citations
- 1,971
  Total Downloads
- Downloads (Last 12 months)299
- Downloads (Last 6 weeks)37
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Algorithmic stability for adaptive data analysis

STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of Computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Abel lemma-based finite-sum inequality and its application to stability analysis for linear discrete time-delay systems

Settling the Query Complexity of Non-adaptive Junta Testing

Stability Analysis of Switched Linear Systems with Uncertainty and Delays

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Algorithmic stability for adaptive data analysis

STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of Computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Abel lemma-based finite-sum inequality and its application to stability analysis for linear discrete time-delay systems

Settling the Query Complexity of Non-adaptive Junta Testing

Stability Analysis of Switched Linear Systems with Uncertainty and Delays

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media