skip to main content
10.1145/2897518.2897566acmconferencesArticle/Chapter ViewAbstractPublication PagesstocConference Proceedingsconference-collections
research-article
Public Access

Algorithmic stability for adaptive data analysis

Published:19 June 2016Publication History

ABSTRACT

Adaptivity is an important feature of data analysis - the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated a general formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis.

Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, accurately answers a sequence of adaptively chosen ``queries'' about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy?

In this work we make two new contributions towards resolving this question:

  • We give upper bounds on the number of samples n that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015; NIPS, 2015).

  • We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries (alternatively, risk minimization queries).

As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that the stability notion guaranteed by differential privacy implies low generalization error. We also show that weaker stability guarantees such as bounded KL divergence and total variation distance lead to correspondingly weaker generalization guarantees.

References

  1. {BE02} Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. {BH95} Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995.Google ScholarGoogle ScholarCross RefCross Ref
  3. {BH15} Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. CoRR, abs/1502.04585, 2015.Google ScholarGoogle Scholar
  4. {Bon36} Carlo Emilio Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubbl. d. R. Ist. Super. di Sci. Econom. e Commerciali di Firenze., 8, 1936.Google ScholarGoogle Scholar
  5. {BSSU15} Raef Bassily, Adam Smith, Thomas Steinke, and Jonathan Ullman. More general queries and less generalization error in adaptive data analysis. CoRR, abs/1503.04843, 2015.Google ScholarGoogle Scholar
  6. {BST14} Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. {BUV14} Mark Bun, Jonathan Ullman, and Salil P. Vadhan. Fingerprinting codes and the price of approximate differential privacy. In STOC, pages 1–10. ACM, May 31 – June 3 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. {DFH + 15a} Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In NIPS, 2015.Google ScholarGoogle Scholar
  9. {DFH + 15b} Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In STOC, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. {DFH + 15c} Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, June 2015.Google ScholarGoogle ScholarCross RefCross Ref
  11. {DMNS06} Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284. Springer, March 4-7 2006.Google ScholarGoogle Scholar
  12. {DN03} Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In PODS, pages 202–210. ACM, June 9-12 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. {DRV10} Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. {Dun61} Olive Jean Dunn. Multiple comparisons among means. JASA, 56:52–64, 1961.Google ScholarGoogle ScholarCross RefCross Ref
  15. {DW79a} Luc Devroye and Terry J. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202–207, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. {DW79b} Luc Devroye and Terry J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. {Dwo06} Cynthia Dwork. Differential privacy. In ICALP, pages 1–12, 2006.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. {GL14} Andrew Gelman and Eric Loken. The statistical crisis in science. American Scientist, 102(6):460, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  19. {HR10} Moritz Hardt and Guy Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, pages 61–70. IEEE, 2010.Google ScholarGoogle Scholar
  20. {HU14} Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In FOCS, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. {Ioa05} John P. A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):124, August 2005.Google ScholarGoogle ScholarCross RefCross Ref
  22. {Kea93} Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. In STOC, pages 392–401. ACM, May 16-18 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. {KR99} Michael J. Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. {McS} Frank McSherry. Differential privacy for measure concentration. Blog post. February 4, 2014. http://windowsontheory.org/. {MT07} Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103. IEEE, Oct 20–23 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. {NS15} Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy. CoRR, abs/1504.05800, 2015.Google ScholarGoogle Scholar
  26. {RR10} Aaron Roth and Tim Roughgarden. Interactive privacy via the median mechanism. In STOC, pages 765–774. ACM, June 5–8 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. {RZ16} Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In AISTATS, 2016.Google ScholarGoogle Scholar
  28. {SSSS10} Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. JMLR, 11:2635–2670, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. {SU15a} Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. CoRR, abs/1501.06095, 2015.Google ScholarGoogle Scholar
  30. {SU15b} Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In COLT, pages 1588–1628, 2015.Google ScholarGoogle Scholar
  31. {Ull13} Jonathan Ullman. Answering n 2+o(1) counting queries with differential privacy is hard. In STOC, pages 361–370. ACM, June 1-4 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. {Ull15} Jonathan Ullman. Private multiplicative weights beyond linear queries. In PODS. ACM, May 31–June 4 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. {WLF16} Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. A minimax theory for adaptive data analysis. arXiv:1602.04287 {stat.ML}, 2016.Google ScholarGoogle Scholar

Index Terms

  1. Algorithmic stability for adaptive data analysis

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of Computing
          June 2016
          1141 pages
          ISBN:9781450341325
          DOI:10.1145/2897518

          Copyright © 2016 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 19 June 2016

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate1,469of4,586submissions,32%

          Upcoming Conference

          STOC '24
          56th Annual ACM Symposium on Theory of Computing (STOC 2024)
          June 24 - 28, 2024
          Vancouver , BC , Canada

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader