DOI:10.1145/3097983.3097992
research-article

Peeking at A/B Tests: Why it matters, and what to do about it

Published: 13 August 2017

Abstract

This paper reports on a novel statistical methodology that the commercial A/B testing platform Optimizely has deployed to communicate experimental results to its customers. The methodology addresses the problem that traditional p-values and confidence intervals give unreliable inference when they are monitored continuously while an experiment is running, as users of A/B testing software are known to do. We provide always valid p-values and confidence intervals that are provably robust to this effect. This not only makes continuous monitoring safe for the user, but also lets her detect true effects more efficiently. Simulations and numerical studies on Optimizely's data demonstrate an improvement in detection performance over traditional methods.
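
The peeking effect described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes unit-variance normal observations with no true effect, and it builds the always valid p-value from a mixture sequential probability ratio test with a N(0, tau^2) mixing distribution, which is one standard construction and only an assumption here. The function names and parameter values are hypothetical.

```python
import math
import random


def fixed_horizon_p(s, n, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = 0, from the running sum s of n observations."""
    z = abs(s) / (sigma * math.sqrt(n))
    return math.erfc(z / math.sqrt(2.0))


def msprt_ratio(s, n, sigma=1.0, tau=1.0):
    """Mixture likelihood ratio against H0: mean = 0, with an assumed N(0, tau^2) mixing prior."""
    v = sigma ** 2 + n * tau ** 2
    return math.sqrt(sigma ** 2 / v) * math.exp(tau ** 2 * s ** 2 / (2.0 * sigma ** 2 * v))


def simulate(trials=1000, n_max=500, alpha=0.05, seed=0):
    """Return the false positive rates of (naive peeking, always valid p-value) under the null."""
    rng = random.Random(seed)
    naive_hits = 0
    valid_hits = 0
    for _ in range(trials):
        s = 0.0                 # running sum of observations
        p_valid = 1.0           # always valid p-value, non-increasing over time
        naive_rejected = False
        for n in range(1, n_max + 1):
            s += rng.gauss(0.0, 1.0)  # null is true: mean 0, variance 1
            # Naive monitoring: reject the first time the fixed-horizon p-value dips below alpha.
            if fixed_horizon_p(s, n) < alpha:
                naive_rejected = True
            # Always valid p-value: running minimum of 1 / (mixture likelihood ratio).
            p_valid = min(p_valid, 1.0 / msprt_ratio(s, n))
        naive_hits += naive_rejected
        valid_hits += (p_valid <= alpha)
    return naive_hits / trials, valid_hits / trials
```

Calling `simulate()` returns the two false positive rates as a pair. Under these assumptions the naive rate should come out well above `alpha`, while the always valid rate should stay at or below it, because the mixture likelihood ratio is a nonnegative martingale under the null.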

Supplementary Material

MP4 File (walsh_peeking_tests.mp4)


Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. a/b testing
  2. confidence intervals
  3. p-values
  4. sequential hypothesis testing

Qualifiers

  • Research-article

Conference

KDD '17

Acceptance Rates

KDD '17 Paper Acceptance Rate 64 of 748 submissions, 9%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
