research-article

Peeking at A/B Tests: Why it matters, and what to do about it

Authors: Leonid Pekelis, David Walsh

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 1517 - 1525

https://doi.org/10.1145/3097983.3097992

Published: 13 August 2017

Abstract

This paper reports on novel statistical methodology that the commercial A/B testing platform Optimizely has deployed to communicate experimental results to its customers. Our methodology addresses the problem that traditional p-values and confidence intervals give unreliable inference when users continuously monitor them while an experiment is running, as users of A/B testing software are known to do. We provide always valid p-values and confidence intervals that are provably robust to this effect. Not only does this make it safe for a user to continuously monitor, but it empowers her to detect true effects more efficiently. This paper provides simulations and numerical studies on Optimizely's data, demonstrating an improvement in detection performance over traditional methods.
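The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation. It assumes a one-sample setting with unit-variance normal observations and a null hypothesis of zero mean, and it builds the always valid p-value from a mixture sequential probability ratio test (mSPRT) with a normal mixing prior. That is one standard construction, chosen here as an assumption because the abstract does not specify one. All function names and the parameters `sigma`, `tau`, `n_max`, and `n_experiments` are hypothetical.

```python
import math
import random


def naive_p_value(total, n, sigma=1.0):
    """Two-sided fixed-horizon z-test p-value for H0: mean = 0."""
    z = abs(total) / (sigma * math.sqrt(n))
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)) for the standard normal.
    return math.erfc(z / math.sqrt(2.0))


def msprt_likelihood_ratio(total, n, sigma=1.0, tau=1.0):
    """Mixture likelihood ratio against H0: mean = 0.

    Assumes the alternative mean is drawn from a N(0, tau^2) mixing prior.
    """
    v = sigma ** 2 + n * tau ** 2
    exponent = (total ** 2) * tau ** 2 / (2.0 * sigma ** 2 * v)
    return math.sqrt(sigma ** 2 / v) * math.exp(exponent)


def run_experiment(rng, n_max, alpha, effect=0.0):
    """Peek after every observation; report whether each method ever rejects."""
    total = 0.0
    always_valid_p = 1.0
    naive_rejected = False
    always_valid_rejected = False
    for n in range(1, n_max + 1):
        total += rng.gauss(effect, 1.0)
        if naive_p_value(total, n) < alpha:
            naive_rejected = True
        # The always valid p-value is the running minimum of 1 / likelihood ratio.
        always_valid_p = min(always_valid_p, 1.0 / msprt_likelihood_ratio(total, n))
        if always_valid_p < alpha:
            always_valid_rejected = True
    return naive_rejected, always_valid_rejected


def false_positive_rates(n_experiments=500, n_max=500, alpha=0.05, seed=0):
    """Fraction of null experiments (no true effect) in which each method rejects."""
    rng = random.Random(seed)
    naive_count = 0
    always_valid_count = 0
    for _ in range(n_experiments):
        naive_rejected, always_valid_rejected = run_experiment(rng, n_max, alpha)
        naive_count += naive_rejected
        always_valid_count += always_valid_rejected
    return naive_count / n_experiments, always_valid_count / n_experiments


if __name__ == "__main__":
    naive_rate, always_valid_rate = false_positive_rates()
    print(f"naive peeking false positive rate:  {naive_rate:.3f}")
    print(f"always valid false positive rate:   {always_valid_rate:.3f}")
```

Under these assumptions, the naive rate should come out well above the nominal level `alpha`, while the always valid rate should stay at or below it up to simulation noise. The exact figures depend on the seed, the horizon `n_max`, and the choice of `tau`.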

Supplementary Material

MP4 File (walsh_peeking_tests.mp4)

Index Terms

Peeking at A/B Tests: Why it matters, and what to do about it
1. Mathematics of computing
  1. Probability and statistics
    1. Probabilistic inference problems
      1. Bayesian computation
      2. Maximum likelihood estimation
2. Theory of computation
  1. Design and analysis of algorithms
  2. Theory and algorithms for application domains
    1. Machine learning theory
      1. Bayesian analysis

Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2017

2240 pages

ISBN:9781450348874

DOI:10.1145/3097983

General Chairs: Stan Matwin (Dalhousie University), Shipeng Yu (LinkedIn), Faisal Farooq (IBM)

Copyright © 2017 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

KDD '17: The 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 13 - 17, 2017

Halifax, NS, Canada

Acceptance Rates

KDD '17 Paper Acceptance Rate 64 of 748 submissions, 9%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Article Metrics

Total Citations: 61
Total Downloads: 3,713
Downloads (Last 12 months): 373
Downloads (Last 6 weeks): 43

Reflects downloads up to 14 Feb 2025

Cited By

Yi P, Yang Y, Lee C, Achour S, Eeckhout L, Smaragdakis G, Liang K, Sampson A, Kim M, Rossbach C. 2025. Early Termination for Hyperdimensional Computing Using Inferential Statistics. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 342-360. Online publication date: 3-Feb-2025.
https://dl.acm.org/doi/10.1145/3669940.3707254
Waudby-Smith I, Arbour D, Sinha R, Kennedy E, Ramdas A. 2024. Time-uniform central limit theory and asymptotic confidence sequences. The Annals of Statistics 52:6. Online publication date: 1-Dec-2024.
https://doi.org/10.1214/24-AOS2408
Yu H, Wang H, Tian C, Sathyanarayana S, Shi S, Xue Z, Yu S, Li H, Peng Y, Pang X, Zhang R, Vallina-Rodríguez N, Suarez-Tángil G, Levin D, Pelsser C. 2024. Magpie: Improving the Efficiency of A/B Tests for Large Scale Video-on-Demand Systems. Proceedings of the 2024 ACM on Internet Measurement Conference, 588-594. Online publication date: 4-Nov-2024.
https://dl.acm.org/doi/10.1145/3646547.3689019
Kong F, Zhao P, Han S, Wang Y, Li S, Serra E, Spezzano F. 2024. Sequential Optimum Test with Multi-armed Bandits for Online Experimentation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 4645-4652. Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3680040
Fiez T, Nassif H, Chen Y, Gamez S, Jain L, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H. 2024. Best of Three Worlds: Adaptive Experimentation for Digital Marketing in Practice. Proceedings of the ACM Web Conference 2024, 3586-3597. Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589334.3645504
Khitskova Y, Asnina N, Efimova O, Makovy K. 2024. Features of A/B Testing of a Delivery App. 2024 6th International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA), 743-747. Online publication date: 13-Nov-2024.
https://doi.org/10.1109/SUMMA64428.2024.10803747
Tang M, Qin Y, Zhang L, Lu H, Zhang T. 2024. A Quantitative Evaluation Approach for Online Review Quality Based on the AHP-Fuzzy Comprehensive Evaluation. 2024 11th International Conference on Dependable Systems and Their Applications (DSA), 269-279. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/DSA63982.2024.00043
Quin F, Weyns D, Galster M, Silva C. 2024. A/B testing. Journal of Systems and Software 211:C. Online publication date: 2-Jul-2024.
https://dl.acm.org/doi/10.1016/j.jss.2024.112011
Reis I, Nunes E, Rodrigues R. 2024. A/B Testing Protocol for Digital Educational Resources. Advances in Design and Digital Communication V, 121-135. Online publication date: 24-Dec-2024.
https://doi.org/10.1007/978-3-031-77566-6_9
Zhou W, Kroehl M, Meier M, Kaizer A. 2023. Building a Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large Scale Data. Journal of Data Science, 412-427. Online publication date: 21-Apr-2023.
https://doi.org/10.6339/23-JDS1099
