ABSTRACT
We study ratio overall evaluation criteria (user behavior quality metrics) and, in particular, averages of non-user-level metrics, which are widely used in A/B testing as an important part of modern Internet companies' evaluation instruments (e.g., abandonment rate, a user's absence time after a session).
We focus on improving the sensitivity of these criteria, since there is a large gap between the variety of sensitivity improvement techniques designed for user-level metrics and the variety of such techniques available for ratio criteria.
We propose a novel transformation of a ratio criterion into the average value of a user-level (randomization-unit-level, in general) metric. This transformation makes it possible to directly apply a wide range of sensitivity improvement techniques designed for the user level, making A/B tests more efficient. We provide theoretical guarantees on the novel metric's consistency in terms of preservation of two crucial properties (directionality and significance level) with respect to the source ratio criterion.
We experimentally evaluate the approach on hundreds of large-scale real A/B tests run at one of the most popular global search engines. The results reinforce the theoretical guarantees and demonstrate up to +34% sensitivity rate improvement when the transformation is combined with the best known regression adjustment.
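The abstract does not spell out the transformation itself. The sketch below is a minimal illustration of the linearization idea associated with this line of work, assuming the user-level metric has the form L_u = X_u − r·Y_u, where (X_u, Y_u) are a user's numerator and denominator contributions to the ratio and the coefficient r is fixed to the control group's overall ratio. The click/query setup and all variable names are synthetic assumptions for illustration, not the paper's actual data.

```python
import numpy as np

def linearize_ratio(x, y, r):
    """Map per-user (numerator, denominator) pairs to a user-level
    metric L_u = x_u - r * y_u; averaging L_u over users yields a plain
    user-level average instead of a ratio of sums."""
    return x - r * y

rng = np.random.default_rng(0)
n_users = 10_000

# Synthetic control group: queries and clicks per user.
queries_c = rng.poisson(10, n_users) + 1
clicks_c = rng.binomial(queries_c, 0.30)

# Synthetic treatment group with a slightly higher click-through rate.
queries_t = rng.poisson(10, n_users) + 1
clicks_t = rng.binomial(queries_t, 0.31)

# Fix the coefficient to the control group's overall ratio metric.
r = clicks_c.sum() / queries_c.sum()

l_c = linearize_ratio(clicks_c, queries_c, r)  # mean is 0 by construction
l_t = linearize_ratio(clicks_t, queries_t, r)

# Directionality: the difference in user-level means of L has the same
# sign as the difference of the source ratio metrics, because
# mean(L_t) = mean(queries_t) * (ratio_t - r).
ratio_diff = clicks_t.sum() / queries_t.sum() - r
lin_diff = l_t.mean() - l_c.mean()
```

Because L_u is a per-user scalar, standard user-level machinery (two-sample t-tests, regression adjustment on pre-experiment covariates) applies to it directly, which is the opportunity the abstract refers to.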
Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments