DOI: 10.1145/1835449.1835534

Learning more powerful test statistics for click-based retrieval evaluation

Published: 19 July 2010

Abstract

Interleaving experiments are an attractive methodology for evaluating retrieval functions through implicit feedback. Designed as a blind and unbiased test for eliciting a preference between two retrieval functions, an interleaving experiment presents users with a single interleaved ranking of the results of both functions and observes whether users click more on results from one function or the other. While such interleaving experiments have been shown to reliably identify the better of the two retrieval functions, the naive approach of counting all clicks equally leads to a suboptimal test. We present new methods for learning how to score different types of clicks so that the resulting test statistic optimizes the statistical power of the experiment. This can lead to substantial savings in the amount of data required to reach a target confidence level. Our methods are evaluated on an operational search engine over a collection of scientific articles.
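The abstract does not spell out the interleaving procedure or the exact form of the learned test statistic, so the sketch below is only a rough illustration of the setting: it pairs team-draft interleaving (one common interleaving variant) with a per-query score that weights clicks by type, and a simple z-statistic over queries. The function names, click types, and weight values here are hypothetical assumptions for illustration, not details taken from the paper.

```python
import random
from statistics import mean, stdev

def team_draft_interleave(ranking_a, ranking_b, k=10, rng=random):
    """Team-draft interleaving: in each round the two rankers, in random
    order, contribute their highest-ranked result not yet shown.
    Returns the interleaved list and which ranker contributed each result."""
    rankings = {'A': ranking_a, 'B': ranking_b}
    pointers = {'A': 0, 'B': 0}
    interleaved, team = [], {}
    while len(interleaved) < k and any(pointers[t] < len(rankings[t]) for t in 'AB'):
        order = ['A', 'B'] if rng.random() < 0.5 else ['B', 'A']
        for who in order:
            src, i = rankings[who], pointers[who]
            while i < len(src) and src[i] in team:   # skip results already shown
                i += 1
            if i < len(src):
                interleaved.append(src[i])
                team[src[i]] = who
                i += 1
            pointers[who] = i
            if len(interleaved) >= k:
                break
    return interleaved, team

def weighted_score(clicks, team, weights):
    """Per-query statistic: weighted credit for clicks on A's results minus
    weighted credit for clicks on B's results. `clicks` is a list of
    (document, click_type) pairs; the click-type weights are what would be
    learned to maximize the power of the test."""
    score = 0.0
    for doc, click_type in clicks:
        w = weights.get(click_type, 1.0)
        score += w if team[doc] == 'A' else -w
    return score

def z_statistic(per_query_scores):
    """One-sample z-statistic for the mean per-query score differing from 0.
    At a fixed number of queries, a larger |z| means a more powerful test."""
    n = len(per_query_scores)
    return mean(per_query_scores) / (stdev(per_query_scores) / n ** 0.5)

if __name__ == "__main__":
    random.seed(0)
    # One toy interleaving reused for three simulated queries; in practice
    # each query gets its own interleaved list and team assignment.
    shown, team = team_draft_interleave(['d1', 'd2', 'd3'], ['d2', 'd4', 'd1'], k=4)
    weights = {'last_click': 2.0, 'other_click': 1.0}   # hypothetical learned weights
    per_query_scores = [
        weighted_score([('d1', 'last_click')], team, weights),
        weighted_score([('d2', 'other_click'), ('d4', 'last_click')], team, weights),
        weighted_score([('d3', 'other_click')], team, weights),
    ]
    print(shown, round(z_statistic(per_query_scores), 3))
```

In this toy framing, the paper's idea corresponds to choosing the click-type weights so that the resulting test statistic is as powerful as possible at a fixed number of queries, rather than fixing every weight to 1 as in naive click counting.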

    Published In

    SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
    July 2010
    944 pages
    ISBN:9781450301534
    DOI:10.1145/1835449
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 July 2010

    Author Tags

    1. click-through data
    2. implicit feedback
    3. retrieval evaluation

    Qualifiers

    • Research-article

    Conference

    SIGIR '10

    Acceptance Rates

SIGIR '10 Paper Acceptance Rate: 87 of 520 submissions (17%)
Overall Acceptance Rate: 792 of 3,983 submissions (20%)

    Cited By

• (2024) Learning Metrics that Maximise Power for Accelerated A/B-Tests. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5183-5193. DOI: 10.1145/3637528.3671512. Online publication date: 25-Aug-2024.
• (2022) Debiased Balanced Interleaving at Amazon Search. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 2913-2922. DOI: 10.1145/3511808.3557123. Online publication date: 17-Oct-2022.
• (2019) Variance Reduction in Gradient Exploration for Online Learning to Rank. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 835-844. DOI: 10.1145/3331184.3331264. Online publication date: 18-Jul-2019.
• (2017) Sensitive and Scalable Online Evaluation with Theoretical Guarantees. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 77-86. DOI: 10.1145/3132847.3132895. Online publication date: 6-Nov-2017.
• (2017) Learning Sensitive Combinations of A/B Test Metrics. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 651-659. DOI: 10.1145/3018661.3018708. Online publication date: 2-Feb-2017.
• (2016) Online Evaluation for Information Retrieval. Foundations and Trends in Information Retrieval, 10(1), pp. 1-117. DOI: 10.1561/1500000051. Online publication date: 1-Jun-2016.
• (2016) Belief and truth in hypothesised behaviours. Artificial Intelligence, 235, pp. 63-94. DOI: 10.1016/j.artint.2016.02.004. Online publication date: 1-Jun-2016.
• (2016) A Short Survey on Online and Offline Methods for Search Quality Evaluation. Information Retrieval, pp. 38-87. DOI: 10.1007/978-3-319-41718-9_3. Online publication date: 26-Jul-2016.
• (2015) Are you doing what I think you are doing? Criticising uncertain agent models. Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 52-61. DOI: 10.5555/3020847.3020854. Online publication date: 12-Jul-2015.
• (2015) Generalized Team Draft Interleaving. Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 773-782. DOI: 10.1145/2806416.2806477. Online publication date: 17-Oct-2015.