DOI: 10.1145/1835449.1835560

Comparing the sensitivity of information retrieval metrics

Published: 19 July 2010

Abstract

Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating information retrieval systems based on user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods.
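
As a rough illustration of the offline measures named above (a minimal sketch, not code from the paper), the following Python computes Precision@k, Average Precision, and NDCG@k for one query's ranked results; MAP is the mean of Average Precision over the judged query set. The graded gain 2^rel - 1 and log2 rank discount used for NDCG are common conventions assumed here.

import math

def precision_at_k(rels, k):
    # Fraction of the top-k results that are relevant (relevance label > 0).
    return sum(1 for r in rels[:k] if r > 0) / k

def average_precision(rels):
    # Binary AP: average of Precision@i over the ranks i where a relevant result appears.
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def dcg_at_k(rels, k):
    # DCG with graded gains 2^rel - 1 and a log2(rank + 1) discount (one common formulation).
    return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal else 0.0

# One query's graded judgments (0-3), listed in the system's ranked order.
rels = [3, 0, 2, 0, 1]
print(precision_at_k(rels, 5), average_precision(rels), ndcg_at_k(rels, 5))
# MAP and mean NDCG are obtained by averaging these per-query scores over all judged queries.
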
We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, and this is about as reliable as interleaving with 50,000 user impressions. Amongst the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance interleaving sensitivity.
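
The abstract does not specify which interleaving variant is used; the sketch below illustrates team-draft interleaving, one widely used click-based scheme, in which the two rankings are merged by alternating draft picks and each click is credited to the ranking that contributed the clicked document. The function names and tie-breaking details are illustrative assumptions, not the paper's exact procedure.

import random

def team_draft_interleave(ranking_a, ranking_b):
    # Merge two rankings by alternating draft picks; record which team
    # (ranker A or ranker B) contributed each document shown to the user.
    interleaved, team = [], {}
    a, b = list(ranking_a), list(ranking_b)
    picks_a = picks_b = 0
    while a or b:
        a = [d for d in a if d not in team]   # drop documents already placed
        b = [d for d in b if d not in team]
        if not a and not b:
            break
        # The team with fewer picks goes next; break ties with a coin flip.
        a_turn = picks_a < picks_b or (picks_a == picks_b and random.random() < 0.5)
        if a_turn and a:
            doc = a.pop(0); team[doc] = 'A'; picks_a += 1
        elif b:
            doc = b.pop(0); team[doc] = 'B'; picks_b += 1
        else:
            doc = a.pop(0); team[doc] = 'A'; picks_a += 1
        interleaved.append(doc)
    return interleaved, team

def credit_clicks(team, clicked_docs):
    # Per-impression outcome: which ranker contributed more of the clicked documents.
    wins_a = sum(1 for d in clicked_docs if team.get(d) == 'A')
    wins_b = sum(1 for d in clicked_docs if team.get(d) == 'B')
    return 'A' if wins_a > wins_b else 'B' if wins_b > wins_a else 'tie'

# Example impression: interleave two rankings and credit one simulated click.
random.seed(0)  # fixed seed only to make the demo output deterministic
merged, team = team_draft_interleave(['d1', 'd2', 'd3'], ['d2', 'd4', 'd1'])
print(merged, credit_clicks(team, ['d2']))
# Aggregating these per-impression outcomes over many queries yields the
# interleaving preference between the two rankers.
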

    Published In

    SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
    July 2010
    944 pages
    ISBN:9781450301534
    DOI:10.1145/1835449

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 July 2010

    Author Tags

    1. evaluation
    2. interleaving
    3. search

    Qualifiers

    • Research-article

    Conference

    SIGIR '10

    Acceptance Rates

    SIGIR '10 paper acceptance rate: 87 of 520 submissions (17%)
    Overall acceptance rate: 792 of 3,983 submissions (20%)

    Cited By

    • (2024) Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13246-13257. DOI: 10.1109/CVPR52733.2024.01258. Online publication date: 16-Jun-2024.
    • (2024) Recommendation with item response theory. Behaviormetrika. DOI: 10.1007/s41237-024-00244-3. Online publication date: 26-Nov-2024.
    • (2024) Artificial Intelligence-Based Expert Prioritizing and Hybrid Quantum Picture Fuzzy Rough Sets for Investment Decisions of Virtual Energy Market in the Metaverse. International Journal of Fuzzy Systems, 26(7), 2109-2131. DOI: 10.1007/s40815-024-01716-0. Online publication date: 20-Apr-2024.
    • (2024) Top-K Pairwise Ranking: Bridging the Gap Among Ranking-Based Measures for Multi-label Classification. International Journal of Computer Vision, 133(1), 211-253. DOI: 10.1007/s11263-024-02157-w. Online publication date: 26-Jul-2024.
    • (2023) Adaptive KNN-Based Extended Collaborative Filtering Recommendation Services. Big Data and Cognitive Computing, 7(2), 106. DOI: 10.3390/bdcc7020106. Online publication date: 31-May-2023.
    • (2023) Interleaved Online Testing in Large-Scale Systems. Companion Proceedings of the ACM Web Conference 2023, pp. 921-926. DOI: 10.1145/3543873.3587572. Online publication date: 30-Apr-2023.
    • (2023) Viewpoint Diversity in Search Results. Advances in Information Retrieval, pp. 279-297. DOI: 10.1007/978-3-031-28244-7_18. Online publication date: 17-Mar-2023.
    • (2022) Understanding and Evaluating Search Experience. Synthesis Lectures on Information Concepts, Retrieval, and Services, 14(1), 1-105. DOI: 10.2200/S01166ED1V01Y202202ICR077. Online publication date: 28-Mar-2022.
    • (2022) Learning to rank for test case prioritization. Proceedings of the 15th Workshop on Search-Based Software Testing, pp. 16-24. DOI: 10.1145/3526072.3527525. Online publication date: 9-May-2022.
    • (2022) Debiased Balanced Interleaving at Amazon Search. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 2913-2922. DOI: 10.1145/3511808.3557123. Online publication date: 17-Oct-2022.