DOI: 10.1145/1835449.1835560

Comparing the sensitivity of information retrieval metrics

Published: 19 July 2010

Abstract

Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating information retrieval systems based on user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods.
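
As a rough illustration of the offline measures named above (a minimal sketch, not code from the paper), the following Python computes Precision@k, Average Precision, and NDCG@k for one query's ranked results; MAP is the mean of Average Precision over the judged query set. The graded gain 2^rel - 1 and log2 rank discount used for NDCG are common conventions assumed here.

import math

def precision_at_k(rels, k):
    # Fraction of the top-k results that are relevant (relevance label > 0).
    return sum(1 for r in rels[:k] if r > 0) / k

def average_precision(rels):
    # Binary AP: average of Precision@i over the ranks i where a relevant result appears.
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def dcg_at_k(rels, k):
    # DCG with graded gains 2^rel - 1 and a log2(rank + 1) discount (one common formulation).
    return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal else 0.0

# One query's graded judgments (0-3), listed in the system's ranked order.
rels = [3, 0, 2, 0, 1]
print(precision_at_k(rels, 5), average_precision(rels), ndcg_at_k(rels, 5))
# MAP and mean NDCG are obtained by averaging these per-query scores over all judged queries.
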
We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, and this is about as reliable as interleaving with 50,000 user impressions. Amongst the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance interleaving sensitivity.
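
The abstract does not specify which interleaving variant is used; the sketch below illustrates team-draft interleaving, one widely used click-based scheme, in which the two rankings are merged by alternating draft picks and each click is credited to the ranking that contributed the clicked document. The function names and tie-breaking details are illustrative assumptions, not the paper's exact procedure.

import random

def team_draft_interleave(ranking_a, ranking_b):
    # Merge two rankings by alternating draft picks; record which team
    # (ranker A or ranker B) contributed each document shown to the user.
    interleaved, team = [], {}
    a, b = list(ranking_a), list(ranking_b)
    picks_a = picks_b = 0
    while a or b:
        a = [d for d in a if d not in team]   # drop documents already placed
        b = [d for d in b if d not in team]
        if not a and not b:
            break
        # The team with fewer picks goes next; break ties with a coin flip.
        a_turn = picks_a < picks_b or (picks_a == picks_b and random.random() < 0.5)
        if a_turn and a:
            doc = a.pop(0); team[doc] = 'A'; picks_a += 1
        elif b:
            doc = b.pop(0); team[doc] = 'B'; picks_b += 1
        else:
            doc = a.pop(0); team[doc] = 'A'; picks_a += 1
        interleaved.append(doc)
    return interleaved, team

def credit_clicks(team, clicked_docs):
    # Per-impression outcome: which ranker contributed more of the clicked documents.
    wins_a = sum(1 for d in clicked_docs if team.get(d) == 'A')
    wins_b = sum(1 for d in clicked_docs if team.get(d) == 'B')
    return 'A' if wins_a > wins_b else 'B' if wins_b > wins_a else 'tie'

# Example impression: interleave two rankings and credit one simulated click.
random.seed(0)  # fixed seed only to make the demo output deterministic
merged, team = team_draft_interleave(['d1', 'd2', 'd3'], ['d2', 'd4', 'd1'])
print(merged, credit_clicks(team, ['d2']))
# Aggregating these per-impression outcomes over many queries yields the
# interleaving preference between the two rankers.
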

    Published In

    SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
    July 2010
    944 pages
    ISBN:9781450301534
    DOI:10.1145/1835449

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 July 2010

    Author Tags

    1. evaluation
    2. interleaving
    3. search

    Qualifiers

    • Research-article

    Conference

    SIGIR '10

    Acceptance Rates

    SIGIR '10 paper acceptance rate: 87 of 520 submissions (17%)
    Overall acceptance rate: 792 of 3,983 submissions (20%)

    Cited By

    • (2024) Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13246-13257. DOI: 10.1109/CVPR52733.2024.01258. Online publication date: 16-Jun-2024.
    • (2024) Recommendation with item response theory. Behaviormetrika. DOI: 10.1007/s41237-024-00244-3. Online publication date: 26-Nov-2024.
    • (2024) Artificial Intelligence-Based Expert Prioritizing and Hybrid Quantum Picture Fuzzy Rough Sets for Investment Decisions of Virtual Energy Market in the Metaverse. International Journal of Fuzzy Systems, 26(7), 2109-2131. DOI: 10.1007/s40815-024-01716-0. Online publication date: 20-Apr-2024.
    • (2024) Top-K Pairwise Ranking: Bridging the Gap Among Ranking-Based Measures for Multi-label Classification. International Journal of Computer Vision, 133(1), 211-253. DOI: 10.1007/s11263-024-02157-w. Online publication date: 26-Jul-2024.
    • (2023) Adaptive KNN-Based Extended Collaborative Filtering Recommendation Services. Big Data and Cognitive Computing, 7(2), 106. DOI: 10.3390/bdcc7020106. Online publication date: 31-May-2023.
    • (2023) Interleaved Online Testing in Large-Scale Systems. Companion Proceedings of the ACM Web Conference 2023, pp. 921-926. DOI: 10.1145/3543873.3587572. Online publication date: 30-Apr-2023.
    • (2023) Viewpoint Diversity in Search Results. Advances in Information Retrieval, pp. 279-297. DOI: 10.1007/978-3-031-28244-7_18. Online publication date: 17-Mar-2023.
    • (2022) Understanding and Evaluating Search Experience. Synthesis Lectures on Information Concepts, Retrieval, and Services, 14(1), 1-105. DOI: 10.2200/S01166ED1V01Y202202ICR077. Online publication date: 28-Mar-2022.
    • (2022) Learning to rank for test case prioritization. Proceedings of the 15th Workshop on Search-Based Software Testing, pp. 16-24. DOI: 10.1145/3526072.3527525. Online publication date: 9-May-2022.
    • (2022) Debiased Balanced Interleaving at Amazon Search. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 2913-2922. DOI: 10.1145/3511808.3557123. Online publication date: 17-Oct-2022.