DOI: 10.1145/1277741.1277754
Article

Robust test collections for retrieval evaluation

Published: 23 July 2007

Abstract

Low-cost methods for acquiring relevance judgments can be a boon to researchers who need to evaluate new retrieval tasks or topics but do not have the resources to make thousands of judgments. While these judgments are very useful for a one-time evaluation, it is not clear that they can be trusted when re-used to evaluate new systems. In this work, we formally define what it means for judgments to be reusable: the confidence in an evaluation of new systems can be accurately assessed from an existing set of relevance judgments. We then present a method for augmenting a set of relevance judgments with relevance estimates that require no additional assessor effort. Using this method practically guarantees reusability: with as few as five judgments per topic taken from only two systems, we can reliably evaluate a larger set of ten systems. Even the smallest sets of judgments can be useful for evaluation of new systems.
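
The full estimation procedure appears in the paper itself; as a rough, hypothetical illustration of what "assessing confidence from an existing set of judgments" can look like, the Python sketch below completes a small judgment set by sampling relevance for unjudged documents and reports how often one system beats another on precision@10. Everything here is an assumption for illustration: the toy document ids, the five judgments, the flat 0.3 relevance estimates, and the use of precision@10 in place of the average-precision-based measures the paper evaluates.

```python
import random

def precision_at_k(ranking, rel, k=10):
    """Precision@k for a ranked list of doc ids under a 0/1 relevance map."""
    return sum(rel[d] for d in ranking[:k]) / k

def confidence_a_beats_b(rank_a, rank_b, judged, p_rel, k=10, trials=10000):
    """Fraction of sampled judgment completions in which A beats B on P@k.

    This is a generic Monte Carlo sketch, not the paper's algorithm:
    judged documents keep their labels, and each unjudged document is
    drawn from an estimated probability of relevance.
    """
    docs = set(rank_a[:k]) | set(rank_b[:k])
    wins = 0
    for _ in range(trials):
        rel = {}
        for d in docs:
            if d in judged:
                rel[d] = judged[d]                      # known judgment
            else:
                rel[d] = int(random.random() < p_rel.get(d, 0.5))  # estimate
        if precision_at_k(rank_a, rel, k) > precision_at_k(rank_b, rel, k):
            wins += 1
    return wins / trials

# Hypothetical toy data: two rankings, five judgments, flat 0.3 estimates.
rank_a = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
rank_b = ["d2", "d1", "d11", "d12", "d5", "d13", "d14", "d3", "d15", "d16"]
judged = {"d1": 1, "d2": 1, "d3": 0, "d11": 0, "d12": 1}
p_rel = {d: 0.3 for d in set(rank_a) | set(rank_b)}
print(f"P(A beats B on P@10) ~ {confidence_a_beats_b(rank_a, rank_b, judged, p_rel):.2f}")
```

Under this reading, a set of judgments is reusable when the confidence it yields for a new pair of systems can itself be trusted: values near 0.5 signal that more judgments are needed before the comparison can be called.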



Published In

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. evaluation
  2. information retrieval
  3. reusability
  4. test collections

Qualifiers

  • Article

Conference

SIGIR07: The 30th Annual International SIGIR Conference
July 23 - 27, 2007
Amsterdam, The Netherlands

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%


Cited By

  • (2020) Research. Information Retrieval: A Biomedical and Health Perspective, pp. 337-405. DOI: 10.1007/978-3-030-47686-1_8. Online publication date: 23-Jul-2020.
  • (2018) Studying Topical Relevance with Evidence-based Crowdsourcing. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1253-1262. DOI: 10.1145/3269206.3271779. Online publication date: 17-Oct-2018.
  • (2018) Estimating Measurement Uncertainty for Information Retrieval Effectiveness Metrics. Journal of Data and Information Quality 10(3), pp. 1-22. DOI: 10.1145/3239572. Online publication date: 29-Sep-2018.
  • (2018) When to stop making relevance judgments? A study of stopping methods for building information retrieval test collections. Journal of the Association for Information Science and Technology 70(1), pp. 49-60. DOI: 10.1002/asi.24077. Online publication date: 12-Dec-2018.
  • (2017) Building Test Collections. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1407-1410. DOI: 10.1145/3077136.3082064. Online publication date: 7-Aug-2017.
  • (2017) A Novel Query Extension Method Based on LDA. Advances in Internetworking, Data & Web Technologies, pp. 253-261. DOI: 10.1007/978-3-319-59463-7_25. Online publication date: 28-May-2017.
  • (2016) Pearson Rank. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 941-944. DOI: 10.1145/2911451.2914728. Online publication date: 7-Jul-2016.
  • (2016) A Short Survey on Online and Offline Methods for Search Quality Evaluation. Information Retrieval, pp. 38-87. DOI: 10.1007/978-3-319-41718-9_3. Online publication date: 26-Jul-2016.
  • (2015) Search Result Diversification. Foundations and Trends in Information Retrieval 9(1), pp. 1-90. DOI: 10.1561/1500000040. Online publication date: 1-Mar-2015.
  • (2015) Language-independent Query Representation for IR Model Parameter Estimation on Unlabeled Collections. Proceedings of the 2015 International Conference on the Theory of Information Retrieval, pp. 121-130. DOI: 10.1145/2808194.2809451. Online publication date: 27-Sep-2015.
