DOI: 10.1145/1277741.1277754
Article

Robust test collections for retrieval evaluation

Published: 23 July 2007

Abstract

Low-cost methods for acquiring relevance judgments can be a boon to researchers who need to evaluate new retrieval tasks or topics but do not have the resources to make thousands of judgments. While these judgments are very useful for a one-time evaluation, it is not clear that they can be trusted when re-used to evaluate new systems. In this work, we formally define what it means for judgments to be reusable: the confidence in an evaluation of new systems can be accurately assessed from an existing set of relevance judgments. We then present a method for augmenting a set of relevance judgments with relevance estimates that require no additional assessor effort. Using this method practically guarantees reusability: with as few as five judgments per topic taken from only two systems, we can reliably evaluate a larger set of ten systems. Even the smallest sets of judgments can be useful for evaluation of new systems.
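
The full estimation procedure appears in the paper itself; as a rough, hypothetical illustration of what "assessing confidence from an existing set of judgments" can look like, the Python sketch below completes a small judgment set by sampling relevance for unjudged documents and reports how often one system beats another on precision@10. Everything here is an assumption for illustration: the toy document ids, the five judgments, the flat 0.3 relevance estimates, and the use of precision@10 in place of the average-precision-based measures the paper evaluates.

```python
import random

def precision_at_k(ranking, rel, k=10):
    """Precision@k for a ranked list of doc ids under a 0/1 relevance map."""
    return sum(rel[d] for d in ranking[:k]) / k

def confidence_a_beats_b(rank_a, rank_b, judged, p_rel, k=10, trials=10000):
    """Fraction of sampled judgment completions in which A beats B on P@k.

    This is a generic Monte Carlo sketch, not the paper's algorithm:
    judged documents keep their labels, and each unjudged document is
    drawn from an estimated probability of relevance.
    """
    docs = set(rank_a[:k]) | set(rank_b[:k])
    wins = 0
    for _ in range(trials):
        rel = {}
        for d in docs:
            if d in judged:
                rel[d] = judged[d]                      # known judgment
            else:
                rel[d] = int(random.random() < p_rel.get(d, 0.5))  # estimate
        if precision_at_k(rank_a, rel, k) > precision_at_k(rank_b, rel, k):
            wins += 1
    return wins / trials

# Hypothetical toy data: two rankings, five judgments, flat 0.3 estimates.
rank_a = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
rank_b = ["d2", "d1", "d11", "d12", "d5", "d13", "d14", "d3", "d15", "d16"]
judged = {"d1": 1, "d2": 1, "d3": 0, "d11": 0, "d12": 1}
p_rel = {d: 0.3 for d in set(rank_a) | set(rank_b)}
print(f"P(A beats B on P@10) ~ {confidence_a_beats_b(rank_a, rank_b, judged, p_rel):.2f}")
```

Under this reading, a set of judgments is reusable when the confidence it yields for a new pair of systems can itself be trusted: values near 0.5 signal that more judgments are needed before the comparison can be called.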



Published In

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. evaluation
  2. information retrieval
  3. reusability
  4. test collections

Qualifiers

  • Article

Conference

SIGIR07: The 30th Annual International SIGIR Conference
July 23 - 27, 2007
Amsterdam, The Netherlands

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%


Cited By

  • (2020) Research. Information Retrieval: A Biomedical and Health Perspective, pp. 337-405. DOI: 10.1007/978-3-030-47686-1_8. Online publication date: 23-Jul-2020.
  • (2018) Studying Topical Relevance with Evidence-based Crowdsourcing. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1253-1262. DOI: 10.1145/3269206.3271779. Online publication date: 17-Oct-2018.
  • (2018) Estimating Measurement Uncertainty for Information Retrieval Effectiveness Metrics. Journal of Data and Information Quality 10(3), pp. 1-22. DOI: 10.1145/3239572. Online publication date: 29-Sep-2018.
  • (2018) When to stop making relevance judgments? A study of stopping methods for building information retrieval test collections. Journal of the Association for Information Science and Technology 70(1), pp. 49-60. DOI: 10.1002/asi.24077. Online publication date: 12-Dec-2018.
  • (2017) Building Test Collections. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1407-1410. DOI: 10.1145/3077136.3082064. Online publication date: 7-Aug-2017.
  • (2017) A Novel Query Extension Method Based on LDA. Advances in Internetworking, Data & Web Technologies, pp. 253-261. DOI: 10.1007/978-3-319-59463-7_25. Online publication date: 28-May-2017.
  • (2016) Pearson Rank. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 941-944. DOI: 10.1145/2911451.2914728. Online publication date: 7-Jul-2016.
  • (2016) A Short Survey on Online and Offline Methods for Search Quality Evaluation. Information Retrieval, pp. 38-87. DOI: 10.1007/978-3-319-41718-9_3. Online publication date: 26-Jul-2016.
  • (2015) Search Result Diversification. Foundations and Trends in Information Retrieval 9(1), pp. 1-90. DOI: 10.1561/1500000040. Online publication date: 1-Mar-2015.
  • (2015) Language-independent Query Representation for IR Model Parameter Estimation on Unlabeled Collections. Proceedings of the 2015 International Conference on the Theory of Information Retrieval, pp. 121-130. DOI: 10.1145/2808194.2809451. Online publication date: 27-Sep-2015.
