ABSTRACT
We introduce and validate bootstrap techniques to compute confidence intervals that quantify the effect of test-collection variability on the average precision (AP) and mean average precision (MAP) IR effectiveness measures. We model the test collection in IR evaluation as representative of a population of materially similar collections, whose documents are drawn from an infinite pool with similar characteristics. Our model accurately predicts the degree of concordance between system results on randomly selected halves of the TREC-6 ad hoc corpus. We advance a framework for statistical evaluation that models other sources of chance variation in the same way, providing input for meta-analysis techniques.
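The bootstrap approach described above can be sketched as a percentile bootstrap over per-topic AP scores: topics are resampled with replacement, MAP is recomputed for each resample, and the interval is read off the resulting distribution. This is a minimal illustration of the general technique, not the paper's exact procedure; the function name and parameters are illustrative.

```python
import random
import statistics

def bootstrap_ci_map(ap_scores, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for MAP.

    ap_scores   : per-topic average-precision values for one system
    n_resamples : number of bootstrap resamples of the topic set
    alpha       : 0.05 gives a 95% interval
    """
    rng = random.Random(seed)
    n = len(ap_scores)
    # Resample topics with replacement and recompute MAP each time.
    maps = sorted(
        statistics.mean(rng.choices(ap_scores, k=n))
        for _ in range(n_resamples)
    )
    # Read the interval endpoints off the empirical distribution.
    lo = maps[int((alpha / 2) * n_resamples)]
    hi = maps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Example: 95% CI for MAP over a hypothetical 8-topic run.
ap = [0.12, 0.34, 0.28, 0.45, 0.51, 0.19, 0.40, 0.33]
lo, hi = bootstrap_ci_map(ap)
```

The same resampling scheme extends to paired comparisons between two systems by bootstrapping the per-topic AP differences instead of the raw scores.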
REFERENCES
- Buckley, C., and Voorhees, E. M. Evaluating evaluation measure stability. In SIGIR Conference 2000 (Athens, Greece, 2000).
- Cormack, G. V., Palmer, C. R., and Clarke, C. L. A. Efficient construction of large test collections. In SIGIR Conference 1998 (Melbourne, Australia, 1998).
- Efron, B., and Tibshirani, R. J. An Introduction to the Bootstrap. Chapman and Hall, New York, 1994.
- Fisher, R. A. Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society 22 (1925), 700--725.
- Glass, G. V. Meta-analysis at 25. http://glass.ed.asu.edu/gene/papers/meta25.html, 2000.
- Hull, D. A. Using statistical testing in the evaluation of retrieval experiments. In Research and Development in Information Retrieval (1993), pp. 329--338.
- Lenhard, J. Models and statistical inference: The controversy between Fisher and Neyman-Pearson. British Journal for the Philosophy of Science (2006).
- Rothman, K. J., and Greenland, S. Modern Epidemiology. Lippincott Williams & Wilkins, 1998.
- Sanderson, M., and Joho, H. Test collections with no system pooling. In SIGIR Conference 2004 (Sheffield, UK, 2004).
- Sanderson, M., and Zobel, J. Information retrieval evaluation: Effort, sensitivity, and reliability. In SIGIR Conference 2005 (Salvador, Brazil, 2005).
- Savoy, J. Statistical inference in retrieval effectiveness evaluation. Information Processing and Management 33, 4 (1997), 495--512.
- Tague-Sutcliffe, J. The pragmatics of information retrieval experimentation, revisited. Information Processing and Management 28, 4 (1992), 467--490.
- Tague-Sutcliffe, J., and Blustein, J. A statistical analysis of the TREC-3 data. In Proceedings of TREC-3, The Third Information Retrieval Conference (1994), pp. 385--398.
- Voorhees, E., and Harman, D. Overview of the Sixth Text REtrieval Conference (TREC-6). In 6th Text REtrieval Conference (Gaithersburg, MD, 1997).
- Voorhees, E. M. Variations in relevance judgements and the measurement of retrieval effectiveness. In SIGIR Conference 1998 (Melbourne, Australia, 1998).
- Voorhees, E. M. Overview of the TREC-2004 robust track. In 13th Text REtrieval Conference (Gaithersburg, MD, 2004).
- Voorhees, E. M., and Buckley, C. The effect of topic set size on retrieval experiment error. In SIGIR Conference 2002 (Tampere, Finland, 2002).
- Voorhees, E. M., and Buckley, C. Retrieval evaluation with incomplete information. In SIGIR Conference 2004 (Sheffield, UK, 2004).
- Voorhees, E. M., and Harman, D. K., Eds. TREC - Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
- Zobel, J. How reliable are the results of large-scale information retrieval experiments? In SIGIR Conference 1998 (Melbourne, Australia, 1998).
Index Terms
- Statistical precision of information retrieval evaluation