ABSTRACT
A robust retrieval system ensures that the user experience is not degraded by poorly performing queries. Such robustness can be measured by risk-sensitive evaluation measures, which assess the extent to which a system performs worse than a given baseline system. However, using a single, particular system as the baseline is problematic, because retrieval performance varies widely among IR systems across topics. A single system therefore generally fails to provide enough information about the true baseline performance for every topic under consideration, and hence fails to measure the true risk associated with any given system. Based upon the Chi-squared statistic, we propose a new measure, ZRisk, which shows more promise because it takes multiple baselines into account when measuring risk, and a derivative measure, GeoRisk, which enhances ZRisk by also accounting for the overall magnitude of effectiveness. This paper demonstrates the benefits of ZRisk and GeoRisk on TREC data, and shows how GeoRisk can be exploited for risk-sensitive learning to rank, making use of multiple baselines within the learning objective function to obtain effective yet risk-averse/robust ranking systems. Experiments using 10,000 topics from the MSLR learning-to-rank dataset demonstrate the efficacy of the proposed Chi-squared-statistic-based objective function.
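The multiple-baseline idea can be sketched as follows. This is an illustrative formulation only, not the paper's verbatim definitions: the function names, the risk-aversion parameter `alpha`, and the exact normalisation are our assumptions. Per-topic effectiveness scores for all systems are arranged in a systems-by-topics matrix; the "baseline" score for each cell is the expected value under independence (row total times column total over the grand total, as in a Chi-squared test); signed standardised residuals are summed per system, with negative deviations penalised more heavily, and a geometric-mean-style combination with mean effectiveness yields a GeoRisk-like score.

```python
import math

def zrisk(X, i, alpha=1.0):
    """Chi-squared-style risk for system i over a systems-x-topics
    effectiveness matrix X (list of lists of non-negative scores).
    Each cell's expected score is derived from the row and column
    marginals; negative deviations are weighted by (1 + alpha)."""
    total = sum(sum(row) for row in X)
    row_sum = sum(X[i])
    z = 0.0
    for q in range(len(X[0])):
        col_sum = sum(X[s][q] for s in range(len(X)))
        e = row_sum * col_sum / total           # expected score under independence
        dev = (X[i][q] - e) / math.sqrt(e)      # signed standardised residual
        z += dev if dev >= 0 else (1 + alpha) * dev
    return z

def georisk(X, i, alpha=1.0):
    """Combine mean effectiveness of system i with a normal-CDF
    transform of its (topic-normalised) zrisk score."""
    c = len(X[0])                               # number of topics
    mean_eff = sum(X[i]) / c                    # overall magnitude of effectiveness
    phi = 0.5 * (1.0 + math.erf(zrisk(X, i, alpha) / c / math.sqrt(2)))
    return math.sqrt(mean_eff * phi)
```

For a system whose scores match the marginal expectations on every topic, the residuals vanish, zrisk is 0, and georisk reduces to the square root of half its mean effectiveness; a system that underperforms the cross-system expectation on some topics receives a negative zrisk, amplified by `alpha`.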
Risk-Sensitive Evaluation and Learning to Rank using Multiple Baselines