Comparing In Situ and Multidimensional Relevance Judgments

ABSTRACT
To address concerns about TREC-style relevance judgments, we explore two improvements. The first makes relevance judgments contextual: it collects in situ feedback from users in an interactive search session and adopts usefulness as the primary judgment criterion. The second collects multidimensional assessments to complement relevance or usefulness judgments, with four distinct alternative aspects examined in this paper: novelty, understandability, reliability, and effort.
We evaluate different types of judgments by correlating them with six user experience measures collected from a lab user study. Results show that switching from TREC-style relevance criteria to usefulness is fruitful, but in situ judgments do not exhibit clear benefits over the judgments collected without context. In contrast, combining relevance or usefulness with the four alternative judgments consistently improves the correlation with user experience measures, suggesting future IR systems should adopt multi-aspect search result judgments in development and evaluation.
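As an illustration of this kind of analysis (a hypothetical sketch, not the paper's actual procedure — the judgment scores and measure values below are invented), rank correlation such as Spearman's ρ can quantify how well per-result judgments track a user experience measure. A minimal stdlib implementation:

```python
def ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # group tied values together
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Illustrative only: usefulness judgments vs. a satisfaction measure
usefulness = [1, 2, 2, 4, 5]
satisfaction = [2, 3, 3, 6, 7]
rho = spearman(usefulness, satisfaction)
```

A stronger ρ for one judgment type than another (e.g., usefulness vs. TREC-style relevance) would be evidence that the former better reflects user experience.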
We further examine implicit feedback techniques for predicting these judgments. We find that click dwell time, a popular indicator of search result quality, predicts some but not all dimensions of the judgments. We enrich current implicit feedback methods with post-click user interactions in the search session and achieve better predictions for all six dimensions of judgments.
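To make the prediction setup concrete (again a hypothetical sketch under assumed features, not the paper's model), one can train a simple logistic regression that maps implicit signals — e.g., normalized click dwell time plus a post-click signal such as the query reformulation rate — to a binary usefulness label:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Plain SGD logistic regression; X is a list of feature vectors."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi  # gradient of log loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 = predicted useful, 0 = not useful."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= 0.5 else 0

# Toy, invented data: [normalized dwell time, reformulation rate].
# Long dwell with few reformulations is labeled useful.
X = [[0.9, 0.1], [0.8, 0.0], [0.7, 0.2],
     [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg(X, y)
```

Adding session-level post-click features alongside dwell time mirrors the enrichment described above: dwell time alone captures only part of the signal, so the extra features give the model more to discriminate on.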