DOI: 10.1145/3077136.3080840
Research Article

Comparing In Situ and Multidimensional Relevance Judgments

Published: 07 August 2017

ABSTRACT

To address concerns about TREC-style relevance judgments, we explore two improvements. The first seeks to make relevance judgments contextual, collecting in situ feedback from users in an interactive search session and embracing usefulness as the primary judgment criterion. The second collects multidimensional assessments to complement relevance or usefulness judgments, with four distinct alternative aspects examined in this paper: novelty, understandability, reliability, and effort.

We evaluate the different types of judgments by correlating them with six user experience measures collected in a lab user study. Results show that switching from TREC-style relevance criteria to usefulness is fruitful, but in situ judgments do not exhibit clear benefits over judgments collected without context. In contrast, combining relevance or usefulness with the four alternative judgments consistently improves the correlation with user experience measures, suggesting that future IR systems should adopt multi-aspect search result judgments in development and evaluation.

We further examine implicit feedback techniques for predicting these judgments. We find that click dwell time, a popular indicator of search result quality, can predict some but not all dimensions of the judgments. We enrich current implicit feedback methods using post-click user interaction in a search session and achieve better prediction for all six dimensions of judgments.
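The correlation analysis described above can be illustrated with a minimal sketch. This is not the authors' actual pipeline; it simply shows the kind of computation involved: measuring the Pearson correlation between an implicit signal (here, click dwell time) and an explicit judgment dimension (here, a 1-5 usefulness rating). All data values below are invented for illustration.

```python
# Minimal sketch: correlating an implicit-feedback signal with an
# explicit judgment dimension. Data is invented, not from the paper.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-result observations: click dwell time in seconds,
# and a 1-5 usefulness judgment collected in situ.
dwell = [3, 45, 120, 8, 60, 15]
usefulness = [1, 4, 5, 2, 4, 2]

print(round(pearson(dwell, usefulness), 2))  # prints 0.92
```

In the same spirit, a richer feature set (post-click interactions in the session, such as subsequent queries or pagination) would be correlated with, or used to predict, each of the judgment dimensions separately.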


Published in

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2017, 1476 pages
ISBN: 9781450350228
DOI: 10.1145/3077136
Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

SIGIR '17 paper acceptance rate: 78 of 362 submissions (22%). Overall acceptance rate: 792 of 3,983 submissions (20%).
