Comparing In Situ and Multidimensional Relevance Judgments

ABSTRACT
To address concerns about TREC-style relevance judgments, we explore two improvements. The first makes relevance judgments contextual: it collects in situ feedback from users in an interactive search session and adopts usefulness as the primary judgment criterion. The second collects multidimensional assessments to complement relevance or usefulness judgments, with four distinct alternative aspects examined in this paper: novelty, understandability, reliability, and effort.
We evaluate different types of judgments by correlating them with six user experience measures collected from a lab user study. Results show that switching from TREC-style relevance criteria to usefulness is fruitful, but in situ judgments do not exhibit clear benefits over the judgments collected without context. In contrast, combining relevance or usefulness with the four alternative judgments consistently improves the correlation with user experience measures, suggesting future IR systems should adopt multi-aspect search result judgments in development and evaluation.
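As an illustration of this kind of analysis (a hypothetical sketch, not the paper's actual procedure — the judgment scores and measure values below are invented), rank correlation such as Spearman's ρ can quantify how well per-result judgments track a user experience measure. A minimal stdlib implementation:

```python
def ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # group tied values together
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Illustrative only: usefulness judgments vs. a satisfaction measure
usefulness = [1, 2, 2, 4, 5]
satisfaction = [2, 3, 3, 6, 7]
rho = spearman(usefulness, satisfaction)
```

A stronger ρ for one judgment type than another (e.g., usefulness vs. TREC-style relevance) would be evidence that the former better reflects user experience.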
We further examine implicit feedback techniques for predicting these judgments. We find that click dwell time, a popular indicator of search result quality, predicts some but not all dimensions of the judgments. We enrich current implicit feedback methods with post-click user interactions in the search session and achieve better predictions for all six dimensions of judgments.
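To make the prediction setup concrete (again a hypothetical sketch under assumed features, not the paper's model), one can train a simple logistic regression that maps implicit signals — e.g., normalized click dwell time plus a post-click signal such as the query reformulation rate — to a binary usefulness label:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Plain SGD logistic regression; X is a list of feature vectors."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi  # gradient of log loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 = predicted useful, 0 = not useful."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= 0.5 else 0

# Toy, invented data: [normalized dwell time, reformulation rate].
# Long dwell with few reformulations is labeled useful.
X = [[0.9, 0.1], [0.8, 0.0], [0.7, 0.2],
     [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg(X, y)
```

Adding session-level post-click features alongside dwell time mirrors the enrichment described above: dwell time alone captures only part of the signal, so the extra features give the model more to discriminate on.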