ABSTRACT
While many multidimensional models of relevance have been posited, prior studies have been largely exploratory rather than confirmatory. Lacking a methodological framework to quantify the relationships among factors or to measure model fit to observed data, many past models could not be empirically tested or falsified. To enable more positivist experimentation, Xu and Chen [77] proposed a psychometric framework for multidimensional relevance modeling. However, we show that their framework exhibits several methodological limitations that could call into question the validity of findings drawn from it. In this work, we identify and address these limitations, scale their methodology via crowdsourcing, and describe quality control methods from psychometrics that stand to benefit crowdsourced IR studies in general. The methodology we describe for relevance judging is expected to benefit both human-centered and systems-centered IR.
REFERENCES
- Alonso, O. 2013. Implementing crowdsourcing-based relevance experimentation: an industrial perspective. Information Retrieval. 16, 2, 101--120.
- Anderson, J.C. and Gerbing, D.W. 1988. Structural equation modeling in practice: A review and recommended two-step approach. Psychological Bulletin. 103, 3, 411--423.
- Bailey, P. et al. 2008. Relevance assessment: are judges exchangeable and does it matter? Proceedings of SIGIR'2008, 667--674.
- Balatsoukas, P. and Ruthven, I. 2012. An eye-tracking approach to the analysis of relevance judgments on the Web: The case of Google search engine. JASIST. 63, 9, 1728--1746.
- Bookstein, A. 1979. Relevance. JASIS. 30, 5, 269--273.
- Barry, C.L. 1994. User-defined relevance criteria: An exploratory study. JASIS. 45, 3, 149--159.
- Barry, C.L. and Schamber, L. 1998. Users' criteria for relevance evaluation: A cross-situational comparison. IP & M. 34, 2--3, 219--236.
- Bateman, J. 1998. Changes in Relevance Criteria: A Longitudinal Study. Proceedings of the ASIS Annual Meeting. 35, 23--32.
- Behrend, T.S. et al. 2011. The viability of crowdsourcing for survey research. Behavior Research Methods. 43, 3, 800--813.
- Blanco, R. et al. 2011. Repeatable and Reliable Search System Evaluation Using Crowdsourcing. Proceedings of SIGIR'2011 (New York, NY, USA), 923--932.
- Borlund, P. 2003. The concept of relevance in IR. JASIST. 54, 10, 913--925.
- Boyce, B. 1982. Beyond topicality: A two stage view of relevance and the retrieval process. IP & M. 18, 3, 105--109.
- Bradford, S.C. 1934. Sources of information on specific subjects. Engineering: An Illustrated Weekly Journal (London). 137, 26, 85--86.
- Browne, M.W. 2000. Psychometrics. Journal of the American Statistical Association. 95, 450, 661--665.
- Cacioppo, J.T. and Petty, R.E. 1984. The Elaboration Likelihood Model of Persuasion. Advances in Consumer Research. 11, 1, 673--675.
- Chouldechova, A. and Mease, D. 2013. Differences in Search Engine Evaluations Between Query Owners and Non-owners. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 103--112.
- Cognitive Interviewing: http://www.uk.sagepub.com/textbooks/Book225856?prodId=Book225856. Accessed: 2014-01-24.
- Cohen, J. 1988. Statistical power analysis for the behavioral sciences. L. Erlbaum Associates.
- Cool, C. et al. 1993. Characteristics of texts affecting relevance judgements. Proceedings of the 14th National Online Meeting, 77--84.
- Da Costa Pereira, C. et al. 2012. Multidimensional relevance: Prioritized aggregation in a personalized Information Retrieval setting. IP & M. 48, 2, 340--357.
- Cuadra, C.A. and Katter, R.V. 1967. Opening the Black Box of "Relevance." Journal of Documentation. 23, 4, 291--303.
- Dwyer, J. 2002. Communication in Business: Strategies and Skills. Prentice Hall.
- Eickhoff, C. et al. 2013. Copulas for Information Retrieval. Proceedings of SIGIR'2013 (New York, NY, USA), 663--672.
- Eickhoff, C. and de Vries, A.P. 2013. Increasing cheat robustness of crowdsourcing tasks. Information Retrieval. 16, 2, 121--137.
- Franklin, S.B. et al. 1995. Parallel Analysis: a method for determining significant principal components. Journal of Vegetation Science. 6, 1, 99--106.
- Furr, M. 2011. Scale Construction and Psychometrics for Social and Personality Psychology. SAGE.
- Goldberg, L.R. and Kilkowski, J.M. 1985. The prediction of semantic consistency in self-descriptions: characteristics of persons and of terms that affect the consistency of responses to synonym and antonym pairs. Journal of Personality and Social Psychology. 48, 1, 82--98.
- Green, R. 1995. Topical relevance relationships. I. Why topic matching fails. JASIS. 46, 9, 646--653.
- Greisdorf, H. 2003. Relevance thresholds: a multi-stage predictive model of how users evaluate information. IP & M. 403--423.
- Grice, H.P. 1989. Studies in the way of words. Harvard University Press.
- Gwizdka, J. 2014. News Stories Relevance Effects on Eye-movements. Proceedings of the Symposium on Eye Tracking Research and Applications, 283--286.
- Harter, S.P. 1992. Psychological relevance and information science. JASIS. 43, 9, 602--615.
- Hatcher, L. 2013. Advanced statistics in research: reading, understanding, and writing up data analysis results. ShadowFinch Media, LLC.
- Hjørland, B. and Christensen, F.S. 2002. Work tasks and socio-cognitive relevance: A specific example. JASIST. 53, 11, 960--965.
- Hosseini, M. et al. 2012. On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents. Advances in Information Retrieval. R. Baeza-Yates et al., eds. Springer Berlin Heidelberg, 182--194.
- Hox, J.J. and Bechger, T.M. 2007. An introduction to structural equation modeling.
- Hu, L. and Bentler, P.M. 1999. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal. 6, 1, 1--55.
- Huang, X. and Soergel, D. 2013. Relevance: An improved framework for explicating the notion. JASIST. 64, 1, 18--35.
- Johnson, J.R. et al. 1981. Characteristics of Errors in Accounts Receivable and Inventory Audits. The Accounting Review. 56, 2, 270--293.
- Kazai, G. et al. 2012. An Analysis of Systematic Judging Errors in Information Retrieval. Proceedings of CIKM'2012 (New York, NY, USA), 105--114.
- Kazai, G. et al. 2011. Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking. Proceedings of SIGIR'2011 (New York, NY, USA), 205--214.
- Kittur, A. et al. 2008. Crowdsourcing User Studies with Mechanical Turk. Proceedings of SIGCHI'2008 (New York, NY, USA), 453--456.
- Lancaster, F.W. 1968. Information retrieval systems: characteristics, testing, and evaluation. Wiley.
- Lesk, M.E. and Salton, G. 1968. Relevance assessments and retrieval system evaluation. Information Storage and Retrieval. 4, 4, 343--359.
- Levitin, A. and Redman, T. 1995. Quality dimensions of a conceptual view. IP & M. 31, 1, 81--88.
- Little, G. 2009. TurKit: Tools for iterative tasks on Mechanical Turk. IEEE Symposium on Visual Languages and Human-Centric Computing, 252--253.
- Liu, T.-Y. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval. 3, 3, 225--331.
- Bentler, P.M. and Bonett, D.G. 1980. Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin. 88, 3, 588--606.
- Maron, M.E. 1977. On indexing, retrieval and the meaning of about. JASIS. 28, 1, 38--43.
- Marshall, C.C. and Shipman, F.M. 2013. Experiences Surveying the Crowd: Reflections on Methods, Participation, and Reliability. Proceedings of the 5th Annual ACM Web Science Conference, 234--243.
- Mizzaro, S. 1997. Relevance: The whole history. JASIS. 48, 9, 810--832.
- Moshfeghi, Y. et al. 2013. Understanding Relevance: An fMRI Study. Advances in Information Retrieval. P. Serdyukov et al., eds. Springer Berlin Heidelberg, 14--25.
- Mueller, R.O. and Hancock, G.R. 2008. Best practices in structural equation modeling. Best practices in quantitative methods. 488--508.
- Murphy, K.P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
- Osterlind, S.J. Modern Measurement: Theory, Principles, and Applications of Mental Appraisal, 2/E. Pearson. http://www.pearsonhighered.com/educator/product/Modern-Measurement-Theory-Principles-and-Applications-of-Mental-Appraisal/9780137010257.page. Accessed: 2014-01-24.
- Kline, R.B. Principles and Practice of Structural Equation Modeling: Third Edition. Guilford Press. http://www.guilford.com/cgi-bin/cartscript.cgi?page=pr/kline.htm&dir=research/res_quant. Accessed: 2014-01-24.
- Proceedings of the International Conference on Scientific Information -- Two Volumes: http://books.nap.edu/openbook.php?record_id=10866&page=687. Accessed: 2014-01-26.
- Rees, A.M. and Schultz, D.G. 1967. A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching. Final Report to the National Science Foundation. Volume I.
- Relevance as process: judgements in the context of scholarly research: http://www.informationr.net/ir/10-2/paper226. Accessed: 2014-01-24.
- Sanderson, M. 2010. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval. 4, 4, 247--375.
- Saracevic, T. 2007. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance. JASIST. 58, 13, 1915--1933.
- Saracevic, T. 2007. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. JASIST. 58, 13, 2126--2144.
- Schamber, L. 1994. Relevance and Information Behavior. Annual Review of Information Science and Technology (ARIST). 29, 3--48.
- Scheines, R. et al. 1999. Bayesian estimation and testing of structural equation models. Psychometrika. 64, 1, 37--52.
- Tabachnick, B.G. and Fidell, L.S. 2012. Using Multivariate Statistics. Pearson Education, Limited.
- Tang, R. and Solomon, P. 1998. Toward an understanding of the dynamics of relevance judgment: An analysis of one person's search behavior. IP & M. 34, 2--3, 237--256.
- Taylor, A.R. et al. 2007. Relationships between categories of relevance criteria and stage in task completion. IP & M. 43, 4, 1071--1084.
- The Social Construction of Meaning: An Alternative Perspective on Information Sharing: 2003. http://pubsonline.informs.org/doi/abs/10.1287/isre.14.1.87.14765. Accessed: 2014-01-24.
- Tsikrika, T. and Lalmas, M. 2007. Combining Evidence for Relevance Criteria: A Framework and Experiments in Web Retrieval. Advances in Information Retrieval. G. Amati et al., eds. Springer Berlin Heidelberg, 481--493.
- Vakkari, P. and Hakala, N. 2000. Changes in relevance criteria and problem stages in task performance. Journal of Documentation. 56, 5, 540--562.
- Voorhees, E.M. 1998. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Proceedings of SIGIR'1998 (New York, NY, USA), 315--323.
- Wilson, D. and Sperber, D. 2002. Relevance Theory. Handbook of Pragmatics. G. Ward and L. Horn, eds. Blackwell.
- De Winter, J.C.F. and Dodou, D. 2012. Factor recovery by principal axis factoring and maximum likelihood factor analysis as a function of factor pattern and sample size. Journal of Applied Statistics. 39, 4, 695--710.
- Worthington, R.L. and Whittaker, T.A. 2006. Scale Development Research: A Content Analysis and Recommendations for Best Practices. The Counseling Psychologist. 34, 6, 806--838.
- Wright, S. Correlation and causation.
- Xu, Y. (Calvin) and Chen, Z. 2006. Relevance judgment: What do information users consider beyond topicality? JASIST. 57, 7, 961--973.
- Zuccon, G. et al. 2013. Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems. Information Retrieval. 16, 2, 267--305.
Index Terms
- Multidimensional relevance modeling via psychometrics and crowdsourcing