ABSTRACT
While many multidimensional models of relevance have been posited, prior studies have been largely exploratory rather than confirmatory. Lacking a methodological framework to quantify the relationships among factors or to measure model fit to observed data, many past models could not be empirically tested or falsified. To enable more positivist experimentation, Xu and Chen [77] proposed a psychometric framework for multidimensional relevance modeling. However, we show that their framework exhibits several methodological limitations that could call into question the validity of findings drawn from it. In this work, we identify and address these limitations, scale their methodology via crowdsourcing, and describe quality control methods from psychometrics that stand to benefit crowdsourced IR studies in general. The methodology we describe for relevance judging is expected to benefit both human-centered and systems-centered IR.
REFERENCES
- Alonso, O. 2013. Implementing crowdsourcing-based relevance experimentation: an industrial perspective. Information Retrieval. 16, 2, 101--120.
- Anderson, J.C. and Gerbing, D.W. 1988. Structural equation modeling in practice: A review and recommended two-step approach. Psychological Bulletin. 103, 3, 411--423.
- Bailey, P. et al. 2008. Relevance assessment: are judges exchangeable and does it matter? Proceedings of SIGIR'2008, 667--674.
- Balatsoukas, P. and Ruthven, I. 2012. An eye-tracking approach to the analysis of relevance judgments on the Web: The case of Google search engine. JASIST. 63, 9, 1728--1746.
- Bookstein, A. 1979. Relevance. JASIS. 30, 5, 269--273.
- Barry, C.L. 1994. User-defined relevance criteria: An exploratory study. JASIS. 45, 3, 149--159.
- Barry, C.L. and Schamber, L. 1998. Users' criteria for relevance evaluation: A cross-situational comparison. IP & M. 34, 2--3, 219--236.
- Bateman, J. 1998. Changes in Relevance Criteria: A Longitudinal Study. Proceedings of the ASIS Annual Meeting. 35, 23--32.
- Behrend, T.S. et al. 2011. The viability of crowdsourcing for survey research. Behavior Research Methods. 43, 3, 800--813.
- Blanco, R. et al. 2011. Repeatable and Reliable Search System Evaluation Using Crowdsourcing. Proceedings of SIGIR'2011 (New York, NY, USA), 923--932.
- Borlund, P. 2003. The concept of relevance in IR. JASIST. 54, 10, 913--925.
- Boyce, B. 1982. Beyond topicality: A two stage view of relevance and the retrieval process. IP & M. 18, 3, 105--109.
- Bradford, S.C. 1934. Sources of information on specific subjects. Engineering: An Illustrated Weekly Journal (London). 137, 26, 85--86.
- Browne, M.W. 2000. Psychometrics. Journal of the American Statistical Association. 95, 450, 661--665.
- Cacioppo, J.T. and Petty, R.E. 1984. The Elaboration Likelihood Model of Persuasion. Advances in Consumer Research. 11, 1, 673--675.
- Chouldechova, A. and Mease, D. 2013. Differences in Search Engine Evaluations Between Query Owners and Non-owners. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 103--112.
- Cognitive Interviewing: http://www.uk.sagepub.com/textbooks/Book225856?prodId=Book225856. Accessed: 2014-01-24.
- Cohen, J. 1988. Statistical power analysis for the behavioral sciences. L. Erlbaum Associates.
- Cool, C. et al. 1993. Characteristics of texts affecting relevance judgements. Proceedings of the 14th National Online Meeting, 77--84.
- Da Costa Pereira, C. et al. 2012. Multidimensional relevance: Prioritized aggregation in a personalized Information Retrieval setting. IP & M. 48, 2, 340--357.
- Cuadra, C.A. and Katter, R.V. 1967. Opening the Black Box of "Relevance." Journal of Documentation. 23, 4, 291--303.
- Dwyer, J. 2002. Communication in Business: Strategies and Skills. Prentice Hall.
- Eickhoff, C. et al. 2013. Copulas for Information Retrieval. Proceedings of SIGIR'2013 (New York, NY, USA), 663--672.
- Eickhoff, C. and de Vries, A.P. 2013. Increasing cheat robustness of crowdsourcing tasks. Information Retrieval. 16, 2, 121--137.
- Franklin, S.B. et al. 1995. Parallel Analysis: a method for determining significant principal components. Journal of Vegetation Science. 6, 1, 99--106.
- Furr, M. 2011. Scale Construction and Psychometrics for Social and Personality Psychology. SAGE.
- Goldberg, L.R. and Kilkowski, J.M. 1985. The prediction of semantic consistency in self-descriptions: characteristics of persons and of terms that affect the consistency of responses to synonym and antonym pairs. Journal of Personality and Social Psychology. 48, 1, 82--98.
- Green, R. 1995. Topical relevance relationships. I. Why topic matching fails. JASIS. 46, 9, 646--653.
- Greisdorf, H. 2003. Relevance thresholds: a multi-stage predictive model of how users evaluate information. IP & M. 403--423.
- Grice, H.P. 1989. Studies in the way of words. Harvard University Press.
- Gwizdka, J. 2014. News Stories Relevance Effects on Eye-movements. Proceedings of the Symposium on Eye Tracking Research and Applications, 283--286.
- Harter, S.P. 1992. Psychological relevance and information science. JASIS. 43, 9, 602--615.
- Hatcher, L. 2013. Advanced statistics in research: reading, understanding, and writing up data analysis results. ShadowFinch Media, LLC.
- Hjørland, B. and Christensen, F.S. 2002. Work tasks and socio-cognitive relevance: A specific example. JASIST. 53, 11, 960--965.
- Hosseini, M. et al. 2012. On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents. Advances in Information Retrieval. R. Baeza-Yates et al., eds. Springer Berlin Heidelberg, 182--194.
- Hox, J.J. and Bechger, T.M. 2007. An introduction to structural equation modeling.
- Hu, L. and Bentler, P.M. 1999. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal. 6, 1, 1--55.
- Huang, X. and Soergel, D. 2013. Relevance: An improved framework for explicating the notion. JASIST. 64, 1, 18--35.
- Johnson, J.R. et al. 1981. Characteristics of Errors in Accounts Receivable and Inventory Audits. The Accounting Review. 56, 2, 270--293.
- Kazai, G. et al. 2012. An Analysis of Systematic Judging Errors in Information Retrieval. Proceedings of CIKM'2012 (New York, NY, USA), 105--114.
- Kazai, G. et al. 2011. Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking. Proceedings of SIGIR'2011 (New York, NY, USA), 205--214.
- Kittur, A. et al. 2008. Crowdsourcing User Studies with Mechanical Turk. Proceedings of SIGCHI'2008 (New York, NY, USA), 453--456.
- Lancaster, F.W. 1968. Information retrieval systems: characteristics, testing, and evaluation. Wiley.
- Lesk, M.E. and Salton, G. 1968. Relevance assessments and retrieval system evaluation. Information Storage and Retrieval. 4, 4, 343--359.
- Levitin, A. and Redman, T. 1995. Quality dimensions of a conceptual view. IP & M. 31, 1, 81--88.
- Little, G. 2009. TurKit: Tools for iterative tasks on Mechanical Turk. IEEE Symposium on Visual Languages and Human-Centric Computing, 252--253.
- Liu, T.-Y. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval. 3, 3, 225--331.
- Bentler, P.M. and Bonett, D.G. 1980. Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin. 88, 3, 588--606.
- Maron, M.E. 1977. On indexing, retrieval and the meaning of about. JASIS. 28, 1, 38--43.
- Marshall, C.C. and Shipman, F.M. 2013. Experiences Surveying the Crowd: Reflections on Methods, Participation, and Reliability. Proceedings of the 5th Annual ACM Web Science Conference, 234--243.
- Mizzaro, S. 1997. Relevance: The whole history. JASIS. 48, 9, 810--832.
- Moshfeghi, Y. et al. 2013. Understanding Relevance: An fMRI Study. Advances in Information Retrieval. P. Serdyukov et al., eds. Springer Berlin Heidelberg, 14--25.
- Mueller, R.O. and Hancock, G.R. 2008. Best practices in structural equation modeling. Best practices in quantitative methods. 488--508.
- Murphy, K.P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
- Osterlind, S.J. Modern Measurement: Theory, Principles, and Applications of Mental Appraisal, 2/E. Pearson. http://www.pearsonhighered.com/educator/product/Modern-Measurement-Theory-Principles-and-Applications-of-Mental-Appraisal/9780137010257.page. Accessed: 2014-01-24.
- Kline, R.B. Principles and Practice of Structural Equation Modeling: Third Edition. Guilford Press. http://www.guilford.com/cgi-bin/cartscript.cgi?page=pr/kline.htm&dir=research/res_quant. Accessed: 2014-01-24.
- Proceedings of the International Conference on Scientific Information -- Two Volumes: http://books.nap.edu/openbook.php?record_id=10866&page=687. Accessed: 2014-01-26.
- Rees, A.M. and Schultz, D.G. 1967. A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching. Final Report to the National Science Foundation. Volume I.
- Relevance as process: judgements in the context of scholarly research: http://www.informationr.net/ir/10-2/paper226. Accessed: 2014-01-24.
- Sanderson, M. 2010. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval. 4, 4, 247--375.
- Saracevic, T. 2007. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance. JASIST. 58, 13, 1915--1933.
- Saracevic, T. 2007. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. JASIST. 58, 13, 2126--2144.
- Schamber, L. 1994. Relevance and Information Behavior. Annual Review of Information Science and Technology (ARIST). 29, 3--48.
- Scheines, R. et al. 1999. Bayesian estimation and testing of structural equation models. Psychometrika. 64, 1, 37--52.
- Tabachnick, B.G. and Fidell, L.S. 2012. Using Multivariate Statistics. Pearson Education, Limited.
- Tang, R. and Solomon, P. 1998. Toward an understanding of the dynamics of relevance judgment: An analysis of one person's search behavior. IP & M. 34, 2--3, 237--256.
- Taylor, A.R. et al. 2007. Relationships between categories of relevance criteria and stage in task completion. IP & M. 43, 4, 1071--1084.
- The Social Construction of Meaning: An Alternative Perspective on Information Sharing: 2003. http://pubsonline.informs.org/doi/abs/10.1287/isre.14.1.87.14765. Accessed: 2014-01-24.
- Tsikrika, T. and Lalmas, M. 2007. Combining Evidence for Relevance Criteria: A Framework and Experiments in Web Retrieval. Advances in Information Retrieval. G. Amati et al., eds. Springer Berlin Heidelberg, 481--493.
- Vakkari, P. and Hakala, N. 2000. Changes in relevance criteria and problem stages in task performance. Journal of Documentation. 56, 5, 540--562.
- Voorhees, E.M. 1998. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Proceedings of SIGIR'1998 (New York, NY, USA), 315--323.
- Wilson, D. and Sperber, D. 2002. Relevance Theory. Handbook of Pragmatics. G. Ward and L. Horn, eds. Blackwell.
- De Winter, J.C.F. and Dodou, D. 2012. Factor recovery by principal axis factoring and maximum likelihood factor analysis as a function of factor pattern and sample size. Journal of Applied Statistics. 39, 4, 695--710.
- Worthington, R.L. and Whittaker, T.A. 2006. Scale Development Research: A Content Analysis and Recommendations for Best Practices. The Counseling Psychologist. 34, 6, 806--838.
- Wright, S. Correlation and causation.
- Xu, Y. (Calvin) and Chen, Z. 2006. Relevance judgment: What do information users consider beyond topicality? JASIST. 57, 7, 961--973.
- Zuccon, G. et al. 2013. Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems. Information Retrieval. 16, 2, 267--305.
Index Terms
- Multidimensional relevance modeling via psychometrics and crowdsourcing