ABSTRACT
Approaches to training and evaluating search engines often rely on crowdsourced assessments of document relevance with respect to a user query. To make such assessments usable for either evaluation or learning, we propose a new framework for inferring true document relevance from crowdsourced data, one that is simpler than previous approaches and achieves better performance. For each assessor, we model quality and bias in the form of Gaussian-distributed class conditionals over relevance grades. For each document, we model true relevance and difficulty as continuous variables. We estimate all parameters from crowdsourced data, demonstrating better inference of relevance as well as realistic models of both documents and assessors.
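The generative view described above can be sketched as follows. This is an illustrative Python sketch under assumed functional forms, not the paper's exact model: an assessor perceives a document's true relevance corrupted by Gaussian noise whose scale grows with both document difficulty and assessor noise, shifted by the assessor's bias, and the perceived value is thresholded into an ordinal grade. All parameter names (`z`, `s`, `bias`, `sigma`, `cuts`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of documents and assessors.
n_docs, n_assessors = 100, 5
z = rng.normal(0.0, 1.0, n_docs)          # true relevance per document
s = rng.gamma(2.0, 0.5, n_docs)           # difficulty per document (noise scale)
bias = rng.normal(0.0, 0.3, n_assessors)  # per-assessor grade bias
sigma = rng.gamma(2.0, 0.3, n_assessors)  # per-assessor noise (inverse quality)
cuts = np.array([-1.0, 0.0, 1.0])         # thresholds separating 4 ordinal grades

def observe(d, a):
    """Sample assessor a's grade for document d: true relevance plus the
    assessor's bias plus Gaussian noise, thresholded into an ordinal grade."""
    perceived = z[d] + bias[a] + rng.normal(0.0, sigma[a] * s[d])
    return int(np.searchsorted(cuts, perceived))  # grade in {0, 1, 2, 3}

# Simulated crowdsourced label set: (document, assessor, grade) triples.
labels = [(d, a, observe(d, a))
          for d in range(n_docs) for a in range(n_assessors)]
```

In the actual framework these parameters are unknown and estimated from the observed grades (e.g., by maximum likelihood), rather than fixed as here.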
A document-pair likelihood model works best, and we extend it to pairwise learning to rank. By drawing more information directly from the input data, it outperforms existing state-of-the-art approaches to learning to rank from crowdsourced assessments. We validate the framework experimentally on four TREC datasets.
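One way a document-pair likelihood can feed a pairwise learner is sketched below. This is an illustrative sketch, not the paper's exact formulation: it assumes the inferred relevances are Gaussian, derives a soft preference probability for each pair, and uses it as the target in a RankNet-style cross-entropy objective. The function names are hypothetical.

```python
import math

def pair_prob(mu_i, mu_j, var_i, var_j):
    """P(doc i truly more relevant than doc j) when inferred relevances are
    Gaussian: the difference is Gaussian, so this is a normal CDF at 0."""
    diff_sd = math.sqrt(var_i + var_j)
    return 0.5 * (1.0 + math.erf((mu_i - mu_j) / (diff_sd * math.sqrt(2.0))))

def pair_loss(score_i, score_j, p_target):
    """Cross-entropy between the ranker's pair probability (logistic in the
    score difference) and the soft preference inferred from assessments."""
    p_model = 1.0 / (1.0 + math.exp(-(score_i - score_j)))
    return -(p_target * math.log(p_model)
             + (1.0 - p_target) * math.log(1.0 - p_model))
```

Using soft pair targets rather than hard majority-vote preferences lets uncertain pairs contribute proportionally less to the training gradient.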
Index Terms
- Aggregation of Crowdsourced Ordinal Assessments and Integration with Learning to Rank: A Latent Trait Model