Abstract
Information retrieval research has shown that system performance does not always correlate positively with user performance, and that users often assign positive evaluation scores to search systems even when they cannot complete their tasks successfully. This research investigated the relationship between objective measures of system performance and users' perceptions of that performance. Subjects evaluated the performance of four search systems whose results were systematically manipulated to vary the ordering and number of relevant documents. Three laboratory studies were conducted with a total of eighty-one subjects. The first two studies examined how the ordering of five relevant and five nonrelevant documents within a ten-item results list affected subjects' evaluations. The third study examined how varying the number of relevant documents within a ten-item results list affected subjects' evaluations. Results demonstrate linear relationships between subjects' evaluations and both the position of relevant documents in the results list and the total number of relevant documents retrieved. Of the two, the number of relevant documents retrieved was the stronger predictor of subjects' evaluation ratings and led subjects to use a greater range of evaluation scores.
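To make the reported linear relationships concrete, the sketch below fits an ordinary least-squares model of the form rating ≈ b0 + b1·(number of relevant documents) + b2·(mean rank of the relevant documents) for a ten-item results list. The data points, the seven-point rating scale, and the specific model form are illustrative assumptions for exposition, not the study's actual measurements or analysis.

```python
# Minimal sketch (hypothetical data): relating a subject's evaluation rating
# to (a) the number of relevant documents in a ten-item results list and
# (b) the mean rank of those relevant documents. All values are invented
# for illustration; they are not the study's data.
import numpy as np

# Each row: (number of relevant docs, mean rank of relevant docs) for one
# manipulated ten-item results list; y holds a subject's rating on a 1-7 scale.
X = np.array([
    [1, 9.0],
    [3, 6.0],
    [5, 6.0],
    [5, 3.0],  # same count, relevant documents pushed toward the top
    [7, 4.0],
    [9, 5.0],
], dtype=float)
y = np.array([1.5, 3.0, 4.0, 4.5, 5.5, 6.5])

# Ordinary least squares with an intercept term.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b_count, b_rank = coef
print(f"rating ~ {intercept:.2f} + {b_count:.2f}*n_relevant + {b_rank:.2f}*mean_rank")
```

Under the abstract's findings one would expect b_count to be positive and larger in effect than b_rank, with b_rank negative (relevant documents nearer the top of the list yield higher ratings).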