ABSTRACT
Cranfield-style information retrieval evaluation considers variance in user information needs by evaluating retrieval systems over a set of search topics. For each search topic, traditional metrics model all users searching ranked lists in exactly the same manner and thus have zero variance in their per-topic estimate of effectiveness. Metrics that fail to model user variance overestimate the effect size of differences between retrieval systems. The modeling of user variance is critical to understanding the impact of effectiveness differences on the actual user experience. If the variance of a difference is high, the effect on user experience will be low. Time-biased gain is an evaluation metric that models user interaction with ranked lists that are displayed using document surrogates. In this paper, we extend the stochastic simulation of time-biased gain to model the variation between users. We validate this new version of time-biased gain by showing that it produces distributions of gain that agree well with actual distributions produced by real users. With a per-topic variance in its effectiveness measure, time-biased gain allows for the measurement of the effect size of differences, which allows researchers to understand the extent to which predicted performance improvements matter to real users.
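The kind of simulation the abstract describes can be sketched in a few lines: sample per-user behavior parameters once per simulated user, walk the ranked list accumulating time, and discount each relevant document's gain by an exponential decay in the time at which it is found. This is a minimal illustration only; the timing distributions, click probabilities, and the 224-second half-life below are illustrative assumptions, not the paper's calibrated values.

```python
import math
import random

def simulate_user(relevance, half_life=224.0, rng=random):
    """Simulate one user working down a ranked list; return their gain.

    relevance: list of 0/1 judgments by rank.
    half_life: gain decay half-life in seconds (illustrative).
    """
    # Per-user parameters sampled once, giving between-user variance.
    summary_time = max(1.0, rng.gauss(4.0, 1.0))   # seconds per snippet
    t = 0.0
    gain = 0.0
    for rel in relevance:
        t += summary_time                          # scan the snippet
        p_click = 0.65 if rel else 0.25            # click-through prob.
        if rng.random() < p_click:
            t += max(5.0, rng.gauss(30.0, 10.0))   # read the document
            if rel and rng.random() < 0.8:         # recognize and save it
                gain += math.pow(0.5, t / half_life)
    return gain

def tbg_distribution(relevance, n_users=1000, seed=0):
    """Monte Carlo distribution of gain over simulated users."""
    rng = random.Random(seed)
    return [simulate_user(relevance, rng=rng) for _ in range(n_users)]

gains = tbg_distribution([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
mean = sum(gains) / len(gains)
var = sum((g - mean) ** 2 for g in gains) / (len(gains) - 1)
```

Because each simulated user draws their own scan and read times, the resulting `gains` list has nonzero per-topic variance, which is what permits effect-size computations when comparing two systems.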