ABSTRACT
Reliable evaluation of Information Retrieval systems requires large numbers of relevance judgments. Making these annotations is complex and tedious for many Music Information Retrieval tasks, so such evaluations often demand more effort than is practical. A low-cost alternative is to apply Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select which documents to judge so that we can estimate the effectiveness differences between systems with a certain degree of confidence. In this paper we present a first approach to applying these algorithms to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% while still obtaining results with 95% confidence.
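The incremental selection idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual procedure: it assumes a single query, two systems, a mean-gain-at-k measure, a three-level gain scale with a uniform prior over unjudged documents, and a normal approximation for the confidence. All function and variable names here are hypothetical.

```python
import math

GAINS = (0, 1, 2)  # assumed graded relevance scale (illustrative only)
PRIOR_MEAN = sum(GAINS) / len(GAINS)
PRIOR_VAR = sum((g - PRIOR_MEAN) ** 2 for g in GAINS) / len(GAINS)


def weights(run_a, run_b, k):
    """Coefficient of each document in the difference of mean gain at k (A minus B)."""
    w = {}
    for doc in run_a[:k]:
        w[doc] = w.get(doc, 0.0) + 1.0 / k
    for doc in run_b[:k]:
        w[doc] = w.get(doc, 0.0) - 1.0 / k
    return w


def confidence(w, judged):
    """Return (expected difference, confidence in the sign of that difference)."""
    mean, var = 0.0, 0.0
    for doc, c in w.items():
        if doc in judged:
            mean += c * judged[doc]       # known gain, no uncertainty
        else:
            mean += c * PRIOR_MEAN        # unjudged: use the prior expectation
            var += c * c * PRIOR_VAR      # and accumulate its variance
    if var == 0.0:
        return mean, 1.0                  # nothing left to judge: difference is exact
    z = abs(mean) / math.sqrt(var)
    return mean, 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF at z


def minimal_judging(run_a, run_b, oracle, k=5, target=0.95):
    """Judge one document at a time until the sign of the difference is confident enough."""
    w = weights(run_a, run_b, k)
    judged = {}
    while True:
        mean, conf = confidence(w, judged)
        # Documents retrieved by both systems at top k have weight 0 and are skipped.
        pending = [d for d in w if d not in judged and w[d] != 0.0]
        if conf >= target or not pending:
            return mean, conf, judged
        doc = max(pending, key=lambda d: abs(w[d]))  # most influential document first
        judged[doc] = oracle(doc)                    # ask the assessor for a judgment
```

Here `oracle` stands in for a human assessor: any callable that returns a gain for a document. Judging stops as soon as the confidence reaches `target`, so documents that cannot change the comparison are never annotated. With so few documents the normal approximation is crude; it is used only to keep the sketch short.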