Abstract
Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to retrieve documents of all lengths fairly. In this study, we observe that a normalization scheme that retrieves documents of all lengths with chances similar to their likelihood of relevance will outperform a scheme that retrieves documents with chances very different from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and the retrieval probabilities. Training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection-independent normalization technique. We use the idea of pivoting with the well-known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
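The abstract does not state a formula, so the following is only a minimal sketch of the pivoting idea in the linear form commonly associated with this technique: the old normalization factor is tilted around a pivot point by a slope. The function name, the choice of the pivot as the collection average, and the slope value 0.25 are illustrative assumptions, not values taken from the text above.

```python
def pivoted_norm(old_norm: float, pivot: float, slope: float) -> float:
    """Tilt an existing normalization factor around a pivot.

    Assumed linear form: (1 - slope) * pivot + slope * old_norm.
    At old_norm == pivot the factor is unchanged. With slope < 1,
    factors below the pivot grow and factors above it shrink.
    """
    return (1.0 - slope) * pivot + slope * old_norm


# Hypothetical old normalization factors for three documents
# (for example cosine norms, or counts of unique terms).
old_norms = [10.0, 20.0, 40.0]

# Assumption: use the collection average as the pivot.
pivot = sum(old_norms) / len(old_norms)

# Illustrative slope; in practice it would be tuned on a training collection.
slope = 0.25

new_norms = [pivoted_norm(n, pivot, slope) for n in old_norms]

for old, new in zip(old_norms, new_norms):
    print(f"old={old:6.2f}  pivoted={new:6.2f}")
```

Because a term weight is divided by the normalization factor, a document whose factor shrinks (a long document, under these assumptions) becomes more likely to be retrieved, and one whose factor grows becomes less likely. That is the direction of correction the abstract describes for closing the gap between retrieval and relevance probabilities.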