Pivoted Document Length Normalization

Published: 02 August 2017

Abstract

Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths. In this study, we observe that a normalization scheme that retrieves documents of all lengths with similar chances as their likelihood of relevance will outperform another scheme which retrieves documents with chances very different from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and the retrieval probabilities. Training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection-independent normalization technique. We use the idea of pivoting with the well-known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
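
The abstract describes pivoting only in words. As a minimal sketch, assuming the linear form commonly associated with pivoted normalization, new factor = (1 - slope) * pivot + slope * old factor, the adjustment can be illustrated as follows. The function name, the sample factors, and the slope value are hypothetical and are not taken from the abstract; the pivot is assumed here to be the collection average of the old factor.

```python
def pivoted_norm(old_norm: float, pivot: float, slope: float) -> float:
    """Tilt an existing normalization factor around a pivot value.

    Assumed form: (1 - slope) * pivot + slope * old_norm.
    With slope < 1, factors below the pivot are raised and factors
    above the pivot are lowered; a factor equal to the pivot is unchanged.
    """
    return (1.0 - slope) * pivot + slope * old_norm


# Hypothetical old normalization factors, e.g. the number of unique terms.
old_norms = {"short": 50.0, "medium": 100.0, "long": 150.0}

# Assumption: use the average old factor over the collection as the pivot.
pivot = sum(old_norms.values()) / len(old_norms)

# Assumed slope; in practice it would be tuned on a training collection.
slope = 0.25

new_norms = {doc: pivoted_norm(n, pivot, slope) for doc, n in old_norms.items()}

for doc, factor in new_norms.items():
    print(f"{doc}: old={old_norms[doc]:.1f} new={factor:.1f}")
```

Because a term weight is divided by the normalization factor, raising the factor for short documents and lowering it for long ones shifts retrieval chances toward longer documents, which is the kind of correction the abstract motivates.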

Published in

ACM SIGIR Forum, Volume 51, Issue 2
SIGIR Test-of-Time Awardees 1978-2001
July 2017
276 pages
ISSN: 0163-5840
DOI: 10.1145/3130348
Editors: Donna Harman, Diane Kelly

Copyright © 2017 Copyright is held by the owner/author(s)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States
