column

Pivoted Document Length Normalization

Authors:
Amit Singhal, Cornell University, Ithaca, NY
Chris Buckley, Cornell University, Ithaca, NY
Mandar Mitra, Cornell University, Ithaca, NY

ACM SIGIR Forum, Volume 51, Issue 2, July 2017, pp. 176–184
https://doi.org/10.1145/3130348.3130365

Published: 02 August 2017

Abstract

Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths. In this study, we observe that a normalization scheme that retrieves documents of all lengths with similar chances as their likelihood of relevance will outperform another scheme which retrieves documents with chances very different from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and the retrieval probabilities. Training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection-independent normalization technique. We use the idea of pivoting with the well-known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
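
The abstract does not spell out the formula, so the following is only a minimal sketch. It assumes the linear "tilt" form usually associated with pivoted normalization, in which an existing normalization factor is rotated around a pivot point: new = (1 - slope) * pivot + slope * old. The function names, the toy weight vectors, the choice of the average old factor as the pivot, and the slope value of 0.5 are all illustrative assumptions, not values taken from the paper.

```python
import math


def cosine_norm(weights):
    """Cosine normalization factor: Euclidean length of a term-weight vector."""
    return math.sqrt(sum(w * w for w in weights))


def pivoted_norm(old_norm, pivot, slope):
    """Tilt an existing normalization factor around the pivot (assumed linear form).

    With 0 < slope < 1, factors below the pivot are raised and factors above
    the pivot are lowered; a factor equal to the pivot is left unchanged.
    """
    return (1.0 - slope) * pivot + slope * old_norm


# Toy term-weight vectors (hypothetical data, chosen for round numbers).
docs = {
    "short": [3.0, 4.0],    # cosine factor 5.0
    "long": [9.0, 12.0],    # cosine factor 15.0
}

old = {name: cosine_norm(w) for name, w in docs.items()}

# Assumption: use the average old factor as the pivot.
pivot = sum(old.values()) / len(old)
slope = 0.5  # illustrative; in practice this would be tuned on a training collection

new = {name: pivoted_norm(n, pivot, slope) for name, n in old.items()}

for name in docs:
    print(f"{name}: old={old[name]:.2f} pivoted={new[name]:.2f}")
```

Because a document's term weights are divided by the normalization factor, raising the factor for the short document lowers its retrieval chances and lowering the factor for the long document raises them. This is the kind of correction the abstract describes for closing the gap between retrieval and relevance probabilities.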

Index Terms

Pivoted Document Length Normalization
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Retrieval models and ranking

Index terms have been assigned to the content through auto-classification.

Published in
ACM SIGIR Forum Volume 51, Issue 2
SIGIR Test-of-Time Awardees 1978-2001
July 2017
276 pages
ISSN:0163-5840
DOI:10.1145/3130348
Editors:
Donna Harman, National Institute of Standards and Technology, Gaithersburg MD, USA
Diane Kelly, University of Tennessee, Knoxville TN, USA
Copyright © 2017 Copyright is held by the owner/author(s)
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States