DOI: 10.1145/1458082.1458137
research-article

Generalized inverse document frequency

Published: 26 October 2008

Abstract

Inverse document frequency (IDF) is one of the most useful and widely used concepts in information retrieval. There have been various attempts to provide theoretical justifications for IDF. One of the most appealing derivations follows from the Robertson-Spärck Jones relevance weight. However, this derivation, and others related to it, typically make a number of strong assumptions that are often glossed over. In this paper, we re-examine these assumptions from a Bayesian perspective, discuss possible alternatives, and derive a new, more generalized form of IDF that we call generalized inverse document frequency. In addition to providing theoretical insights into IDF, we also undertake a rigorous empirical evaluation that shows generalized IDF outperforms classical versions of IDF on a number of ad hoc retrieval tasks.
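
For orientation, the classical quantities the abstract refers to can be sketched as follows. This is a minimal illustration of the textbook IDF, log(N/n), and of the standard Robertson-Spärck Jones relevance weight with 0.5 smoothing; it is not the generalized IDF derived in the paper, whose form the abstract does not give. The function and parameter names are assumptions made for this sketch.

```python
import math


def classical_idf(n_docs, doc_freq):
    """Textbook IDF: log(N / n), where n documents out of N contain the term."""
    return math.log(n_docs / doc_freq)


def rsj_weight(n_docs, doc_freq, n_rel=0, rel_doc_freq=0):
    """Robertson-Sparck Jones relevance weight with the usual 0.5 smoothing.

    n_docs       N: number of documents in the collection
    doc_freq     n: number of documents containing the term
    n_rel        R: number of known relevant documents
    rel_doc_freq r: number of known relevant documents containing the term

    With no relevance information (R = r = 0) this reduces to the
    IDF-like form log((N - n + 0.5) / (n + 0.5)).
    """
    num = (rel_doc_freq + 0.5) * (n_docs - doc_freq - n_rel + rel_doc_freq + 0.5)
    den = (doc_freq - rel_doc_freq + 0.5) * (n_rel - rel_doc_freq + 0.5)
    return math.log(num / den)


if __name__ == "__main__":
    # Hypothetical collection statistics, chosen only for illustration.
    for df in (10, 100, 600):
        print(df, classical_idf(1000, df), rsj_weight(1000, df))
```

One visible consequence of the no-relevance form is that the weight becomes negative for terms occurring in more than half of the collection, whereas log(N/n) never does.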



    Published In

    CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
    October 2008
    1562 pages
    ISBN: 9781595939913
    DOI: 10.1145/1458082
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. estimation
    2. formal models
    3. inverse document frequency

    Qualifiers

    • Research-article

    Conference

    CIKM08
    CIKM08: Conference on Information and Knowledge Management
    October 26 - 30, 2008
    Napa Valley, California, USA

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
