skip to main content
research-article

Interpreting TF-IDF term weights as making relevance decisions

Published: 20 June 2008 Publication History

Abstract

A novel probabilistic retrieval model is presented. It forms a basis to interpret the TF-IDF term weights as making relevance decisions. It simulates the local relevance decision-making for every location of a document, and combines all of these “local” relevance decisions as the “document-wide” relevance decision for the document. The significance of interpreting TF-IDF in this way is the potential to: (1) establish a unifying perspective about information retrieval as relevance decision-making; and (2) develop advanced TF-IDF-related term weights for future elaborate retrieval models. Our novel retrieval model is simplified to a basic ranking formula that directly corresponds to the TF-IDF term weights. In general, we show that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems. In the basic ranking formula, the remaining quantity - log p(r¯|td) is interpreted as the probability of randomly picking a nonrelevant usage (denoted by r¯) of term t. Mathematically, we show that this quantity can be approximated by the inverse document-frequency (IDF). Empirically, we show that this quantity is related to IDF, using four reference TREC ad hoc retrieval data collections.

References

[1]
Aizawa, A. 2003. An information-theoretic perspective of TF-IDF measures. Inf. Process. Manage. 39, 1, 45--65.]]
[2]
Amati, G. and van Rijsbergen, C. J. 1998. Semantic information retrieval. In Information Retrieval: Uncertainty and Logics, C. J. Van Rijsbergen et al., Eds. Kluwer Academic, 189--220.]]
[3]
Amati, G. and van Rijsbergen, C. J. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4, 357--389.]]
[4]
Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, New York.]]
[5]
Bodoff, D. and Robertson, S. E. 2004. A new unified probabilistic model. J. Amer. Soc. Inf. Sci. Technol. 55, 6, 471--487.]]
[6]
Bookstein, A. and Swanson, D. 1974. Probabilistic models for automatic indexing. J. Amer. Soc. Inf. Sci. 25, 312--318.]]
[7]
Calado, P., Ribeiro-Neto, B., Ziviani, N., Moura, E., and Silva, I. 2003. Local versus global link information in the Web. ACM Trans. Inf. Syst. 21, 1, 42--63.]]
[8]
Clarke, C. L. A. and Scholer, F. 2005. The 2005 terabyte track. In Proceedings of the 14th Text Retrieval Conference, Gaithersburg, MD, E. M. Voorhees and L. P. Buckland, Eds. National Institute of Standards and Technology.]]
[9]
Clough, P., Sanderson, M., and Müller, H. 2004. The CLEF cross language image retrieval track (ImageCLEF) 2004. In Proceedings of 3rd International Conference on Image and Video Conference, Dublin, Ireland, P. Enser et al., Eds. Lecture Notes in Computer Science, vol. 3115. Springer, 243--251.]]
[10]
Cooper, W. S., Chen, A., and Gey, F. C. 1993. Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression. In Proceedings of the 2nd Text Retrieval Conference, Gaithersburg, MD, D. K. Harman, Ed. National Institute of Standards and Technology, 57--66.]]
[11]
Cooper, W. S., Gey, F. C., and Dabney, D. P. 1992. Probabilistic retrieval based on staged logistic regression. In Proceedings of the 15th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, E. Fox et al., Eds. ACM, New York, 198--210.]]
[12]
Cooper, W. S. and Maron, M. E. 1978. Foundations of probabilistic and utility-theoretic indexing. J. ACM 25, 1, 67--80.]]
[13]
Crestani, F., Lalmas, M., van Rijsbergen, C. J., and Campbell, I. 1998. “Is this document relevant?… probably”: A survey of probabilistic models in information retrieval. ACM Comput. Surv. 30, 4, 528--552.]]
[14]
Croft, W. B. and Harper, D. J. 1979. Using probabilistic models of document retrieval without relevance information. J. Document. 35, 285--295.]]
[15]
Damerau, F. 1965. An experiment in automatic indexing. Amer. Document. 16, 283--289.]]
[16]
de Vries, A. P. and Roelleke, T. 2005. Relevance information: A loss of entropy but a gain for IDF? In Proceedings of the 28th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, G. Marchionini et al., Eds. ACM, New York, 282--289.]]
[17]
Dombi, J. 1982. A general class of fuzzy operators, the DeMorgan class of fuzzy operators and fuzziness measures induced by fuzzy operators. Fuzzy Sets Syst. 8, 149--163.]]
[18]
Dyckhoff, H. and Pedrycz, W. 1984. Generalized means as a model of compensative connectives. Fuzzy Sets Syst. 14, 143--154.]]
[19]
Feller, W. 1968. An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd ed. Wiley, New York.]]
[20]
French, S. 1986. Decision Theory:An Introduction to the Mathematics of Rationality. Ellis Horwood, Chichester, UK.]]
[21]
Fuhr, N. 1989. Models for retrieval with probabilistic indexing. Inf. Process. Manage. 25, 1, 55--72.]]
[22]
Gale, W. A., Church, K. W., and Yarowsky, D. 1992. Work on statistical methods for word sense disambiguation. In Working Notes of the AAAI Fall Symposium Series, Probabilistic Approaches to Natural Language, Cambridge, MA, 54--60.]]
[23]
Harman, D. 2004. Personal communication at NTCIR-4.]]
[24]
Harter, S. P. 1974. A probabilistic approach to automatic keyword indexing. Ph.D. thesis, Graduate Library, The University of Chicago, Thesis no. T25146.]]
[25]
Harter, S. P. 1975a. A probabilistic approach to automatic keyword indexing. Part I: On the distribution of specialty words in a technical literature. J. Amer. Soc. Inf. Sci. 26, 4, 197--206.]]
[26]
Harter, S. P. 1975b. A probabilistic approach to automatic keyword indexing. Part II: An algorithm for probabilistic indexing. J. Amer. Soc. Inf. Sci. 26, 4, 280--289.]]
[27]
Hiemstra, D. 1998. A linguistically motivated probabilistic model of information retrieval. In Proceedings of the 2nd European Conference on Research and Advanced Technology for Digital Libraries, Heraklion, Crete, Greece, C. Nikolaou and C. Stephanidis, Eds. Springer, London, 569--584.]]
[28]
Huang, X., Peng, F., Shuurmans, D., Cercone, N., and Robertson, S. E. 2003. Applying machine learning to text segmentation for information retrieval. Inf. Retr. 6, 3--4, 333--362.]]
[29]
Hung, K. Y., Luk, R. W. P., Yeung, D. S., Chung, K. F. L., and Shu, W. H. 2001. Determination of context window size. Int. J. Comput. Process. Orient. Lang. 14, 1, 71--80.]]
[30]
Joachims, T. 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, D. H. Fisher, Ed. Morgan Kaufmann, San Francisco, CA, 143--151.]]
[31]
Kaszkiel, M., Zobel, J., and Sacks-Davis, R. 1999. Efficient passage ranking for document databases. ACM Trans. Inf. Syst. 17, 4, 406--439.]]
[32]
Klir, G. J. and Folger, T. A. 1988. Fuzzy Sets, Uncertainty, and Information. Prentice-Hall, NJ.]]
[33]
Kwok, K. L. 1995. A network approach to probabilistic information retrieval. ACM Trans. Inf. Syst. 13, 3, 324--353.]]
[34]
Lau, K. Y. K. and Luk, R. W. P. 1999. Word-Sense classification by hierarchical clustering. J. Chinese Lang. Comput. 9, 1, 101--121.]]
[35]
Lavrenko, V. and Croft, W. B. 2001. Relevance-Based language model. In Proceedings of the 24th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, D. H. Kraft et al., Eds. ACM, New York, 120--127.]]
[36]
Lavrenko, V. and Croft, W. B. 2003. Relevance models in information retrieval. In Language Modeling for Information Retrieval, W. B. Croft and J. Lafferty, Eds. Kluwer Academic, Chapter 2.]]
[37]
Lee, L. 2007. IDF revisited: A simple new derivation within the Robertson-Spärck Jones probabilistic model. In Proceedings of the 30th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, C. L. A. Clarke et al., Eds. ACM, New York, 751--752.]]
[38]
Liu, X. and Croft, W. B. 2002. Passage retrieval based on language models. In Proceedings of the 11th ACM Conference on Information and Knowledge Management, Mclean, VA, C. Nicholas et al., Eds. ACM, New York, 375--382.]]
[39]
Lucassen, J. M. and Mercer, R. L. 1984. An information-theoretic approach to automatic determination of phonemic baseforms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, San Diego, CA. IEEE, 304--307.]]
[40]
Luhn, H. 1958. The automatic creation of literature abstracts. IBM J. Res. Devel. 2, 2, 159--165.]]
[41]
Margulis, E. L. 1992. N-Poisson document modelling. In Proceedings of the 15th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, E. Fox et al. Eds. ACM, New York, 177--189.]]
[42]
Maron, M. E. and Kuhns, J. L. 1960. On relevance, probabilistic indexing and information retrieval. J. ACM 25, 3, 216--244.]]
[43]
Paice, C. D. 1984. Soft evaluation of Boolean search queries in information retrieval systems. Inf. Technol. Res. Devel. Appl. 3, 1, 33--41.]]
[44]
Papineni, K. 2001. Why inverse document frequency? In Proceedings of the 2nd Meeting of the North American Chapter of The Association for Computational Linguistics, Pittsburgh, PA, L. Levin et al., Eds. Association for Computational Linguistics, Morristown, NJ, 25--32.]]
[45]
Pickens, J. and Macfarlane, A. 2006. Term context models for information retrieval. In Proceedings of the 15th ACM Conference on Information and Knowledge Management, Arlington, VA, E. Fox et al., Eds. ACM, New York, 559--566.]]
[46]
Ponte, J. M. and Croft, W. B. 1998. A language modeling approach in information retrieval. In Proceedings of the 21st Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, W. B. Croft et al., Eds. ACM, New York, 275--281.]]
[47]
Porter, M. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.]]
[48]
Robertson, S. E. 1977. The probability ranking principle in IR. J. Document. 33, 4, 294--304.]]
[49]
Robertson, S. E. 1997. Overview of the Okapi projects. J. Document. 53, 1, 3--7.]]
[50]
Robertson, S. E. 2004. Understanding inverse document frequency: On theoretical arguments for IDF. J. Document. 60, 5, 503--520.]]
[51]
Robertson, S. E. 2005. On event spaces and probabilistic models in information retrieval. Inf. Retr. 8, 2, 319--329.]]
[52]
Robertson, S. E. and Spärck Jones, K. 1976. Relevance weighting of search terms. J. Amer. Soc. Inf. Sci. 27, 3, 129--146.]]
[53]
Robertson, S. E., van Rijsbergen, C. J., and Porter, M. F. 1981. Probabilistic models of indexing and searching. In Information Retrieval Research, R. N. Oddy et al., Eds. Butterworths, 35--56.]]
[54]
Robertson, S. E. and Walker, S. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, W. B. Croft and C. J. van Rijsbergen, Eds. ACM, New York, 232--241.]]
[55]
Robertson, S. E. and Walker, S. 1997. On relevance weights with little relevance information. In Proceedings of the 20th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, PA, F. Can et al., Eds. ACM, New York, 16--24.]]
[56]
Robertson, S. E., Walker, S., and Hancock-Beaulieu, M. M. 1995. Large test collection experiments on an operational, interactive system: Okapi at TREC. Inf. Process. Manage. 31, 3, 345--360.]]
[57]
Roelleke, T. 2003. A frequency-based and a Poisson-based definition of probability of being informative. In Proceedings of the 26th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, C. Clarke et al., Eds. ACM, New York, 227--234.]]
[58]
Roelleke, T. and Wang, J. 2006. A parallel derivation of probabilistic information retrieval models. In Proceedings of the 29th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, S. Dumais, et al., Eds. ACM, New York, 107--114.]]
[59]
Salton, G. and Buckley, C. 1988. Term weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5, 513--523.]]
[60]
Salton, G., Fox, E. A., and Wu, H. 1983. Extended Boolean information retrieval. Commun. ACM 26, 11, 1022--1036.]]
[61]
Salton, G., Wong, A., and Yang, C. S. 1975. A vector space model for information retrieval. J. Amer. Soc. Inf. Sci. 18, 11, 613--620.]]
[62]
Spärck Jones, K. 1972. Exhaustivity and specificity. J. Document. 28, 11--21.]]
[63]
Spärck Jones, K. 2004. IDF term weighting and IR research lessons. J. Document. 60, 521--523.]]
[64]
Spärck Jones, K., Walker, S., and Robertson, S. E. 2000. A probabilistic model of information retrieval: Development and comparative experiments: Part 2. Inf. Process. Manage. 36, 6, 809--840.]]
[65]
Trotman, A. and Geva, S. 2006. Passage retrieval and other XML-retrieval tasks. In Proceedings of the ACM SIGIR Workshop on XML Element Retrieval Methodology, Seattle, WA, A. Trotman and S. Geva, Eds., 43--50.]]
[66]
van Rijsbergen, C. J. 1975. Information Retrieval. Butterworths, London.]]
[67]
Vechtomova, O., Karamuftuoglu, M., and Robertson, S. E. 2006. On document relevance and lexical cohesion between query terms. Inf. Process. Manage. 24, 5, 1230--1247.]]
[68]
Waller, W. G. and Kraft, D. H. 1979. A mathematical model of a weighted Boolean retrieval system. Inf. Process. Manage. 15, 5, 235--245.]]
[69]
Wong, A. K. C. and Ghahraman, D. 1975. A statistical analysis of interdependence in character sequences. Inf. Sci. 8, 2, 173--188.]]
[70]
Wong, S. K. M., Ziarko, W., Raghavan, V. V., and Wong, P. C. N. 1986. On extending the vector space model for Boolean query processing. In Proceedings of the 9th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Palazzo dei Congressi, Pisa, Italy, F. Rabitti, Ed. ACM, New York, 175--185.]]
[71]
Wu, H. C., Luk, R. W. P., Wong, K. F., and Kwok, K. L. 2005. A retrospective study of probabilistic context-based retrieval. In Proceedings of the 28th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, G. Marchionini et al., Eds. ACM, New York, 663--664.]]
[72]
Wu, H. C., Luk, R. W. P., Wong, K. F., and Kwok, K. L. 2006. Probabilistic document-context based relevance feedback with limited relevance judgment. In Proceedings of the 15th ACM Conference on Information and Knowledge Management, Arlington, VA, P. S. Yu et al., Eds. ACM, New York, 854--855.]]
[73]
Wu, H. C., Luk, R. W. P., Wong, K. F., and Kwok, K. L. 2007. A retrospective study of a hybrid document-context based retrieval model. Inf. Process. Manage. 43, 5, 1308--1331.]]
[74]
Xu, J. and Croft, W. B. 2000. Improving the effectiveness of information retrieval using local context analysis. ACM Trans. Inf. Syst, 18, 1, 79--112.]]
[75]
Yao, Y. Y. and Wong, S. K. M. 1991. Preference structure, inference and set-oriented retrieval. In Proceedings of the 14th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, E. Fox, Ed. ACM, New York, 211--218.]]
[76]
Yu, C. T. and Salton, G. 1976. Precision weighting—An effective automatic indexing method. J. ACM 23, 1, 76--88.]]
[77]
Zahn, C. T. 1971. Graph-Theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. 20, 1, 68--86.]]
[78]
Zhai, C. X. and Lafferty, J. 2003. A risk minimization framework for information retrieval. In Proceedings of the ACM SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval.]]
[79]
Zhai, C. X. and Lafferty, J. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22, 2, 179--214.]]

Cited By

View all
  • (2025) Genetic Transfer in Action: Uncovering DNA Flow in an Extremophilic Microbial Community Environmental Microbiology10.1111/1462-2920.7004827:2Online publication date: 3-Feb-2025
  • (2025)Enhancing Product Bundling with Large Language Models2025 13th International Conference on Intelligent Control and Information Processing (ICICIP)10.1109/ICICIP64458.2025.10898122(151-154)Online publication date: 6-Feb-2025
  • (2025)Hardware Trojan Detection in Open-Source Hardware Designs Using Machine LearningIEEE Access10.1109/ACCESS.2025.354615613(37771-37788)Online publication date: 2025
  • Show More Cited By

Index Terms

  1. Interpreting TF-IDF term weights as making relevance decisions

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Information Systems
    ACM Transactions on Information Systems  Volume 26, Issue 3
    June 2008
    236 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/1361684
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 June 2008
    Accepted: 01 September 2007
    Revised: 01 July 2007
    Received: 01 August 2004
    Published in TOIS Volume 26, Issue 3

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Information retrieval
    2. relevance decision
    3. term weight

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)362
    • Downloads (Last 6 weeks)42
    Reflects downloads up to 02 Mar 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2025) Genetic Transfer in Action: Uncovering DNA Flow in an Extremophilic Microbial Community Environmental Microbiology10.1111/1462-2920.7004827:2Online publication date: 3-Feb-2025
    • (2025)Enhancing Product Bundling with Large Language Models2025 13th International Conference on Intelligent Control and Information Processing (ICICIP)10.1109/ICICIP64458.2025.10898122(151-154)Online publication date: 6-Feb-2025
    • (2025)Hardware Trojan Detection in Open-Source Hardware Designs Using Machine LearningIEEE Access10.1109/ACCESS.2025.354615613(37771-37788)Online publication date: 2025
    • (2025)Generative AI-enhanced human-AI collaborative conceptual design: A systematic literature reviewDesign Studies10.1016/j.destud.2025.10130097(101300)Online publication date: Mar-2025
    • (2025)Personalized product design and user review and experience analysis: A data-driven hybrid novel approachComputers & Industrial Engineering10.1016/j.cie.2025.110939202(110939)Online publication date: Apr-2025
    • (2024)Research Trends on Living Donors for Liver Transplantation: A Text Network Analysis and Topic ModelingJournal of Korean Academy of Fundamentals of Nursing10.7739/jkafn.2024.31.2.15731:2(157-167)Online publication date: 31-May-2024
    • (2024)Ensemble learning approach for distinguishing human and computer-generated Arabic reviewsPeerJ Computer Science10.7717/peerj-cs.234510(e2345)Online publication date: 30-Sep-2024
    • (2024)An Efficient Framework for Finding Similar Datasets Based on OntologyElectronics10.3390/electronics1322441713:22(4417)Online publication date: 11-Nov-2024
    • (2024)Enriching Language Models with Graph-Based Context Information to Better Understand Textual DataElectronics10.3390/electronics1310191913:10(1919)Online publication date: 14-May-2024
    • (2024)PageRank Algorithm-Based Recommendation System for Construction Safety GuidelinesBuildings10.3390/buildings1410304114:10(3041)Online publication date: 24-Sep-2024
    • Show More Cited By

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media