DOI: 10.1145/345508.345556

The feature quantity: an information theoretic perspective of Tfidf-like measures

Published: 01 July 2000

ABSTRACT

The feature quantity, a quantitative representation of specificity introduced in this paper, is based on an information theoretic perspective of co-occurrence events between terms and documents. Mathematically, the feature quantity is defined as a product of probability and information, and maintains a good correspondence with the tfidf-like measures popularly used in today's IR systems. In this paper, we present a formal description of the feature quantity, as well as some illustrative examples of applying such a quantity to different types of information retrieval tasks: representative term selection and text categorization.
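To make the phrase "a product of probability and information" concrete, the following minimal sketch writes an ordinary normalized tf-idf weight in that form. The abstract does not state the paper's formula, so the decomposition used here is an assumption: the probability factor is taken to be the within-document relative frequency of a term, and the information factor is the negative log of the fraction of documents containing it. The function name and the toy corpus are hypothetical, and the result is standard tf-idf, not necessarily the feature quantity as the paper defines it.

```python
import math
from collections import Counter


def tfidf_as_probability_times_information(docs):
    """Weight each term as (probability) * (information).

    Assumed decomposition (not taken from the paper):
      probability = tf(t, d) / len(d)      relative frequency of t in d
      information = log2(N / df(t))        surprisal of "a document contains t"
    """
    n_docs = len(docs)

    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log2(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy corpus of tokenized documents.
docs = [
    ["information", "retrieval", "term", "weighting"],
    ["term", "selection", "term", "specificity"],
    ["text", "categorization", "term"],
]
weights = tfidf_as_probability_times_information(docs)
```

In this toy corpus "term" occurs in every document, so its information factor is zero and it receives no weight regardless of how often it appears, while a word confined to one document is weighted by its relative frequency times log2(3). That behaviour is what tfidf-like measures are expected to show for specific versus non-specific terms.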

Published in

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN: 1581132263
DOI: 10.1145/345508

Copyright © 2000 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 792 of 3,983 submissions (20%)