ABSTRACT
This paper introduces the feature quantity, a quantitative representation of specificity based on an information-theoretic view of co-occurrence events between terms and documents. Mathematically, the feature quantity is defined as a product of probability and information, and it corresponds closely to the tfidf-like measures widely used in today's IR systems. We present a formal description of the feature quantity, together with illustrative examples of applying it to two types of information retrieval tasks: representative term selection and text categorization.
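To make the "product of probability and information" reading concrete, the following is a minimal sketch and not the paper's formal definition, which the abstract does not give. It assumes the probability factor is a term's relative frequency in the collection and the information factor is the idf-style quantity log2(N / df). The function name feature_quantity_sketch and the toy documents are hypothetical.

```python
import math
from collections import Counter


def feature_quantity_sketch(docs):
    """Score each term as P(term) * information(term).

    Assumptions (not taken from the paper):
      P(term)           = term count / total token count in the collection
      information(term) = log2(N / df), with N documents and df the number
                          of documents containing the term (idf-like)
    """
    n_docs = len(docs)
    # Collection-wide term counts (the tf-like part).
    term_freq = Counter(tok for doc in docs for tok in doc)
    # Number of documents in which each term occurs (the df part).
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    total_tokens = sum(term_freq.values())

    scores = {}
    for term, tf in term_freq.items():
        prob = tf / total_tokens
        info = math.log2(n_docs / doc_freq[term])
        scores[term] = prob * info
    return scores


# Hypothetical toy collection of tokenized documents.
docs = [
    ["retrieval", "term", "weight"],
    ["retrieval", "text", "category"],
    ["retrieval", "term", "term"],
]
scores = feature_quantity_sketch(docs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.4f}")
```

Under these assumptions, a term that occurs in every document (here "retrieval") scores zero because it carries no information, while a term that is both frequent and concentrated in a subset of documents ranks highest. This is the same qualitative behavior that tf-idf weighting shows.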
The feature quantity: an information theoretic perspective of Tfidf-like measures