ABSTRACT
Category ranking provides a way to classify plain-text documents into a predetermined set of categories. This work examines typical document collections and analyzes which measures and peculiarities help to represent documents so that the resulting features are as discriminative and representative as possible. Considerations such as selecting only nouns and adjectives, taking expressions rather than single words, and using measures like term length are combined into a simple method that extracts, selects, and weights special n-grams. Several experiments assess the usefulness of the new schema on two data sets (Reuters and OHSUMED) with two algorithms (SVM and a simple sum of weights). In these evaluations, the new approach outperforms some of the best-known and most widely used categorization methods.
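The abstract does not give the actual selection and weighting formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the input is already POS-tagged, that the tag names `NOUN` and `ADJ` mark the words to keep, and that an n-gram's weight is its frequency in a category multiplied by its character length. All function names and the weighting formula are hypothetical.

```python
from collections import Counter, defaultdict

# Assumed tag names; a real POS tagger may use a different tag set.
KEPT_TAGS = {"NOUN", "ADJ"}


def extract_ngrams(tagged_tokens, max_n=3):
    """Return n-grams (tuples of words) taken from runs of nouns/adjectives."""
    ngrams = []
    run = []
    # A sentinel token with a non-kept tag flushes the final run.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in KEPT_TAGS:
            run.append(word.lower())
            continue
        # The run has ended: emit every n-gram of length 1..max_n inside it.
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                ngrams.append(tuple(run[i:i + n]))
        run = []
    return ngrams


def term_weight(ngram, count):
    """Hypothetical weight: frequency times total character length."""
    return count * sum(len(word) for word in ngram)


def train(labeled_docs, max_n=3):
    """Build weights[category][ngram] from (tagged_tokens, category) pairs."""
    counts = defaultdict(Counter)
    for tagged_tokens, category in labeled_docs:
        counts[category].update(extract_ngrams(tagged_tokens, max_n))
    return {
        category: {ng: term_weight(ng, c) for ng, c in counter.items()}
        for category, counter in counts.items()
    }


def rank_categories(weights, tagged_tokens, max_n=3):
    """Rank categories by the sum of weights of the document's n-grams."""
    doc_ngrams = set(extract_ngrams(tagged_tokens, max_n))
    scores = {
        category: sum(w.get(ng, 0) for ng in doc_ngrams)
        for category, w in weights.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


training = [
    ([("crude", "ADJ"), ("oil", "NOUN"), ("prices", "NOUN"), ("rose", "VERB")], "energy"),
    ([("wheat", "NOUN"), ("harvest", "NOUN"), ("fell", "VERB")], "grain"),
]
weights = train(training)
query = [("oil", "NOUN"), ("prices", "NOUN"), ("fell", "VERB")]
print(rank_categories(weights, query))
```

In this toy setup, longer expressions such as `("oil", "prices")` receive more weight than their component words, which is one simple way to let term length favor more specific features.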
Index Terms
- NEWPAR: an automatic feature selection and weighting schema for category ranking