DOI: 10.1145/1166160.1166196
Article

NEWPAR: an automatic feature selection and weighting schema for category ranking

Published: 10 October 2006

ABSTRACT

Category ranking provides a way to classify plain text documents into a predetermined set of categories. This work examines typical document collections and analyzes which measures and peculiarities help represent documents so that the resulting features are as discriminative and representative as possible. Considerations such as selecting only nouns and adjectives, taking expressions rather than single words, and using measures like term length are combined into a simple feature selection and weighting method that extracts, selects, and weights special n-grams. Several experiments are performed to demonstrate the usefulness of the new schema with different data sets (Reuters and OHSUMED) and two different algorithms (SVM and a simple sum of weights). In the evaluation, the new approach outperforms some of the best-known and most widely used categorization methods.
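The expression-extraction idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the text has already been tokenized and part-of-speech tagged by some external tool, and the tag names `NOUN` and `ADJ`, the function name `extract_expressions`, and the pattern "zero or more adjectives followed by one or more nouns" are all assumptions made for illustration.

```python
from typing import List, Tuple


def extract_expressions(tagged: List[Tuple[str, str]]) -> List[str]:
    """Collect expressions of the assumed form ADJ* NOUN+ from (word, tag) pairs.

    Hypothetical helper: the tag names and the pattern are illustrative
    assumptions, not the selection rules of the paper.
    """
    expressions: List[str] = []
    current: List[str] = []  # words of the expression being built
    has_noun = False         # True once `current` contains at least one noun

    for word, tag in tagged:
        if tag == "NOUN":
            current.append(word.lower())
            has_noun = True
        elif tag == "ADJ":
            if has_noun:
                # An adjective after a noun closes the expression
                # and starts a new candidate.
                expressions.append(" ".join(current))
                current, has_noun = [], False
            current.append(word.lower())
        else:
            # Any other tag (verbs, determiners, ...) ends the current run;
            # runs without a noun are discarded.
            if has_noun:
                expressions.append(" ".join(current))
            current, has_noun = [], False

    if has_noun:
        expressions.append(" ".join(current))
    return expressions


# Hand-tagged example sentence (tags supplied manually for the sketch).
sentence = [
    ("The", "DET"), ("new", "ADJ"), ("schema", "NOUN"),
    ("outperforms", "VERB"), ("known", "ADJ"),
    ("categorization", "NOUN"), ("methods", "NOUN"),
]
expressions = extract_expressions(sentence)
```

In this sketch the verb and the determiner only act as separators, so the sentence yields two multi-word expressions rather than seven single-word features.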

            Reviews

            Jonathan P. E. Hodgson

            The classification of plain-text documents is an ongoing challenge in information research. This paper proposes an original mixture of existing ideas for the categorization of plain-text documents. Document classification is usually done by associating each document with a vector of weights, computed from terms that appear in the document. A training set is used to establish a set of vectors, each of which is a prototype for a particular category. Documents are assigned to categories based on the closeness of the document's weight vector to the prototype vector of a category. It is possible to assign a document to more than one category. The distinctiveness of NEWPAR, the technique described in the paper, is based in part on the use of only certain n-grams from the text, namely, nouns or nouns preceded by adjectives; verbs in particular are discarded. N-grams that match category descriptors, or those included in titles, are given greater weight. Measures such as term frequency and document frequency, instead of being taken over the whole corpus, are used within each category to select the most discriminating expressions. The category frequency, which measures the number of categories in which an expression occurs, is also used to discriminate among categories. The paper includes results from experiments in which NEWPAR was applied to existing data sets. While in isolated cases NEWPAR is outperformed by one of the other algorithms to which it is compared, NEWPAR with the simple sum of weights criterion is shown to perform well in all cases. The paper is clearly written, and can be read by anyone who has a basic understanding of support vector methods. One issue that is not addressed is the overhead of expression extraction: since this relies on stemming and part-of-speech tagging, it may be substantial.
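The per-category statistics and the sum-of-weights criterion mentioned in the review can be sketched in Python. The sketch below is only an illustration under stated assumptions: the weight formula (in-category document frequency ratio times in-category term frequency, divided by category frequency) is a placeholder chosen for simplicity and is not the NEWPAR weighting, and the names `train_weights` and `rank_categories` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# category -> list of documents, each document a list of expressions
Training = Dict[str, List[List[str]]]


def train_weights(training: Training) -> Dict[str, Dict[str, float]]:
    """Compute an illustrative weight for every expression in every category."""
    # Term frequency: occurrences of an expression within one category.
    tf = {c: Counter(e for doc in docs for e in doc)
          for c, docs in training.items()}
    # Document frequency: documents of one category containing the expression.
    df = {c: Counter(e for doc in docs for e in set(doc))
          for c, docs in training.items()}
    # Category frequency: number of categories in which the expression occurs.
    cf = Counter(e for c in training for e in tf[c])

    weights: Dict[str, Dict[str, float]] = {}
    for c, docs in training.items():
        # Assumed placeholder formula, NOT the weighting defined in the paper.
        weights[c] = {e: (df[c][e] / len(docs)) * tf[c][e] / cf[e]
                      for e in tf[c]}
    return weights


def rank_categories(doc: List[str],
                    weights: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Rank categories by the sum of the weights of the document's expressions."""
    scores = {c: sum(w.get(e, 0.0) for e in doc) for c, w in weights.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy training data, invented for the sketch.
training = {
    "grain": [["wheat", "export"], ["wheat", "harvest"]],
    "money": [["interest rate", "export"], ["bank"]],
}
weights = train_weights(training)
ranking = rank_categories(["wheat", "export"], weights)
```

Dividing by the category frequency lowers the weight of an expression such as "export" that appears in both toy categories, which mirrors the role the review attributes to that measure; ranking all categories by score, rather than picking one, leaves room for assigning a document to more than one category.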

            • Published in

              DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering
              October 2006
              232 pages
              ISBN: 1595935150
              DOI: 10.1145/1166160

              Copyright © 2006 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 10 October 2006

              Qualifiers

              • Article

              Acceptance Rates

              Overall Acceptance Rate: 178 of 537 submissions, 33%
