Article

NEWPAR: an automatic feature selection and weighting schema for category ranking

Authors: Fernando Ruiz-Rico, Jose Luis Vicedo, María-Consuelo Rubio-Sánchez (University of Alicante)

DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering, October 2006, pages 128–137
https://doi.org/10.1145/1166160.1166196

Published: 10 October 2006

ABSTRACT

Category ranking provides a way to classify plain text documents into a predetermined set of categories. This work examines typical document collections and analyzes which measures and peculiarities help to represent documents so that the resulting features are as discriminative and representative as possible. Considerations such as selecting only nouns and adjectives, taking expressions rather than single words, and using measures like term length are combined into a simple method that extracts, selects and weights special n-grams. Several experiments assess the usefulness of the new schema on two data sets (Reuters and OHSUMED) with two algorithms (SVM and a simple sum of weights). In the evaluation, the new approach outperforms some of the best known and most widely used categorization methods.
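
To make the extraction idea concrete, the following is a minimal sketch of how expressions built only from nouns and adjectives might be collected from text that has already been part-of-speech tagged. The function name `extract_expressions`, the tag labels `"NOUN"` and `"ADJ"`, and the `max_len` limit are assumptions made for illustration; the abstract does not specify the tag set, the maximum n-gram length, or the exact extraction rule.

```python
def extract_expressions(tagged, max_len=3):
    """Collect n-grams that end in a noun and contain only adjectives and nouns.

    `tagged` is a list of (word, tag) pairs from any POS tagger.
    The tag names and max_len are illustrative assumptions.
    """
    expressions = []
    for end, (_, tag) in enumerate(tagged):
        if tag != "NOUN":
            continue  # every expression must end in a noun
        start = end
        # Extend leftwards over adjectives and nouns, up to max_len tokens.
        while (start >= 0 and end - start < max_len
               and tagged[start][1] in ("ADJ", "NOUN")):
            words = [word.lower() for word, _ in tagged[start:end + 1]]
            expressions.append(" ".join(words))
            start -= 1
    return expressions


sentence = [("Automatic", "ADJ"), ("text", "NOUN"), ("categorization", "NOUN"),
            ("improves", "VERB"), ("retrieval", "NOUN")]
print(extract_expressions(sentence))
```

In this sketch the verb acts as a boundary, so no expression spans "improves"; the paper's own rules for which n-grams to keep may differ.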

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Logical and relational learning
        Inductive logic learning
2. Information systems
  1. Information retrieval

Reviews

Reviewer: Jonathan P. E. Hodgson

The classification of plain-text documents is an ongoing challenge in information research. This paper proposes an original mixture of existing ideas for the categorization of plain-text documents. Document classification is usually done by associating each document with a vector of weights, computed from terms that appear in the document. A training set is used to establish a set of vectors, each of which is a prototype for a particular category. Documents are assigned to categories based on the closeness of the document's weight vector to the prototype vector of a category. It is possible to assign a document to more than one category.

The distinctiveness of NEWPAR, the technique described in the paper, is based in part on the use of only certain n-grams from the text, namely, nouns or nouns preceded by adjectives; verbs in particular are discarded. N-grams that match category descriptors, or those included in titles, are given greater weight. Measures such as term frequency and document frequency, instead of being taken over the whole corpus, are used within each category to select the most discriminating expressions. The category frequency, which measures the number of categories in which an expression occurs, is also used to discriminate among categories.

The paper includes results from experiments in which NEWPAR was applied to existing data sets. While in isolated cases NEWPAR is outperformed by one of the other algorithms to which it is compared, NEWPAR with the simple sum of weights criterion is shown to perform well in all cases. The paper is clearly written, and can be read by anyone who has a basic understanding of support vector methods. One issue that is not addressed is the overhead of expression extraction: since this relies on stemming and part-of-speech tagging, it may be substantial.
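
The measures the reviewer mentions (term frequency and document frequency taken within each category, plus category frequency) and the sum-of-weights ranking can be sketched roughly as follows. This is not the paper's weighting formula, which the review does not give: the product `tf * (df / n_docs) / cf` is a placeholder chosen only so that expressions frequent inside a category and rare across categories score higher. The names `train` and `rank` are likewise hypothetical.

```python
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (expressions, categories) pairs.

    Returns weights[category][expression] using a placeholder formula
    (an assumption, not the formula from the paper).
    """
    tf = defaultdict(Counter)  # term frequency within each category
    df = defaultdict(Counter)  # document frequency within each category
    n_docs = Counter()         # number of training documents per category
    for expressions, categories in docs:
        for cat in categories:
            n_docs[cat] += 1
            tf[cat].update(expressions)
            df[cat].update(set(expressions))
    # Category frequency: number of categories in which an expression occurs.
    cf = Counter(expr for cat in tf for expr in tf[cat])
    return {
        cat: {expr: tf[cat][expr] * (df[cat][expr] / n_docs[cat]) / cf[expr]
              for expr in tf[cat]}
        for cat in tf
    }


def rank(weights, expressions):
    """Rank categories by the simple sum of weights of matching expressions."""
    scores = {cat: sum(w.get(expr, 0.0) for expr in expressions)
              for cat, w in weights.items()}
    return sorted(scores, key=scores.get, reverse=True)


training = [
    (["wheat", "grain export"], ["grain"]),
    (["wheat", "harvest"], ["grain"]),
    (["interest rate", "bank"], ["money"]),
    (["bank", "wheat"], ["money"]),
]
weights = train(training)
print(rank(weights, ["wheat", "harvest"]))
```

Because a document may carry several categories, `train` counts it once for each of them, and `rank` returns every category in score order rather than a single label, which matches the category-ranking setting described above.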

Published in
DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering
October 2006
232 pages
ISBN:1595935150
DOI:10.1145/1166160
General Chair: Dick Bulterman (CWI, Netherlands)
Program Chair: David F. Brailsford (University of Nottingham, UK)
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
SVM
category ranking
machine learning
text categorization
text classification
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 178 of 537 submissions, 33%