Feature selection for text categorization on imbalanced data
Source: ACM SIGKDD Explorations Newsletter
Volume 6, Issue 1 (June 2004)
Special issue on learning from imbalanced datasets
Pages: 80-89
Year of Publication: 2004
ISSN:1931-0145
Authors
Zhaohui Zheng  University at Buffalo, Amherst, NY
Xiaoyun Wu  University at Buffalo, Amherst, NY
Rohini Srihari  University at Buffalo, Amherst, NY
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1007730.1007741

ABSTRACT

A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC) and odds ratio (OR) are considered most effective. CC and OR are one-sided metrics while IG and CHI are two-sided. Feature selection using one-sided metrics selects only the features most indicative of membership, while feature selection using two-sided metrics implicitly combines the features most indicative of membership (i.e., positive features) and of non-membership (i.e., negative features) by ignoring the signs of the features. The former never considers the negative features, which are quite valuable, while the latter cannot ensure the optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicit control of that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both the great potential and the actual merits of explicitly combining positive and negative features in a nearly optimal fashion according to the imbalanced data.
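
The distinction between one-sided and two-sided metrics can be illustrated with a small sketch. It uses the standard 2x2 contingency-table form of the correlation coefficient (signed) and of chi-square (its square, so the sign is lost). The function names, the `positive_ratio` parameter, and the toy counts are assumptions made for illustration; this is not necessarily the framework proposed in the article.

```python
import math

def correlation_coefficient(a, b, c, d):
    """Signed CC of a term for a category, from a 2x2 contingency table.

    a: in-category documents containing the term
    b: out-of-category documents containing the term
    c: in-category documents without the term
    d: out-of-category documents without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate; treat as uninformative
    return math.sqrt(n) * (a * d - c * b) / math.sqrt(denom)

def chi_square(a, b, c, d):
    """Two-sided CHI: the square of CC, so positive and negative terms mix."""
    return correlation_coefficient(a, b, c, d) ** 2

def select_features(tables, size, positive_ratio):
    """Explicitly combine positive and negative features (hypothetical helper).

    tables: dict mapping term -> (a, b, c, d)
    size: total number of features to keep
    positive_ratio: fraction of the budget reserved for positive features
    """
    scores = {t: correlation_coefficient(*counts) for t, counts in tables.items()}
    positives = sorted((t for t in scores if scores[t] > 0), key=lambda t: -scores[t])
    negatives = sorted((t for t in scores if scores[t] < 0), key=lambda t: scores[t])
    n_pos = min(int(round(size * positive_ratio)), len(positives))
    n_neg = min(size - n_pos, len(negatives))
    return positives[:n_pos] + negatives[:n_neg]

# Toy imbalanced category: 10 in-category documents, 90 out-of-category.
tables = {
    "wheat": (8, 2, 2, 88),    # mostly appears inside the category
    "grain": (6, 5, 4, 85),
    "stock": (0, 60, 10, 30),  # mostly appears outside the category
    "bank": (1, 40, 9, 50),
}

one_sided = select_features(tables, size=2, positive_ratio=1.0)
combined = select_features(tables, size=4, positive_ratio=0.5)
```

With `positive_ratio=1.0` the selection behaves like a one-sided metric and keeps only positive terms; lowering the ratio reserves part of the budget for negative terms, which is the kind of explicit control the abstract describes.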


