Feature selection for text categorization on imbalanced data
Source: ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1 (June 2004), Special issue on learning from imbalanced datasets
Pages: 80-89
Year of Publication: 2004
ISSN: 1931-0145
ABSTRACT
A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC), and odds ratio (OR) are considered the most effective. CC and OR are one-sided metrics, while IG and CHI are two-sided. Feature selection with a one-sided metric selects only the features most indicative of membership, whereas feature selection with a two-sided metric implicitly combines the features most indicative of membership (i.e., positive features) and of non-membership (i.e., negative features) by ignoring the signs of the feature scores. The former never considers negative features, which are quite valuable; the latter cannot ensure an optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicitly controlling that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both the great potential and the actual merits of explicitly combining positive and negative features in a nearly optimal fashion according to the imbalanced data.
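The distinction between one-sided and two-sided metrics, and the idea of controlling the mix of positive and negative features, can be illustrated with a minimal Python sketch. It assumes the common definition of CC as the signed square root of the chi-square statistic on a 2x2 term/category contingency table, so that CHI equals CC squared and discards the sign. The function `select_features` and its `pos_ratio` parameter are hypothetical names introduced here for illustration, not the framework defined in the paper, and the document counts are made up to mimic an imbalanced category (10 of 100 documents).

```python
import math


def correlation_coefficient(a, b, c, d):
    """Signed (one-sided) CC score from a 2x2 contingency table.

    a: in-category documents containing the term
    b: out-of-category documents containing the term
    c: in-category documents lacking the term
    d: out-of-category documents lacking the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:  # term occurs in every document or in none
        return 0.0
    return math.sqrt(n) * (a * d - c * b) / math.sqrt(denom)


def chi_square(a, b, c, d):
    """Two-sided CHI score: the square of CC, so the sign is lost."""
    return correlation_coefficient(a, b, c, d) ** 2


def select_features(tables, size, pos_ratio):
    """Pick `size` features with an explicit positive/negative split.

    Hypothetical helper: `pos_ratio` is the assumed fraction of the
    feature set reserved for positive (membership-indicating) terms.
    """
    scores = {t: correlation_coefficient(*cells) for t, cells in tables.items()}
    n_pos = round(size * pos_ratio)
    n_neg = size - n_pos
    positives = sorted((t for t in scores if scores[t] > 0),
                       key=lambda t: -scores[t])
    negatives = sorted((t for t in scores if scores[t] < 0),
                       key=lambda t: scores[t])
    return positives[:n_pos], negatives[:n_neg]


# Toy counts (illustrative only): term -> (a, b, c, d)
tables = {
    "wheat": (8, 2, 2, 88),
    "grain": (5, 5, 5, 85),
    "the": (10, 90, 0, 0),
    "profit": (0, 40, 10, 50),
    "shares": (1, 30, 9, 60),
}

# A two-sided ranking mixes both kinds of terms implicitly.
two_sided_top3 = sorted(tables, key=lambda t: -chi_square(*tables[t]))[:3]

# An explicit split states how many of each kind to keep.
positive_terms, negative_terms = select_features(tables, size=4, pos_ratio=0.5)
```

With a two-sided score, how many negative terms end up in the selected set is a side effect of the score distribution; with the explicit split, it is a parameter that can be tuned to the degree of class imbalance.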