Feature selection for text categorization on imbalanced data

Authors:

Rohini Srihari

ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1

Pages 80 - 89

https://doi.org/10.1145/1007730.1007741

Published: 01 June 2004

Abstract

A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC), and odds ratio (OR) are considered the most effective. CC and OR are one-sided metrics, while IG and CHI are two-sided. Feature selection using one-sided metrics selects only the features most indicative of membership, while feature selection using two-sided metrics implicitly combines the features most indicative of membership (i.e., positive features) and of non-membership (i.e., negative features) by ignoring the signs of the features. The former never considers the negative features, which are quite valuable, while the latter cannot ensure the optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicitly controlling that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both the great potential and the actual merits of explicitly combining positive and negative features in a nearly optimal fashion according to the imbalanced data.
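The difference between one-sided and two-sided metrics, and the idea of controlling the mix of positive and negative features, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the framework, so the function names, the `positive_ratio` parameter, and the toy document counts are all assumptions made for illustration. The sketch uses the common 2x2 contingency-table form of CC, with CHI taken as its square, so CHI discards the sign that CC keeps.

```python
import math

def correlation_coefficient(a, b, c, d):
    """Signed (one-sided) score from a 2x2 term/category contingency table.

    a: in-category documents containing the term
    b: out-of-category documents containing the term
    c: in-category documents lacking the term
    d: out-of-category documents lacking the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # the term carries no information (e.g. it occurs everywhere)
    return math.sqrt(n) * (a * d - c * b) / math.sqrt(denom)

def chi_square(a, b, c, d):
    """Two-sided score: the square of CC, so the sign is ignored."""
    return correlation_coefficient(a, b, c, d) ** 2

def select_features(tables, size, positive_ratio):
    """Pick up to `size` terms with an explicit share of positive features.

    `positive_ratio` is a hypothetical knob: 1.0 mimics one-sided selection,
    smaller values reserve room for the most negative features.
    """
    scores = {term: correlation_coefficient(*cells) for term, cells in tables.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    positives = [t for t in ranked if scores[t] > 0]            # most positive first
    negatives = [t for t in reversed(ranked) if scores[t] < 0]  # most negative first
    n_pos = min(round(size * positive_ratio), len(positives))
    n_neg = min(size - n_pos, len(negatives))
    return positives[:n_pos] + negatives[:n_neg]

# Invented counts: 100 documents, only 10 of them in the category (imbalanced).
tables = {
    "wheat": (8, 2, 2, 88),    # mostly inside the category
    "grain": (6, 4, 4, 86),
    "stock": (0, 45, 10, 45),  # never inside the category
    "bank": (1, 50, 9, 40),
    "the": (10, 90, 0, 0),     # in every document
}

one_sided = select_features(tables, size=2, positive_ratio=1.0)
mixed = select_features(tables, size=2, positive_ratio=0.5)
top_chi = sorted(tables, key=lambda t: chi_square(*tables[t]), reverse=True)[:2]
print(one_sided, mixed, top_chi)
```

On these invented counts, ranking by CHI alone returns the same two positive terms as the one-sided selection, while setting `positive_ratio` to 0.5 forces one negative term into the selected set. How to choose that ratio well is the question the paper investigates; the sketch only shows the mechanism.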

Published In

ACM SIGKDD Explorations Newsletter Volume 6, Issue 1

Special issue on learning from imbalanced datasets

June 2004

117 pages

ISSN: 1931-0145

EISSN: 1931-0153

DOI: 10.1145/1007730

Copyright © 2004 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States
