Feature selection for text categorization on imbalanced data

Published: 01 June 2004

Abstract

A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC) and odds ratio (OR) are considered the most effective. CC and OR are one-sided metrics, while IG and CHI are two-sided. Feature selection using one-sided metrics selects only the features most indicative of membership, while feature selection using two-sided metrics implicitly combines the features most indicative of membership (i.e., positive features) and of non-membership (i.e., negative features) by ignoring the signs of the features. The former never considers the negative features, which are quite valuable, while the latter cannot ensure the optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicitly controlling that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both the great potential and the actual merits of explicitly combining positive and negative features in a nearly optimal fashion suited to the imbalanced data.
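The distinction between one-sided, two-sided and explicitly combined selection can be illustrated with a small sketch. It is not the framework from the paper, whose details the abstract does not give. It assumes the usual 2x2 contingency-table form of CC, in which CHI is the square of CC, so that squaring discards the sign. The function names, the `positive_fraction` parameter and the toy document counts (10 in-category documents against 990 out-of-category ones) are all hypothetical and chosen only for illustration.

```python
from math import sqrt

def correlation_coefficient(a, b, c, d):
    """Signed (one-sided) score from a 2x2 term/category table.

    a: in-category docs containing the term
    b: out-of-category docs containing the term
    c: in-category docs without the term
    d: out-of-category docs without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return sqrt(n) * (a * d - c * b) / sqrt(denom)

def chi_square(a, b, c, d):
    """Two-sided score: the square of CC, so the sign is lost."""
    return correlation_coefficient(a, b, c, d) ** 2

def select_one_sided(scores, k):
    """Top-k terms by signed score (favours positive features)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def select_two_sided(scores, k):
    """Top-k terms by squared score (signs ignored, mix is implicit)."""
    return sorted(scores, key=lambda t: scores[t] ** 2, reverse=True)[:k]

def select_explicit(scores, k, positive_fraction):
    """Hypothetical explicit combination: a fixed share of the k slots
    goes to positive features, the remainder to negative features."""
    n_pos = round(k * positive_fraction)
    n_neg = k - n_pos
    ranked = sorted(scores, key=scores.get, reverse=True)
    positives = [t for t in ranked if scores[t] > 0][:n_pos]
    negatives = [t for t in reversed(ranked) if scores[t] < 0][:n_neg]
    return positives + negatives

# Toy, made-up counts (a, b, c, d) for an imbalanced category.
counts = {
    "goal":   (8, 10, 2, 980),
    "match":  (6, 30, 4, 960),
    "league": (5, 5, 5, 985),
    "stock":  (0, 500, 10, 490),
    "market": (0, 300, 10, 690),
}
scores = {t: correlation_coefficient(*abcd) for t, abcd in counts.items()}

one_sided = select_one_sided(scores, 3)
two_sided = select_two_sided(scores, 4)
explicit = select_explicit(scores, 4, positive_fraction=0.5)
```

On these toy counts the negative features ("stock", "market") receive much smaller absolute scores than the positive ones, so the two-sided ranking admits only one of them among four features, whereas the explicit variant lets the caller fix the split at two positive and two negative features.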

Published In

ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1
Special issue on learning from imbalanced datasets
June 2004
117 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1007730

Publisher

Association for Computing Machinery, New York, NY, United States
