A multistrategy approach for digital text categorization from imbalanced documents

Published: 01 June 2004

Abstract

The goal of the research described here is to develop a multistrategy classifier system for document categorization. The system automatically discovers classification patterns by applying several empirical learning methods to different representations of preclassified documents drawn from an imbalanced sample. The learners work in parallel: each learner carries out its own feature selection based on evolutionary techniques and then obtains a classification model. When classifying documents, the system combines the learners' predictions, again by applying evolutionary techniques. The system relies on a modular, flexible architecture that makes no assumptions about the design or the number of the learners available and guarantees independence from the thematic domain.
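
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the learners, the fitness measure, or the combination scheme, so everything below is an assumption made for illustration. The sketch assumes a nearest-centroid learner, two document representations (term counts and thresholded term presence), a small generational genetic algorithm, minority-class F1 as fitness, and a weighted vote whose weights are evolved. All function names (`evolve`, `train_centroid`, `combine`, `make_docs`) are hypothetical, the data are synthetic, and fitness is measured on the training documents only to keep the example short.

```python
import random

def f1(y_true, y_pred):
    """F1 on the minority (positive) class, a common score for imbalanced data."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def evolve(fitness, length, new_gene, pop_size=20, generations=30, p_mut=0.1, rng=None):
    """Tiny GA: elitism, tournament selection, one-point crossover, per-gene mutation."""
    rng = rng or random.Random(0)
    pop = [[new_gene(rng) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        nxt = scored[:2]  # keep the two best individuals unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(scored, 3), key=fitness)
            b = max(rng.sample(scored, 3), key=fitness)
            cut = rng.randrange(1, length) if length > 1 else 0
            child = a[:cut] + b[cut:]
            nxt.append([new_gene(rng) if rng.random() < p_mut else g for g in child])
        pop = nxt
    return max(pop, key=fitness)

def train_centroid(X, y, mask):
    """Assumed learner: nearest class centroid over the features selected by mask."""
    idx = [i for i, m in enumerate(mask) if m]
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[i] for r in rows) / len(rows) for i in idx]
    c0, c1 = centroid(0), centroid(1)
    def predict(x):
        d0 = sum((x[i] - c) ** 2 for i, c in zip(idx, c0))
        d1 = sum((x[i] - c) ** 2 for i, c in zip(idx, c1))
        return 1 if d1 < d0 else 0
    return predict

def combine(preds, weights):
    """Weighted vote over the learners' 0/1 predictions."""
    total = sum(weights) or 1.0
    n = len(preds[0])
    return [1 if sum(w * p[i] for w, p in zip(weights, preds)) / total >= 0.5 else 0
            for i in range(n)]

def make_docs(rng, n_pos=4, n_neg=16, n_feat=8):
    """Synthetic imbalanced term-count vectors; only features 0 and 1 carry signal."""
    X, y = [], []
    for label, n in ((1, n_pos), (0, n_neg)):
        for _ in range(n):
            row = [rng.randint(0, 3) for _ in range(n_feat)]
            row[0] += 3 * label
            row[1] += 2 * label
            X.append(row)
            y.append(label)
    return X, y

rng = random.Random(42)
X, y = make_docs(rng)
# Two representations of the same documents, one learner per representation.
views = {"counts": X, "binary": [[int(v > 1) for v in row] for row in X]}

def score(mask, V):
    predict = train_centroid(V, y, mask)
    return f1(y, [predict(x) for x in V])

masks, preds = {}, []
for name, V in views.items():
    # Each learner evolves its own feature mask independently of the others.
    masks[name] = evolve(lambda m, V=V: score(m, V), len(V[0]),
                         lambda r: r.randint(0, 1), rng=rng)
    predict = train_centroid(V, y, masks[name])
    preds.append([predict(x) for x in V])

# The combination step evolves one vote weight per learner.
weights = evolve(lambda w: f1(y, combine(preds, w)), len(preds),
                 lambda r: r.random(), rng=rng)
final_f1 = f1(y, combine(preds, weights))
print(masks, weights, final_f1)
```

Because `evolve` only needs a fitness function and a gene generator, the same routine serves both feature selection (binary genes) and prediction combination (real-valued genes), which loosely mirrors the abstract's point that the architecture makes no assumptions about how the learners are designed. A realistic setup would score fitness on held-out documents rather than on the training set.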

Published In

ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1
Special issue on learning from imbalanced datasets
June 2004
117 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1007730

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. feature selection
  2. genetic algorithms
  3. multistrategy learning
