DOI:10.1145/1516241.1516336
research-article

MDS: a novel method for class imbalance learning

Published: 15 February 2009

Abstract

Many real-world data sets have imbalanced class distributions, in which almost all examples belong to one class and far fewer belong to the others. The minority examples are usually the more interesting class: rare diseases in diagnosis data, failures in inspection data, fraud in credit screening data, and so on. A classifier induced from an imbalanced data set typically has high classification accuracy for the majority class but an unacceptable error rate for the minority class. This situation is called the class imbalance problem, and it has attracted considerable attention from researchers in data mining. To address it, this work proposes a novel method called Mahalanobis Distance based Sampling (MDS). Experimental results indicate that MDS identifies the minority class better than the traditional techniques of under-sampling, cost adjusting, and cluster-based sampling.
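The abstract names the Mahalanobis distance but does not describe how MDS uses it for sampling. The following is therefore only a minimal sketch of the distance itself, d(x) = sqrt((x - mean)^T S^{-1} (x - mean)), and not the authors' sampling procedure. The restriction to two features, the function names, and the sample points are all assumptions made for illustration.

```python
import math


def mean_and_cov_2d(points):
    """Return the mean vector and sample covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of point x from a 2-D distribution (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Closed-form inverse of a 2x2 matrix: (1/det) * [[d, -b], [-c, a]]
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Hypothetical majority-class examples (illustrative values only).
majority = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 3.0), (5.0, 5.0)]
mean, cov = mean_and_cov_2d(majority)

# Distance of a new example from the majority-class distribution.
print(mahalanobis_2d((4.5, 1.0), mean, cov))
```

Unlike the Euclidean distance, this measure rescales each direction by the spread and correlation of the data, so a point can be far from a class in Mahalanobis terms even when it is close to the class mean in raw units.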

Reviews

Jan De Beule

The class imbalance problem refers to the fact that real-world data often contain a majority class of examples and a minority class of examples. Such a dataset is called an imbalanced dataset. A classifier induced from such a set has high classification accuracy for majority examples, but also a high error rate for minority examples. A typical imbalance problem is addressed either by an algorithm- or model-oriented approach or by data manipulation techniques. This paper discusses a novel approach to the imbalanced data problem. The proposed method, Mahalanobis distance-based sampling (MDS), is very technical and is clearly explained in the paper. Chen, Hsu, and Chang compare their method with existing ones, and conclude that their method can drastically improve the classification ability for imbalanced data. However, certain details should be further investigated and should motivate future research.

Online Computing Reviews Service
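The gap between overall accuracy and minority-class error described in the review can be made concrete with a small sketch. The 95/5 class split, the label encoding (0 for majority, 1 for minority), and the function name are assumptions chosen for illustration; they are not taken from the paper.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority_label=1):
    """Return (overall accuracy, recall on the minority class)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    minority_total = sum(t == minority_label for t in y_true)
    minority_hits = sum(
        t == minority_label and p == minority_label
        for t, p in zip(y_true, y_pred)
    )
    accuracy = correct / len(y_true)
    # Recall is undefined when there are no minority examples at all.
    recall = minority_hits / minority_total if minority_total else float("nan")
    return accuracy, recall


# Hypothetical data: 95 majority examples (0) and 5 minority examples (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc, rec = accuracy_and_minority_recall(y_true, y_pred)
print(acc, rec)  # prints 0.95 0.0
```

A classifier that never predicts the minority class still scores 95% accuracy here, which is why work on imbalanced data usually reports minority-class measures alongside overall accuracy.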

Published In

ICUIMC '09: Proceedings of the 3rd International Conference on Ubiquitous Information Management and Communication
February 2009
704 pages
ISBN:9781605584058
DOI:10.1145/1516241

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Mahalanobis distance
  2. class imbalance problem
  3. classification
  4. data mining
  5. imbalanced data

Qualifiers

  • Research-article

Conference

ICUIMC '09

Cited By

  • (2016) Laughter detection using data mining and human feedback. 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS), 25-28. DOI:10.1109/ICSESS.2016.7883009. Online publication date: Aug-2016
  • (2015) Deep Learning with MCA-based Instance Selection and Bootstrapping for Imbalanced Data Classification. Proceedings of the 2015 IEEE Conference on Collaboration and Internet Computing (CIC), 288-295. DOI:10.1109/CIC.2015.40. Online publication date: 27-Oct-2015
  • (2012) An Improved SVM-KM Model for Imbalanced Datasets. Proceedings of the 2012 International Conference on Industrial Control and Electronics Engineering, 100-103. DOI:10.1109/ICICEE.2012.35. Online publication date: 23-Aug-2012
  • (2011) Evaluating Integrated Weight Linear method to class imbalanced learning in video data. 2011 3rd Conference on Data Mining and Optimization (DMO), 243-247. DOI:10.1109/DMO.2011.5976535. Online publication date: Jun-2011
