DOI: 10.1145/1180639.1180729

Learning concepts from large scale imbalanced data sets using support cluster machines

Published: 23 October 2006

Abstract

This paper considers the problem of using Support Vector Machines (SVMs) to learn concepts from large-scale imbalanced data sets. The objective is twofold. First, we investigate the effects of large scale and imbalance on SVMs, highlighting the role of linear non-separability in this problem. Second, we develop a practical and theoretically guaranteed meta-algorithm, named Support Cluster Machines (SCMs), to handle both scale and imbalance. It incorporates informative and representative under-sampling mechanisms to speed up the training procedure. SCMs differ from previous similar ideas in two ways: (a) a theoretical foundation is provided, and (b) the clustering is performed in the feature space rather than in the input space. The theoretical analysis not only justifies the proposed approach but also guides its technical choices. Finally, experiments are carried out on both synthetic and TRECVID data. The results support the analysis and show that SCMs are efficient and effective when dealing with large-scale imbalanced data sets.
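The idea of clustering in the feature space and keeping representative samples can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the kernel, the number of clusters, or how representatives are chosen. The RBF kernel, the plain kernel k-means loop, the nearest-to-centroid selection rule, and all function names below are assumptions made for illustration only.

```python
import math
import random


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel; the kernel choice and gamma are assumptions of this sketch."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram_matrix(points, gamma=0.5):
    """Pairwise kernel values, i.e. inner products in the feature space."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


def _mean_similarity(K, i, members):
    """Average kernel value between point i and the members of one cluster."""
    return sum(K[i][j] for j in members) / len(members)


def kernel_kmeans(K, k, n_iter=20, seed=0):
    """Plain kernel k-means on a Gram matrix K; returns one label per point."""
    n = len(K)
    labels = [i % k for i in range(n)]
    random.Random(seed).shuffle(labels)
    for _ in range(n_iter):
        clusters = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Squared norm of each centroid in the feature space.
        within = [
            sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
            for m in clusters
        ]
        new_labels = []
        for i in range(n):
            # Squared feature-space distance from point i to each non-empty centroid.
            dists = {
                c: K[i][i] - 2.0 * _mean_similarity(K, i, m) + within[c]
                for c, m in enumerate(clusters)
                if m
            }
            new_labels.append(min(dists, key=dists.get))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def cluster_representatives(K, labels):
    """Index of the member closest to each cluster centroid in the feature space."""
    reps = []
    for c in sorted(set(labels)):
        m = [i for i, lab in enumerate(labels) if lab == c]
        # The centroid norm is identical for all members, so it is omitted here.
        reps.append(min(m, key=lambda i: K[i][i] - 2.0 * _mean_similarity(K, i, m)))
    return reps


# Synthetic majority-class points (hypothetical data, not from the paper).
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
K = gram_matrix(majority)
labels = kernel_kmeans(K, k=5)
reps = cluster_representatives(K, labels)
reduced_majority = [majority[i] for i in reps]
print(len(majority), "->", len(reduced_majority))
```

In a setup like this, the reduced majority set would be combined with the minority-class samples and passed to any SVM trainer. The sketch covers only a representative under-sampling step; it does not model the informative sampling mechanism or the theoretical analysis mentioned in the abstract.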


Published In

MM '06: Proceedings of the 14th ACM international conference on Multimedia
October 2006
1072 pages
ISBN: 1595934472
DOI: 10.1145/1180639

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. concept modelling
  3. imbalance
  4. kernel k-means
  5. large scale
  6. support vector machines

Qualifiers

  • Article

Conference

MM06: The 14th ACM International Conference on Multimedia 2006
October 23 - 27, 2006
Santa Barbara, CA, USA

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
