ABSTRACT
In machine learning, data quality is the most critical component for building good models. Predictive analytics is a branch of AI that predicts future events from historical data and is applied in diverse fields such as detecting online fraud, oil slicks, intrusion attacks, and credit defaults, and in the prognosis of diseased cells. Unfortunately, in most of these cases traditional learning models fail to produce the required results because the data is imbalanced: only a small number of instances belong to the class under prediction, like fraud cases among all online transactions. Prediction on imbalanced data is further limited by factors such as small disjuncts, which are accentuated when the data is partitioned for learning at scale. Synthetic generation of minority-class data (SMOTE [1]), a pioneering approach by Chawla et al. [1], offsets these limitations and produces more balanced datasets. Although a standard implementation of SMOTE exists in Python, none is available for distributed computing environments handling large datasets. Bringing SMOTE to a distributed environment under Spark is the key motivation for our research. In this paper we present our algorithm, observations, and results for synthetic generation of minority-class data under Spark using Locality-Sensitive Hashing (LSH). We successfully demonstrate a distributed Spark SMOTE that generates quality artificial samples while preserving the spatial distribution of the data.
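The core idea described above, replacing SMOTE's exact k-nearest-neighbor search with LSH bucketing so that neighbor lookup parallelizes, can be illustrated with a minimal single-machine sketch. This is not the paper's implementation: the function names are hypothetical, only one random-projection hash table is used, and in the actual work the bucketing and interpolation would run distributed under Spark (e.g. with Spark MLlib's `BucketedRandomProjectionLSH`).

```python
import random

def lsh_bucket(point, planes):
    # Hash key: the sign pattern of dot products with random hyperplanes.
    return tuple(
        1 if sum(x * w for x, w in zip(point, plane)) >= 0 else 0
        for plane in planes
    )

def smote_lsh(minority, n_synthetic, n_planes=3, seed=42):
    """Sketch of SMOTE with LSH-approximated neighbors: points that fall
    into the same bucket are treated as candidate nearest neighbors."""
    rng = random.Random(seed)
    dim = len(minority[0])
    # Random Gaussian hyperplanes define one random-projection LSH table.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n_planes)]
    buckets = {}
    for p in minority:
        buckets.setdefault(lsh_bucket(p, planes), []).append(p)
    # Buckets with at least two points can supply a (sample, neighbor) pair.
    eligible = [b for b in buckets.values() if len(b) >= 2]
    if not eligible:
        eligible = [minority]  # degenerate case: fall back to the full set
    synthetic = []
    for _ in range(n_synthetic):
        bucket = rng.choice(eligible)
        a, b = rng.sample(bucket, 2)
        gap = rng.random()
        # Interpolate between a minority point and its approximate neighbor,
        # exactly as classic SMOTE does along the line segment a -> b.
        synthetic.append([ai + gap * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

Because every synthetic point lies on a segment between two points of the same bucket, the generated samples stay inside the local neighborhoods of the minority class, which is how the spatial distribution is preserved.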
- Chawla, N., et al. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321--357.
- Yu, H., Hong, S., Yang, X., Ni, J., Dan, Y., Qin, B. 2013. Recognition of multiple imbalanced cancer types based on DNA microarray data using ensemble classifiers. BioMed Research International, 1--13.
- Elhag, S., Fernández, A., Bawakid, A., Alshomrani, S., Herrera, F. 2015. On the combination of genetic fuzzy systems and pairwise learning for improving detection rates on intrusion detection systems. Expert Systems with Applications 42(1), 193--202.
- He, H., García, E. A. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9), 1263--1284.
- Sun, Y., Wong, A., Kamel, M. 2009. Classification of Imbalanced Data: A Review. International Journal of Pattern Recognition and Artificial Intelligence 23(4), 687--719.
- Fernández, A., Chawla, N., García, S., Palade, V., Herrera, F. 2017. An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Complex and Intelligent Systems 250(20), 113--141.
- Batista, G. E. A. P. A., Prati, R. C., Monard, M. C. 2004. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6(1), 20--29.
- Ramentol, E., Vluymans, S., Verbiest, N., Caballero, Y., Bello, R., Cornelis, C., Herrera, F. 2015. IFROWANN: imbalanced fuzzy-rough ordered weighted average nearest neighbor classification. IEEE Transactions on Fuzzy Systems 23(5), 1622--1637.
- Domingos, P. 1999. MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD'99), 155--164.
- Laney, D. 2001. 3D data management: Controlling data volume, velocity, and variety. Tech. rep., META Group.
- Apache Spark. https://spark.apache.org/docs/latest/index.html
- Prati, R. C., Batista, G. E., Monard, M. C. 2004. Learning with class skews and small disjuncts. Advances in Artificial Intelligence - SBIA, Springer, 296--306.
- Jo, T., Japkowicz, N. 2004. Class imbalances versus small disjuncts. SIGKDD Explorations Newsletter 6(1), 40--49.
- Alejo, R., et al. 2013. A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios. Pattern Recognition Letters 34(4), 380--388.
- García, S., Herrera, F. 2009. Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary Computation 17(3), 275--306.
- Kamal, S., Ripon, S. H., Dey, N., Ashour, A. S., Santhi, V. 2016. A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset. Computer Methods and Programs in Biomedicine 131, 191--206.
- Hu, F., Li, H., Lou, H., Dai, J. 2014. A parallel oversampling algorithm based on NRSBoundary-SMOTE. Journal of Information & Computational Science 11(13), 4655--4665.
- Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. SIGMOD Conference, 47--57.
- Bentley, J. L. 1990. K-d trees for semidynamic point sets. Symposium on Computational Geometry.
- Lu, W., Shen, Y., Chen, S., Ooi, B. C. 2012. Efficient processing of k nearest neighbor joins using MapReduce. Proceedings of the VLDB Endowment 5(10), 1016--1027.
- Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. 2012. Scalable k-means++. Proceedings of the VLDB Endowment 5(7), 622--633.
- Indyk, P., Motwani, R. 1998. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, 604--613.
- Leskovec, J., Rajaraman, A., Ullman, J. D. 2014. Mining of Massive Datasets. Cambridge University Press.
- Slaney, M., Casey, M. 2008. Locality-Sensitive Hashing for Finding Nearest Neighbors. IEEE Signal Processing Magazine, 129--131.
- Sundaram, N., Turmukhametova, A., Satish, N., Mostak, T., Indyk, P., Madden, S., Dubey, P. 2013. Streaming Similarity Search over One Billion Tweets using Parallel Locality-Sensitive Hashing. Proceedings of the VLDB Endowment 6(14), 1930--1941.
- Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K. 2007. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. Proceedings of the 33rd VLDB, 950--961.
- Datar, M., Immorlica, N., Indyk, P., Mirrokni, V. S. 2004. Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the 20th Symposium on Computational Geometry (SCG), 253--262.
- Huang, J., Ling, C. X. 2005. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering 17(3), 299--310.
- ECBDL'14 dataset. http://cruncher.ncl.ac.uk/bdcomp/
- Imbalanced-learn (scikit-learn-contrib). http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html
- Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F. 2011. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. Journal of Multiple-Valued Logic and Soft Computing 17(2-3), 255--287.
- Abalone dataset. http://sci2s.ugr.es/keel/dataset.php?cod=115
- Yeast dataset. http://sci2s.ugr.es/keel/dataset.php?cod=133
- H2O. https://www.h2o.ai
- Krawczyk, B. 2016. Learning from imbalanced data: Open challenges and future directions. Progress in Artificial Intelligence 5(4), 221--232.
- Han, H., Wang, W.-Y., Mao, B.-H. 2005. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Advances in Intelligent Computing, Part I, LNCS 3644, 878--887.
Index Terms: Imbalanced big data classification: a distributed implementation of SMOTE