Research article. DOI: 10.1145/3170521.3170535

Imbalanced big data classification: a distributed implementation of SMOTE

Published: 04 January 2018

ABSTRACT

In machine learning, the quality of data is the most critical component for building good models. Predictive analytics is a branch of AI that predicts future events from historical data and is applied in diverse fields such as detecting online fraud, oil slicks, intrusion attacks, and credit defaults, and in the prognosis of diseased cells. Unfortunately, in most of these cases traditional learning models fail to produce the required results because of the imbalanced nature of the data, where imbalance denotes the small number of instances belonging to the class under prediction, such as fraud instances among all online transactions. Prediction in imbalanced classification is further limited by factors such as small disjuncts, which are accentuated when the data is partitioned for learning at scale. Synthetic generation of minority-class data (SMOTE [1]) is a pioneering approach by Chawla et al. [1] to offset these limitations and generate more balanced datasets. Although a standard implementation of SMOTE exists in Python, it is unavailable for distributed computing environments that handle large datasets. Bringing SMOTE to a distributed environment under Spark is the key motivation for our research. In this paper we present our algorithm, observations, and results for synthetic generation of minority-class data under Spark using Locality Sensitive Hashing (LSH). We successfully demonstrate a distributed Spark SMOTE that generates quality artificial samples while preserving the spatial distribution of the minority class.
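To make the approach concrete, here is a minimal, hedged PySpark sketch of the idea described above; it is not the authors' implementation. Spark ML's BucketedRandomProjectionLSH approximates the nearest-neighbor search over the minority class, and SMOTE-style interpolation (synthetic = x + u * (neighbor - x), with u drawn uniformly from [0, 1]) generates the artificial samples. The DataFrame name minority_df, the distance threshold, and the LSH parameters are illustrative assumptions.

```python
# Hedged sketch only: LSH-assisted SMOTE on Spark, not the paper's exact algorithm.
import random

from pyspark.sql import Row, SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("lsh-smote-sketch").getOrCreate()

# Toy stand-in for the minority class: in practice this DataFrame would be the
# minority-label slice of a large dataset with a "features" vector column.
minority_df = spark.createDataFrame(
    [(i, Vectors.dense([random.random(), random.random()])) for i in range(200)],
    ["id", "features"],
)

# Step 1: hash minority points with random-projection LSH so that nearby points
# tend to share buckets, keeping neighbor search approximate and distributed.
lsh = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes", bucketLength=0.5, numHashTables=3
)
model = lsh.fit(minority_df)

# Step 2: approximate self-join within a Euclidean distance threshold to get
# candidate (point, neighbor) pairs drawn from the same (minority) class.
pairs = (
    model.approxSimilarityJoin(minority_df, minority_df, 0.3, distCol="dist")
    .filter("datasetA.id != datasetB.id")
    .select("datasetA.features", "datasetB.features")
)

# Step 3: SMOTE interpolation along the segment joining each pair:
# synthetic = x + u * (neighbor - x), with u ~ Uniform(0, 1).
def interpolate(row):
    x, nbr = row[0].toArray(), row[1].toArray()
    u = random.random()
    return Row(features=Vectors.dense(x + u * (nbr - x)))

synthetic_df = pairs.rdd.map(interpolate).toDF()
print("synthetic minority samples generated:", synthetic_df.count())
```

Because both the similarity join and the interpolation are expressed as DataFrame/RDD operations, every step runs distributed under Spark; tuning bucketLength, numHashTables, and the join threshold trades recall of true neighbors against runtime.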

References

  1. Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357.
  2. Yu, H., Hong, S., Yang, X., Ni, J., Dan, Y., Qin, B. 2013. Recognition of multiple imbalanced cancer types based on DNA microarray data using ensemble classifiers. BioMed Research International, 1–13.
  3. Elhag, S., Fernández, A., Bawakid, A., Alshomrani, S., Herrera, F. 2015. On the combination of genetic fuzzy systems and pairwise learning for improving detection rates on intrusion detection systems. Expert Systems with Applications 42(1), 193–202.
  4. He, H., García, E. A. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9), 1263–1284.
  5. Sun, Y., Wong, A. K. C., Kamel, M. S. 2009. Classification of imbalanced data: a review. International Journal of Pattern Recognition and Artificial Intelligence 23(4), 687–719.
  6. Fernández, A., Chawla, N. V., García, S., Palade, V., Herrera, F. 2017. An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Complex and Intelligent Systems 250(20), 113–141.
  7. Batista, G. E. A. P. A., Prati, R. C., Monard, M. C. 2004. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6(1), 20–29.
  8. Ramentol, E., Vluymans, S., Verbiest, N., Caballero, Y., Bello, R., Cornelis, C., Herrera, F. 2015. IFROWANN: imbalanced fuzzy-rough ordered weighted average nearest neighbor classification. IEEE Transactions on Fuzzy Systems 23(5), 1622–1637.
  9. Domingos, P. 1999. MetaCost: a general method for making classifiers cost-sensitive. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD '99), 155–164.
  10. Laney, D. 2001. 3D data management: controlling data volume, velocity, and variety. Tech. rep., META Group.
  11. Apache Spark. https://spark.apache.org/docs/latest/index.html.
  12. Prati, R. C., Batista, G. E. A. P. A., Monard, M. C. 2004. Learning with class skews and small disjuncts. Advances in Artificial Intelligence (SBIA), Springer, 296–306.
  13. Jo, T., Japkowicz, N. 2004. Class imbalances versus small disjuncts. SIGKDD Explorations Newsletter 6(1), 40–49.
  14. Alejo, R., et al. 2013. A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios. Pattern Recognition Letters 34(4), 380–388.
  15. García, S., Herrera, F. 2009. Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evolutionary Computation 17(3), 275–306.
  16. Kamal, S., Ripon, S. H., Dey, N., Ashour, A. S., Santhi, V. 2016. A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset. Computer Methods and Programs in Biomedicine 131, 191–206.
  17. Hu, F., Li, H., Lou, H., Dai, J. 2014. A parallel oversampling algorithm based on NRSBoundary-SMOTE. Journal of Information & Computational Science 11(13), 4655–4665.
  18. Guttman, A. 1984. R-trees: a dynamic index structure for spatial searching. Proceedings of the ACM SIGMOD Conference, 47–57.
  19. Bentley, J. L. 1990. K-d trees for semidynamic point sets. Proceedings of the Symposium on Computational Geometry.
  20. Lu, W., Shen, Y., Chen, S., Ooi, B. C. 2012. Efficient processing of k nearest neighbor joins using MapReduce. Proceedings of the VLDB Endowment 5(10), 1016–1027.
  21. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. 2012. Scalable k-means++. Proceedings of the VLDB Endowment 5(7), 622–633.
  22. Indyk, P., Motwani, R. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on Theory of Computing, 604–613.
  23. Rajaraman, A., Leskovec, J., Ullman, J. D. 2014. Mining of Massive Datasets. Cambridge University Press.
  24. Slaney, M., Casey, M. 2008. Locality-sensitive hashing for finding nearest neighbors. IEEE Signal Processing Magazine, 129–131.
  25. Sundaram, N., Turmukhametova, A., Satish, N., Mostak, T., Indyk, P., Madden, S., Dubey, P. 2013. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment 6(14), 1930–1941.
  26. Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K. 2007. Multi-probe LSH: efficient indexing for high-dimensional similarity search. Proceedings of the 33rd VLDB Conference, 950–961.
  27. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V. S. 2004. Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the 20th Symposium on Computational Geometry (SCG), 253–262.
  28. Huang, J., Ling, C. X. 2005. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering 17(3), 299–310.
  29. ECBDL'14 dataset. http://cruncher.ncl.ac.uk/bdcomp/
  30. Scikit-learn contrib, imbalanced-learn SMOTE. http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html.
  31. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F. 2011. KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing 17(2–3), 255–287.
  32. Abalone dataset. http://sci2s.ugr.es/keel/dataset.php?cod=115.
  33. Yeast dataset. http://sci2s.ugr.es/keel/dataset.php?cod=133.
  34. H2O. https://www.h2o.ai.
  35. Krawczyk, B. 2016. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5(4), 221–232.
  36. Han, H., Wang, W.-Y., Mao, B.-H. 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Advances in Intelligent Computing (ICIC 2005), Part I, LNCS 3644, 878–887.

Published in

Workshops ICDCN '18: Proceedings of the Workshop Program of the 19th International Conference on Distributed Computing and Networking
January 2018, 151 pages
ISBN: 9781450363976
DOI: 10.1145/3170521
Conference Chair: Doina Bein

      Copyright © 2018 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
Publication History: Published 4 January 2018
