ABSTRACT
Data preprocessing refers to any processing performed on raw data to prepare it for a subsequent processing procedure. Commonly used as a preliminary data mining practice, data preprocessing methods transform the data into a format that classification algorithms can process more easily and effectively. In this paper, a novel data preprocessing method is proposed and evaluated on three difficult classification datasets from the well-known UCI Repository, on which various classifiers achieve an average performance below 75%. The three UCI Repository datasets used are Mammographic Mass, Indian Liver Patient (ILPD) and Contraceptive Method Choice. The proposed data preprocessing method and the Principal Component Analysis preprocessing method were both evaluated using 10-fold cross-validation with five classification algorithms: the nearest-neighbour classifier (IB1), an implementation of the C4.5 algorithm (J48), Random Forest, Multilayer Perceptron and Rotation Forest. The classification results are presented and compared analytically. The results indicate that the features generated by applying the proposed preprocessing method to the original datasets markedly improve the performance of the classification algorithms.
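The evaluation protocol named in the abstract, 10-fold cross-validation of a nearest-neighbour classifier, can be illustrated with a minimal sketch. It is not the authors' WEKA setup and does not implement the proposed preprocessing method, which the abstract does not describe. The function names (`one_nn_predict`, `k_fold_accuracy`, `make_toy_data`) and the synthetic two-class data are assumptions made purely for illustration.

```python
import random


def one_nn_predict(train, query):
    """Return the label of the training point closest to `query`.

    A plain 1-nearest-neighbour rule with squared Euclidean distance,
    used here as a stand-in for an IB1-style classifier (assumption).
    """
    nearest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )
    return nearest[1]


def k_fold_accuracy(data, k=10, seed=0):
    """Estimate accuracy with k-fold cross-validation.

    `data` is a list of (features, label) pairs. Each fold is held out
    once as the test set while the remaining folds form the training set.
    """
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    correct = 0
    for i, test_fold in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct += sum(one_nn_predict(train, x) == y for x, y in test_fold)
    return correct / len(rows)


def make_toy_data(n=200, seed=1):
    """Hypothetical two-class Gaussian data, not one of the UCI datasets."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 3.0 * label
        data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    return data


if __name__ == "__main__":
    print(f"10-fold accuracy: {k_fold_accuracy(make_toy_data()):.3f}")
```

To compare preprocessing methods in this style, the same `k_fold_accuracy` call would be repeated on the raw features and on each transformed version of them, keeping the seed fixed so that every variant is scored on identical folds.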
Index Terms
- A Novel Machine Learning Data Preprocessing Method for Enhancing Classification Algorithms Performance