DOI: 10.1145/1390817.1390822
research-article

Can data transformation help in the detection of fault-prone modules?

Published: 20 July 2008

Abstract

Data preprocessing (transformation) plays an important role in data mining and machine learning. In this study, we investigate the effect of four different preprocessing methods on fault-proneness prediction using nine datasets from the NASA Metrics Data Program (MDP) and ten classification algorithms. Our experiments indicate that log transformation rarely improves classification performance, but discretization affects the performance of many different algorithms. The impact of a transformation depends on the algorithm. The random forest algorithm, for example, performs better with the original and log-transformed datasets. Boosting and NaiveBayes perform significantly better with discretized data. We conclude that no general benefit can be expected from data transformations. Instead, selected transformation techniques are recommended to boost the performance of specific classification algorithms.
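
To make the two transformations named in the abstract concrete, the following is a minimal sketch in Python and not the authors' implementation. The function names, the sample metric values, the `floor` constant used to avoid taking the logarithm of zero, and the use of equal-width binning as the discretization method are all assumptions made for illustration; the abstract does not specify which discretization technique the study applied.

```python
import math


def log_transform(values, floor=1e-6):
    """Replace each value v with ln(max(v, floor)).

    The floor is an illustrative assumption that keeps ln() defined
    for metrics that can be zero (for example, a module with no branches).
    """
    return [math.log(max(v, floor)) for v in values]


def equal_width_bins(values, n_bins=3):
    """Map each value to a bin index in [0, n_bins - 1].

    Equal-width binning is a simple unsupervised stand-in chosen for
    illustration, not necessarily the method used in the study.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land on index n_bins, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical lines-of-code metric for six modules (made-up values).
loc = [0, 3, 12, 45, 200, 1500]

logged = log_transform(loc)
binned = equal_width_bins(logged, n_bins=3)
```

Either output could then be passed to a classifier in place of the raw metric, which is the kind of comparison the abstract describes.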

Published In

DEFECTS '08: Proceedings of the 2008 workshop on Defects in large software systems
July 2008
48 pages
ISBN:9781605580517
DOI:10.1145/1390817

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ISSTA '08