Paper Title
Imbalanced Datasets in Defect Prediction
Abstract
Data pre-processing is important in software defect prediction. Despite achievements in defect prediction, data reliability remains an issue because of class imbalance, and the success of a prediction study depends on the quality of the data used. In this paper, we present a data pre-processing technique that accurately identifies defective and defect-free modules in a dataset and renders the dataset suitable for defect classification. We applied a top-down technique that treats a dataset as a unit to identify both the defective and defect-free classes in 10 projects from the PROMISE repository. The support vector machine classifier achieved an average classification accuracy and specificity of 89.78% and 98.90%, respectively; the neural network classifier achieved an area under the receiver operating characteristic curve, Brier score, Matthews correlation coefficient (MCC), precision, and g-mean of 83.53%, 15.12%, 34.37%, 63.04%, and 41.87%, respectively; the naïve Bayes classifier achieved a recall and a J-coefficient of 78.53% and 31.89%, respectively; and the K-nearest-neighbors classifier achieved an average information score of 36.14%. This manuscript highlights the need to pre-process datasets properly before they are used in machine learning studies, in order to avoid misleading results.
Keywords - Machine Learning, Data Pre-processing, Classification Algorithms, Defect Prediction, Imbalanced Data.