Random Permutation-based Hybrid Feature Selection for Software Bug Prediction using Bayesian Statistical Validation

Tamanna; Om Prakash Sangwan

doi:https://doi.org/10.14445/22490183/IJETT-V70I4P216

Research Article | Open Access

Volume 70 | Issue 4 | Year 2022 | Article Id. IJETT-V70I4P216 | DOI : https://doi.org/10.14445/22490183/IJETT-V70I4P216

Received	Revised	Accepted	Published
12 Feb 2022	16 Mar 2022	18 Mar 2022	26 Apr 2022

Citation :

Tamanna, Om Prakash Sangwan, "Random Permutation-based Hybrid Feature Selection for Software Bug Prediction using Bayesian Statistical Validation," International Journal of Engineering Trends and Technology (IJETT), vol. 70, no. 4, pp. 188-202, 2022. Crossref, https://doi.org/10.14445/22490183/IJETT-V70I4P216

Abstract

Software Fault Prediction (SFP) is a key practice in developing quality software. To meet rising user expectations, software is becoming more complex, and its source code grows as new functionality is added. A strategy such as SFP can help detect faults early and avoid software downtime. To reduce the cost of SFP, we propose a permutation-based hybrid feature selection model (PFS). The model helps remove irrelevant and redundant features without compromising classifier performance. PFS is compared with five supervised feature selection methods: Chi-squared, Correlation, Sequential Forward Feature Selection, Sequential Backward Feature Selection, and Mutual Information. A Random Forest (RF) classifier is employed, and experimental results (Accuracy, Precision, Recall, and AUC-ROC) are reported on twenty-four datasets from three public software repositories. A Bayesian statistical analysis of the AUC-ROC results found that PFS outperformed the other techniques in terms of lower computational time and lower dimensionality.
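
The general idea behind permutation-based feature scoring can be illustrated with a minimal sketch. This is not the authors' PFS algorithm, whose hybrid steps are not described in the abstract. It only shows the generic technique: shuffle one feature at a time on held-out data and measure how much the classifier's score drops. To stay self-contained, the sketch assumes synthetic data, a nearest-centroid classifier as a stand-in for Random Forest, accuracy instead of AUC-ROC, and an arbitrary threshold of 0.05. All function names are hypothetical.

```python
import random

def make_data(n, rng):
    """Synthetic two-class data (assumption): feature 0 is informative, 1 and 2 are noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)])
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Stand-in classifier (not Random Forest): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(centroids, X, y, n_repeats, rng):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(centroids, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def select_features(importances, threshold):
    """Keep the indices of features whose mean score drop exceeds the threshold."""
    return [j for j, imp in enumerate(importances) if imp > threshold]

rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(200, rng)
model = fit_centroids(X_train, y_train)
importances = permutation_importance(model, X_test, y_test, n_repeats=10, rng=rng)
selected = select_features(importances, threshold=0.05)
print(importances, selected)
```

Features whose shuffling barely changes the score are candidates for removal, which is how a permutation criterion can reduce dimensionality without hurting the classifier. With scikit-learn available, `sklearn.inspection.permutation_importance` applied to a fitted `RandomForestClassifier` would be the closer analogue of the setup the abstract describes.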

Keywords

Feature selection, Bayesian signed-rank test, ROC-AUC, Fault prediction.
