Study of Data Mining Algorithms Using a Dataset from the Size-Effect on Open Source Software Defects

Document Type: Research Paper

Authors

Department of Computer Science, College of Computer Science and Information Technology, Kirkuk University, Kirkuk, Iraq.

Abstract

This article evaluates data mining algorithms in terms of accuracy and time consumption. To identify the best-performing algorithm among the classification and clustering algorithms, all algorithms are tested in the WEKA program using a real dataset from a study of the effect of size on defect proneness in open source software. The Mozilla product is adopted as the example of open source software. That study shows a significant relationship between the size of the software and its defect proneness, and the dataset it produced serves as the input to WEKA for comparing the data mining tools (algorithms). We use Naive Bayes, J48 decision trees, and K-Star for classification, and Expectation-Maximization (EM) and SimpleKMeans for clustering. The findings show how the algorithms differ in accuracy and in the time each takes to reach a result. Furthermore, the effect of software size on defect proneness is significant. Finally, the paper identifies the best algorithm in terms of accuracy and time consumption so that it can be relied on in subsequent classification and clustering tests.
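The comparison criteria described in the abstract (accuracy and time to reach a result) can be illustrated with a minimal sketch. It is not the WEKA setup used in the paper: the two classifiers are simple stand-ins written for this example (a majority-class baseline and a one-feature size threshold), the names `MajorityClass`, `SizeThreshold`, and `evaluate` are hypothetical, and the rows are synthetic placeholders rather than the Mozilla dataset.

```python
import time
from collections import Counter


class MajorityClass:
    """Baseline stand-in: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class SizeThreshold:
    """One-feature stump: predicts defect-prone at or above a size cut-off."""

    def fit(self, X, y):
        # Try each observed size as the cut-off; keep the one with the
        # fewest training errors.
        best = None
        for cut in sorted({row[0] for row in X}):
            errors = sum((row[0] >= cut) != label for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, cut)
        self.cut = best[1]
        return self

    def predict(self, X):
        return [row[0] >= self.cut for row in X]


def evaluate(model, X_train, y_train, X_test, y_test):
    """Return (accuracy, seconds) for fitting on the training split
    and predicting the test split."""
    start = time.perf_counter()
    predictions = model.fit(X_train, y_train).predict(X_test)
    elapsed = time.perf_counter() - start
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return correct / len(y_test), elapsed


# Synthetic placeholder rows (NOT the paper's data):
# each row is (size in lines of code,), label True means defect-prone.
X_train = [(120,), (300,), (450,), (600,), (900,), (1500,), (2200,)]
y_train = [False, False, False, False, True, True, True]
X_test = [(200,), (800,), (1000,), (3000,)]
y_test = [False, False, True, True]

results = {}
for name, model in [("majority", MajorityClass()),
                    ("size-threshold", SizeThreshold())]:
    results[name] = evaluate(model, X_train, y_train, X_test, y_test)
    accuracy, seconds = results[name]
    print(f"{name}: accuracy={accuracy:.2f}, time={seconds:.6f}s")
```

The same `evaluate` call could wrap any model exposing `fit` and `predict`, so each algorithm is scored on identical splits and timed the same way; the measured times depend on the machine and are only meaningful relative to one another.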
