Abstract:
In this dissertation, we analyze the software defect prediction problem from a data mining perspective, where software characteristics are represented by static code features and defect predictors are learned from historical defect logs. We observe that straightforward applications of data mining methods for constructing defect predictors have reached a performance limit because of the limited information content of static code features. We therefore aim to increase the information content of the data without introducing new features, since collecting new features may be expensive or not possible in all contexts. To supply data mining methods with richer data, we propose three methods: 1) relaxing the assumptions of data miners, 2) using project data from multiple companies, and 3) modeling the interactions of software modules. For the first method, we use the naive Bayes data miner and remove its assumptions of i) feature independence and ii) equal importance of features. For the second, we compare the performance of defect predictors learned from local and remote data. For the third, we introduce a call graph technique to model the interactions of modules. Our results on public industrial data show that 1) relaxing the assumptions of naive Bayes may increase defect prediction performance significantly; 2) predictors learned from remote data detect defects well but at the cost of high false alarm rates, and this cost can be removed with the proposed filtering method; and 3) the proposed way of modeling interactions may decrease false alarm rates significantly. Our techniques provide guidelines for 1) employing defect prediction using remote information sources when local data are not available and 2) increasing prediction performance using local information sources.
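As a rough illustration of what relaxing the equal-importance assumption of naive Bayes can look like, the sketch below scales each feature's log-likelihood by a weight before summing. It is not the dissertation's implementation: the Gaussian likelihood, the function names (`fit_gaussian_nb`, `predict`), the toy feature values, and the weights themselves are all assumptions made for illustration, and the abstract does not say how weights are derived.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids a zero variance
        model[label] = (len(members) / len(rows), stats)
    return model


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, row, weights=None):
    """Pick the class with the highest weighted log-posterior score.

    weights=None gives every feature weight 1.0, which is standard naive Bayes.
    Other weights (hypothetical here) let some features count more than others.
    """
    weights = weights or [1.0] * len(row)
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var), w in zip(row, stats, weights):
            score += w * log_gaussian(x, mean, var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (invented): two static code features per module, label 1 = defective.
rows = [[10, 5], [12, 1], [11, 3], [50, 4], [55, 2], [52, 3]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
```

Calling `predict(model, module)` with and without a `weights` argument shows how down-weighting a feature can change the predicted class for the same module.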