Chapter 4 Bug Prediction

4.1 Motivation

Minimizing the number of bugs in software is a central effort of software engineering: faulty code fails to fulfill the purpose it was written for, its impact ranges from slightly embarrassing to disastrous and dangerous, and, last but not least, fixing it costs time and money. Resources in a software development life cycle are almost always limited and should therefore be allocated where they are needed most; to avoid bugs, they should be focused on the most fault-prone areas of the project. Being able to predict where such areas lie would allow development and testing efforts to be directed to the right places.

However, reliably predicting which parts of source code are the most fault-prone is one of the toughest problems of software engineering [65]. Thus it is not surprising that bug prediction continues to garner widespread research interest in software analytics, now equipped with the ever-expanding toolbox of data-mining and machine learning techniques. In this survey, we investigate the current efforts in bug prediction in light of the advances in software analytics methods and focus our attention on answering the following research questions:

  • RQ1 What is the current state of the art in bug prediction? More specifically, we aim to answer the following:
    • What software or other metrics do bug prediction models rely on, and how good are they?
    • What kinds of prediction models are predominantly used?
    • How are bug prediction models and results validated and evaluated?
  • RQ2 What is the current state of practice in bug prediction?
    • Are bug prediction techniques applied in practice and if so, how?
    • Are the current developments in the field able to provide actionable tools for developers?
  • RQ3 What are some of the open challenges and directions for future research?

4.2 Research protocol

We started by studying the initial six seed papers, which were selected based on domain knowledge:

Reference | Topic              | Summary
[77]      | metrics validation | performance of object-oriented metrics
[39]      | literature review  | comparison of metrics, methods, datasets
[9]       | literature review  | comparison of models, metrics, performance measures
[64]      | BP performance     | benchmark for bug prediction, evaluation of approaches using the benchmark
[78]      | literature review  | influence of model context, methods, metrics on performance
[112]     | case study         | BP deployment in industry

We then looked for additional papers using the following search and filtering steps:

  1. Keyword search using search engines (Google Scholar, Scopus, ACM Digital Library, IEEE Xplore). The search query required a paper to contain the phrase bug prediction or one of the more general variants used in the literature (bug/defect/fault prediction). The paper also had to contain at least one of the following keywords: metrics, models, validation, evaluation, developers.

  2. Filtering search results by publication date. We narrowed the scope to recent papers, which we define as “published within the last 10 years” for the purposes of this review, and therefore excluded papers published before 2008. However, we made a few exceptions for papers with very high impact.

  3. Filtering by the number of citations. We selected papers with 10 or more citations in order to focus on those that already have some visibility within the field, again with some exceptions for very recently published papers.

  4. Exploring other impactful publications by the same authors.

All selected papers had to fulfill the above criteria; Table 2 lists the papers found through step 4.

Table 2. Papers found by investigating the authors of other papers.

Starting paper | Relationship | Result
[64]           | is author of | [65]
[39]           | is author of | [38], [40]
[158]          | is author of | [157]
[74]           | is author of | [73]

4.3 Answers

4.3.1 RQ1: Metrics

To find out which metrics are commonly used to build bug prediction models, we studied the papers that devote either their entirety or at least a section to the discussion of software metrics used for bug prediction. Based on the results and observations made in the relevant studies, we classify the most commonly used software metrics into two groups, based on which aspect of a piece of software they measure:

  • The product: the (static) code itself.
  • The process: the different aspects that describe how the product has developed over time. We include both simpler change-log-based metrics that require different versions of the code and metrics that require detailed process recording.

4.3.1.1 Product metrics

Different metrics incur different costs, depending on whether they require additional effort to collect data and set up an instrumentation infrastructure. Measuring various aspects of what is already available, a snapshot of the source code, is an obvious starting point when building a set of predictive features, and it turns out that code metrics are the most commonly studied metrics in the literature [39]. Traditional examples of metrics obtained by static analysis include lines of code (LOC) and code complexity (for example, McCabe and Halstead complexities), with the latter applicable to languages with structured control flow and the concept of methods. The rationale behind using code complexity as a metric for bug prediction is that complex code is difficult to change and is therefore bug-prone [65].
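
To make this concrete, the following is a minimal sketch (not taken from any of the surveyed tools) of how such static product metrics could be collected for a Python snippet: non-blank lines of code and a rough McCabe-style cyclomatic complexity obtained by counting branching nodes in the AST. The set of node types counted as decision points is an assumption of this sketch.

```python
import ast

# Node types counted as decision points for a rough McCabe-style measure
# (this particular set is an assumption of the sketch, not a standard).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                  ast.ExceptHandler, ast.BoolOp)

def product_metrics(source: str) -> dict:
    """Simple static product metrics for one Python source snippet."""
    loc = sum(1 for line in source.splitlines() if line.strip())  # non-blank LOC
    decisions = sum(isinstance(node, DECISION_NODES)
                    for node in ast.walk(ast.parse(source)))
    return {"loc": loc, "cyclomatic_complexity": decisions + 1}

example = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
print(product_metrics(example))   # {'loc': 4, 'cyclomatic_complexity': 2}
```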

Another group of more language-specific metrics, which became popular after 1990, are the object-oriented or class metrics, which are based on the notion of classes and measure class-related concepts such as the number of methods, coupling, inheritance and cohesion [77].
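
A companion sketch for class metrics is shown below; it is again only an assumption-laden approximation and not one of the validated metric suites from [77]. It extracts the number of methods per class and the number of direct base classes from a Python AST.

```python
import ast

def class_metrics(source: str) -> dict:
    """Map each class in a Python snippet to simple object-oriented metrics."""
    metrics = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            methods = [n for n in node.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            metrics[node.name] = {
                "num_methods": len(methods),          # size proxy, WMC-like
                "num_base_classes": len(node.bases),  # crude inheritance proxy
            }
    return metrics

print(class_metrics("class A:\n    def f(self): pass\n    def g(self): pass\n"))
# {'A': {'num_methods': 2, 'num_base_classes': 0}}
```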

A drawback of static code metrics is that they have high stasis, meaning they do not change much from release to release [157]. In turn, this can cause stagnation in the prediction models, which then classify the same files as bug-prone over and over.

4.3.1.2 Process metrics

The other prominent group of metrics are the process metrics, which relate to the software development process and to the changes of the software over time. Additional infrastructure has to be in place to capture process features; typically at least a version control system is required. One rationale behind using these metrics in a predictive model is that bugs are introduced by changes to the software, so those changes should be studied [65].

Some examples of process metrics are code churn, number of revisions, active developer count, distinct developer count, changed-code scattering, average lines of code added per revision and the age of a file [137]. There are also metrics based on prior faults, which build on the argument that files that contained faults in the past will probably contain more faults in the future. It has been shown that predictors based on previous bugs perform better than predictors based on code changes [158].
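
As an illustration of the infrastructure involved, the hypothetical sketch below mines a local Git repository with git log --numstat to compute three of the metrics above per file: number of revisions, distinct developer count and code churn (lines added plus deleted). The repository path is a placeholder and the parsing is deliberately simplistic.

```python
import subprocess
from collections import defaultdict

def process_metrics(repo_path: str) -> dict:
    """Compute simple per-file process metrics from `git log` output."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--format=--%an"],
        capture_output=True, text=True, check=True,
    ).stdout
    metrics = defaultdict(lambda: {"revisions": 0, "churn": 0, "authors": set()})
    author = None
    for line in log.splitlines():
        if line.startswith("--"):
            author = line[2:]                       # commit boundary: author name
        elif line.strip():
            added, deleted, path = line.split("\t")
            entry = metrics[path]
            entry["revisions"] += 1
            entry["authors"].add(author)
            if added != "-":                        # "-" marks binary files
                entry["churn"] += int(added) + int(deleted)
    return {p: {"revisions": m["revisions"],
                "churn": m["churn"],
                "distinct_developers": len(m["authors"])}
            for p, m in metrics.items()}

# Hypothetical usage: print(process_metrics("/path/to/some/repo"))
```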

Other metrics are derived by looking at the source code change history as either delta metrics or code churn [155]. Delta metrics are not separate new metrics per se, but are derived from other metrics by comparing versions of software to each other.
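
A delta metric can therefore be pictured as a subtraction over two snapshots of the same base metric; the small helper below is a hypothetical illustration that takes per-file metric values of two releases and returns the difference for files present in both.

```python
def delta_metrics(old: dict, new: dict) -> dict:
    """Delta of a metric between two releases, for files present in both."""
    return {path: new[path] - old[path] for path in new.keys() & old.keys()}

# Example: LOC measured at release N-1 and at release N (made-up numbers).
loc_delta = delta_metrics({"a.py": 120, "b.py": 300}, {"a.py": 150, "b.py": 290})
print(loc_delta)   # {'a.py': 30, 'b.py': -10}
```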

Some studies found that process metrics outperform code metrics when the prediction results of both are compared [137]. It has also been shown that process metrics are more stable across releases [157]. However, such results should be taken with a grain of salt: when a selection of representative bug prediction approaches was compared, the results concerning metrics depended strongly on the choice of learner and could not necessarily be generalized [65].

There are potential advantages in using more fine-grained source code changes, captured at the statement level and including the changes’ semantics [74]. This data can be obtained by building abstract syntax trees (ASTs) and using the edit operations required to transform one AST into the other as metrics. Models based on these fine-grained changes were found to outperform models based on coarser metrics such as code churn; however, the gain in performance comes at the cost of extracting the fine-grained changes.
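
Extracting true fine-grained changes requires tree differencing as used in [74]; as a very crude stand-in, the sketch below merely compares the frequency of AST node types between two versions of a file. It conveys the idea of counting statement-level edits without implementing an actual tree-diff algorithm, so treat it as an approximation only.

```python
import ast
from collections import Counter

def node_type_counts(source: str) -> Counter:
    """Count AST node types in one version of a file."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def coarse_fine_grained_change(old_source: str, new_source: str) -> int:
    """Rough proxy for the number of fine-grained changes between two versions."""
    old, new = node_type_counts(old_source), node_type_counts(new_source)
    return sum(abs(new[t] - old[t]) for t in old.keys() | new.keys())
```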

4.3.1.3 Developer-based metrics

Besides metrics from code repositories, we can also extract metrics about the developers themselves. It has been shown that modules touched by more developers are more likely to contain bugs [124], and that using developer-based metrics in combination with other metrics can improve prediction results [124]. Furthermore, it turns out that developer experience has no clear correlation with how many faults a developer introduces [158].

Faults can also be predicted by tracking developers’ micro-interactions [109]. The authors find that the most informative metrics are the number of editing events on low-degree-of-interest files, editing and selecting bug-prone files consecutively, and the time spent on editing. Their prediction results exceed those of some product and process metrics.

Another set of metrics are the scattering metrics, which use data about how focused a developer’s work is [56]. Structural scattering describes how structurally far apart in the project the changed code is. Semantic scattering measures how different the edited code is in terms of the responsibilities it implements, which is measured by textually comparing the code. The authors find that these metrics achieve relatively high accuracy and perform better than a baseline selection of code and process metrics.
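
As an illustration only, and not the exact formulation of [56], structural scattering can be approximated by how far apart, in terms of directory paths, the files touched by a developer lie; the sketch below averages the pairwise path distance, counted as the number of directory components the two paths do not share.

```python
from itertools import combinations

def path_distance(a: str, b: str) -> int:
    """Number of directory steps separating two file paths in the project tree."""
    pa, pb = a.split("/")[:-1], b.split("/")[:-1]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def structural_scattering(touched_files: list) -> float:
    """Average pairwise path distance between the files a developer changed."""
    pairs = list(combinations(touched_files, 2))
    if not pairs:
        return 0.0
    return sum(path_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical example: edits concentrated in one package scatter less
# than edits spread over unrelated packages.
print(structural_scattering(["db/models/user.py", "db/models/order.py"]))   # 0.0
print(structural_scattering(["ui/views/login.py", "db/models/user.py"]))    # 4.0
```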

4.3.2 RQ1: Models

In the section above we have seen different kinds of metrics; in this section we show how these metrics are used to create bug prediction models that can actually be used. We also show some preliminary results of these models; however, as we explain later on, it is not trivial to compare the results of different models.

4.3.2.1 History Complexity Metric [80]

The underlying assumption is that files modified during periods of high change complexity will contain more faults: developers changing code during such periods will make mistakes, because they are probably not aware of the current state of the code. This method is useful in practice because companies may not have a full bug history, while they probably do have a history of code changes. The method outperforms models based on prior faults, with a 15-38% decrease in prediction errors; however, D’Ambros et al. contradict this finding [64].
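
Hassan’s metric builds on the entropy of how changes scatter over files within a period [80]. The sketch below shows only that core ingredient: it computes the Shannon entropy of each period’s change distribution and assigns it to every file touched in the period, which is a simplified assignment scheme rather than the exact HCM variants of the paper.

```python
import math
from collections import Counter

def change_entropy(changed_files: list) -> float:
    """Shannon entropy of how a period's changes are distributed over files."""
    counts = Counter(changed_files)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def history_complexity(periods: list) -> dict:
    """Assign each file the summed entropy of the periods in which it changed.

    `periods` is a list of lists of changed file paths, one list per period.
    This is a simplified assignment scheme, not the exact HCM variants of [80].
    """
    scores = Counter()
    for changed_files in periods:
        entropy = change_entropy(changed_files)
        for path in set(changed_files):
            scores[path] += entropy
    return dict(scores)
```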

4.3.2.2 FixCache [102]

FixCache maintains a fixed-size cache of the files that are most likely to contain bugs. Files are added to the cache if they meet one of the following locality criteria:

  • Churn locality: if a file is recently modified, it is likely to contain faults
  • Temporal locality: if a file contains a fault, it is more likely to contain more faults
  • Spatial locality: files that change alongside faulty files are more likely to contain faults

If the cache is full, the least recently used file is removed from it. The size of the cache is set to 10% of all files. The hit rate of the cache is about 73-95% at file level and 46-72% at method level, showing that a fairly small cache captures a large share of the locations of future faults.
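
The following is a heavily simplified sketch of the caching idea, not the original implementation from [102]: it keeps an LRU-evicted, fixed-size set of suspect files and refreshes it on every commit using the three locality rules above, while omitting the block-level tracking and pre-fetching heuristics of the original.

```python
from collections import OrderedDict

class SimpleFixCache:
    """Toy version of the FixCache idea: an LRU set of suspect files."""

    def __init__(self, all_files, fraction=0.1):
        self.capacity = max(1, int(len(all_files) * fraction))   # 10% of files
        self.cache = OrderedDict()

    def _touch(self, path):
        self.cache[path] = True
        self.cache.move_to_end(path)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)        # evict least recently used

    def on_commit(self, changed_files, bug_fix_files=()):
        # Churn locality: every file touched by the commit becomes a suspect;
        # co-changed files also stand in for spatial locality here.
        for path in changed_files:
            self._touch(path)
        # Temporal locality: files changed to fix a bug are touched last,
        # so they become the most recently used entries in the cache.
        for path in bug_fix_files:
            self._touch(path)

    def predicted_fault_prone(self):
        return set(self.cache)
```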

4.3.2.3 Rahman [158]

The Rahman algorithm builds on the FixCache algorithm. In their research, Rahman et al. found that temporal locality was by far the most influential factor, so the Rahman algorithm is implemented based solely on that factor.

4.3.2.4 Time-weighted risk algorithm [112]

This algorithm builds on the Rahman algorithm but adds a weight based on how old a commit is: older bug-fixing commits have less impact on the overall bug-proneness of a file. Instead of flagging the top 10% of bug-prone files, it shows the top 20 files.
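
A minimal sketch of this idea is given below. The logistic weighting of normalized commit age mirrors the shape reported for the Google deployment, but the exact constants and the data layout are assumptions of this sketch rather than the authoritative formula.

```python
import math
from collections import defaultdict

def time_weighted_risk(bug_fix_commits, top_n=20):
    """Rank files by a recency-weighted count of bug-fixing commits.

    `bug_fix_commits` is a list of (timestamp, [changed file paths]) tuples.
    Older fixes contribute less; the logistic weight on normalized time is
    an assumption modelled on the shape described by Lewis et al. [112].
    """
    if not bug_fix_commits:
        return []
    times = [t for t, _ in bug_fix_commits]
    t_min, t_max = min(times), max(times)
    span = (t_max - t_min) or 1.0
    scores = defaultdict(float)
    for t, files in bug_fix_commits:
        normalized = (t - t_min) / span              # 0 = oldest fix, 1 = newest
        weight = 1.0 / (1.0 + math.exp(-12.0 * normalized + 12.0))
        for path in files:
            scores[path] += weight
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]                            # top 20 files by default
```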

4.3.2.5 Machine learning

With machine learning methods, bugs can be predicted by tools that train on the code base, the code history and other data mined from software repositories. A downside is that these methods give little insight into why a component is bug-prone [112]. Another downside is that they do not work for new projects, since cross-project defect prediction does not yet give good results [201].
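
As a generic illustration of this family of approaches, and not a specific model from the surveyed papers, the sketch below trains a random forest classifier on a small, made-up metrics table, assuming scikit-learn is available and that each row holds the metrics of one file together with a label recording whether the file later contained a bug.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: one row per file, columns are metrics such as
# LOC, complexity, churn, number of revisions, distinct developers.
X = np.array([
    [120,   4,  35,  3, 1],
    [980,  27, 410, 21, 5],
    [300,   9,  80,  6, 2],
    [1500, 33, 700, 30, 7],
    [60,    2,  10,  1, 1],
    [400,  14, 150,  9, 3],
])
y = np.array([0, 1, 0, 1, 0, 1])   # 1 = file later contained a bug

model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
print("Mean AUC over folds:", scores.mean())
```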

4.3.2.6 Analysis on Abstract Syntax Trees

Because ASTs give a higher-level description of a program’s syntax, they can offer more useful insights for bug prediction than other analysis methods. A downside of ASTs is that dynamic programming languages cannot produce static ASTs. Wang et al. use ASTs in combination with a machine learning approach for bug prediction, and show that using ASTs can also improve cross-project prediction results [191].
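
The feature extraction step can be pictured as turning each file into a sequence of AST node tokens that a learner can consume. The sketch below does this for Python with the standard ast module; it is only an analogy to the Java tooling and deep learning pipeline used by Wang et al. [191].

```python
import ast

def ast_token_sequence(source: str) -> list:
    """Linearize a file's AST into a sequence of node-type tokens."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

# The sequence can then be encoded (e.g. as token indices) and fed to a
# learner; Wang et al. use a deep belief network on such sequences [191].
print(ast_token_sequence("def f(x):\n    return x + 1\n")[:8])
```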

4.3.3 RQ1: Evaluations

In order to answer this question we looked at the scientific papers that benchmark prediction models in any way, and specifically at the way the models are evaluated. We found that most papers use a simple yet informative approach to benchmark an algorithm’s performance: the Area Under the Curve (commonly referred to as AUC) of the Receiver Operating Characteristic (ROC). The ROC curve plots the true positive rate, i.e. the ability of an algorithm to find all existing bug-prone files, against the false positive rate, i.e. how often non-buggy files are wrongly flagged, across all classification thresholds; the area under this curve summarizes the trade-off in a single number [65].
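
For concreteness, the snippet below computes the ROC curve and its AUC for a toy set of predictions using scikit-learn (assumed to be available); the labels mark which files actually turned out to be buggy and the scores are a model’s predicted bug-proneness.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]                  # 1 = file turned out to be buggy
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]    # model's predicted bug-proneness

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))   # area under the ROC curve
```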

Other authors use different methodologies to rank an algorithm’s performance. Jiang et al. focus solely on how to evaluate algorithm performance; they use accuracy and precision in numerical and graphical form, but also introduce a graphical technique that helps identify where to spend project resources in order to obtain greater software quality [93]. Finally, another method is to use linear regression to assess a fault prediction method’s predictive power [64].

In conclusion, the most widely used methodology is AUC. Citing Lessmann et al.: “The AUC was recommended as the primary accuracy indicator for comparative studies in software defect prediction since it separates predictive performance from class and cost distributions, which are project-specific characteristics that may be unknown or subject to change.” [110].

The conclusion of Shepperd et al. is also worth noting: they found that around 30 percent of the variance in reported performance can be “explained” by the research group conducting the study, while the variability due to the choice of classifier is very small [170].

4.3.4 RQ2: Practical usage

There seems to be little to no usage of bug prediction tools in industry; even searching outside the scope of the scientific literature, we did not find any evidence of adoption. On GitHub we could only find a handful of repositories that implement a bug prediction tool, and none of them is actively maintained any longer.

Furthermore, all the tools we did find were based on [112] or [158]. Lewis et al. even mention that their tool had no significant effect on developer behavior [112]. Kim et al. suggest that software managers could use a list of bug-prone files to allocate more quality assurance to some parts of the code, but again, no industry usage could be found [102].

We did find some reasons that could explain why there is no practical usage of bug prediction yet. The issues we identified relate to the following properties that a bug prediction tool should have in practice:

  • Obvious reasoning: If a tool gives no clear reasoning, users might quickly discard it because they do not understand why a file is flagged as bug-prone [112].
  • Bias towards the new: For some metrics this property is required, because otherwise it would be impossible to ‘fix’ a file; it would remain bug-prone even if completely rewritten [112].
  • Actionable messages: Messages shown by bug prediction tools must be actionable. The user must be able to perform an action to fix the problem with the file; otherwise users will not know what to do with the tool [112].
  • Noise problem: Noise in the historical change data has a large impact on the performance of the model, and since the data is mostly mined from software repositories, it will contain noise [101].
  • Granularity: Most methods we found work at class- or file-level granularity; a finer level of granularity could improve the usefulness of the predictions for developers [73].

4.3.5 RQ2: Actionable tools for developers

The only paper that tackles this question is by Lewis et al., in which the authors deploy bug prediction software to developers’ workspaces at Google and evaluate developer behavior before and after introducing it [112]. The study concludes that, currently, fault prediction does not provide actionable tools for developers.

Beyond this paper we found no further information in scientific publications, so we had to look outside the academic world to see whether these techniques are being used by developers. After some research we concluded that bug prediction is currently not being used by developers, and, like the authors of [112], we think that bug prediction may be better suited to software quality management, i.e. finding the places where project resources should be invested, than to helping developers write code without bugs.

4.3.6 RQ3: Open challenges and future work

We believe the bug prediction field still has many open challenges, for example the use of developer-related factors in predictions. No strong conclusions have been reached so far, so the field requires considerable further investigation to find better models and useful applications of these techniques. All of these challenges are anchored in external validity, which is very hard to establish.

In conclusion, bug prediction is still only a partially explored area, and even though some algorithms have proven successful to some degree, we still lack the evidence and the reliable algorithms needed to apply them in the real world.

References

[65] D’Ambros, M. et al. 2012. Evaluating defect prediction approaches: A benchmark and an extensive comparison. Empirical Software Engineering. 17, 4-5 (2012), 531–577.

[77] Gyimothy, T. et al. 2005. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering. 31, 10 (Oct. 2005), 897–910.

[39] Catal, C. and Diri, B. 2009. A systematic review of software fault prediction studies. Expert Systems with Applications. 36, 4 (2009), 7346–7354.

[9] Arisholm, E. et al. 2010. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software. 83, 1 (2010), 2–17.

[64] D’Ambros, M. et al. 2010. An extensive comparison of bug prediction approaches. Proceedings - International Conference on Software Engineering. (2010), 31–41.

[78] Hall, T. et al. 2012. A Systematic Literature Review on Fault Prediction Performance in Software Engineering. IEEE Transactions on Software Engineering. 38, 6 (Nov. 2012), 1276–1304.

[112] Lewis, C. et al. 2013. Does bug prediction support human developers? Findings from a Google case study. 2013 35th International Conference on Software Engineering (ICSE) (May 2013), 372–381.

[38] Catal, C. 2011. Software fault prediction: A literature review and current trends. Expert Systems with Applications. 38, 4 (2011), 4626–4636.

[40] Catal, C. and Diri, B. 2009. Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem. Information Sciences. 179, 8 (2009), 1040–1058.

[158] Rahman, F. et al. 2011. BugCache for inspections: Hit or miss? Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (New York, NY, USA, 2011), 322–331.

[157] Rahman, F. and Devanbu, P. 2013. How, and why, process metrics are better. 2013 35th International Conference on Software Engineering (ICSE) (May 2013), 432–441.

[74] Giger, E. et al. 2011. Comparing fine-grained source code changes and code churn for bug prediction. Proceedings of the 8th Working Conference on Mining Software Repositories (New York, NY, USA, 2011), 83–92.

[73] Giger, E. et al. 2012. Method-level bug prediction. Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (New York, NY, USA, 2012), 171–180.

[137] Moser, R. et al. 2008. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. Proceedings of the 30th International Conference on Software Engineering (New York, NY, USA, 2008), 181–190.

[155] Radjenović, D. et al. 2013. Software fault prediction metrics. Information and Software Technology. 55, 8 (Aug. 2013), 1397–1418.

[124] Matsumoto, S. et al. 2010. An analysis of developer metrics for fault prediction. Proceedings of the 6th International Conference on Predictive Models in Software Engineering (New York, NY, USA, 2010), 18:1–18:9.

[109] Lee, T. et al. 2011. Micro interaction metrics for defect prediction. Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (New York, NY, USA, 2011), 311–321.

[56] Di Nucci, D. et al. 2018. A developer centered bug prediction model. IEEE Transactions on Software Engineering. 44, 1 (2018), 5–24.

[80] Hassan, A.E. 2009. Predicting faults using the complexity of code changes. Proceedings of the 31st International Conference on Software Engineering (Washington, DC, USA, 2009), 78–88.

[102] Kim, S. et al. 2007. Predicting faults from cached history. Proceedings of the 29th International Conference on Software Engineering (Washington, DC, USA, 2007), 489–498.

[201] Zimmermann, T. et al. 2009. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (New York, NY, USA, 2009), 91–100.

[191] Wang, S. et al. 2016. Automatically learning semantic features for defect prediction. 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE) (May 2016), 297–308.

[93] Jiang, Y. et al. 2008. Techniques for evaluating fault prediction models. Empirical Software Engineering. 13, 5 (Oct. 2008), 561–595.

[110] Lessmann, S. et al. 2008. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering. 34, 4 (2008), 485–496.

[170] Shepperd, M. et al. 2014. Researcher bias: The use of machine learning in software defect prediction. IEEE Transactions on Software Engineering. 40, 6 (June 2014), 603–616.

[101] Kim, S. et al. 2011. Dealing with noise in defect prediction. Proceedings of the 33rd International Conference on Software Engineering (New York, NY, USA, 2011), 481–490.