Chapter 2 Testing Analytics

2.1 Motivation

Testing is an important aspect of software engineering, as it forms the first line of defense against the introduction of software faults [152]. In practice, however, not all developers seem to test actively. In this chapter we survey how testing is used and which tools make it possible, and we look at the tool development that is underway or still required to improve testing practices in real-world applications. Testing is not a holy grail that completely removes all bugs from a program, but it can decrease the chance that a user encounters one. We believe that additional research is needed to ease the life of developers by making testing more efficient, easier to maintain and more effective. We therefore surveyed the testing behavior, current practices and future developments of testing. To structure the survey, we formulated three Research Questions (RQs):

  • RQ1 How do developers currently test?
  • RQ2 What state of the art technologies are being used?
  • RQ3 What future developments can be expected?

In this chapter we first elaborate on the research protocol that was used to find papers and extract information for the survey. We then present the findings for each of the research questions.

2.2 Research Protocol

For this chapter, Kitchenham’s survey method [104] was applied. This method requires a protocol to be specified, which we defined for the research questions given above. Below, the inclusion and exclusion criteria that guided the paper selection are given, followed by a description of the actual search. The papers that were found are listed and then tested against these criteria, and the data extracted from them is listed afterward. Papers that were left out are also listed, together with the reasons for excluding them, to make clear why they do not meet the requirements.

Each of the papers was tested against our inclusion and exclusion criteria. These criteria were introduced to ensure that the papers contain the information required to answer the RQs while also being relevant in terms of quality and age. In general, exclusion criteria take precedence over inclusion criteria. The following inclusion and exclusion criteria were used:

  • Papers published before 2008 are excluded from the research, unless they are cited for a concept that has not changed since publication.
  • Papers referring to less than 15 other papers, excluding self-references, are excluded from the research.
  • Selected papers should have an abstract, introduction and conclusion section.
  • Papers stating the developers’ testing behavior are included.
  • Papers stating the developers’ problems related to testing are included.
  • Papers stating the technologies related to testing analytics, which developers use, are included.
  • Papers writing about the expected advantage of current findings in testing analytics are included.
  • Papers with recommendations for future development in the software testing field are included.

The papers used in this chapter were found by starting from a given initial seed of papers (marked below as ‘Initial Paper Seed’). From this initial seed we used the keywords of those papers to construct queries. Additionally, the references (‘referenced by’) and the citations (‘cited in’) of the papers were used to find further papers. The query column of the table below indicates how each paper was found. For queries, the default search engines were Scopus, Google Scholar and Springer.

The keywords used to construct queries in order to find papers were: software, test*, analytics, test-suite, evolution, software development, computer science, software engineering, risk-driven, survey software testing

The table below describes, for each paper, the query through which it was found. Each paper is categorized by its general topic, and these broad topics are mapped to the corresponding research questions. The categorizations are based on the bullet points extracted from each paper, which can be found in the appendix of this chapter in the section ‘Extracted paper information’.

Category | Reference | Query | Relevant to
Co-evolution | Greiler et al. [76] | In ‘cited by’ of “Understanding myths and realities of test-suite evolution” on Scopus | RQ2, RQ3
Co-evolution | Hurdugaci and Zaidman [89] | Keywords: Maintain developer tests, ‘cited by’ in “Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining” on IEEE | RQ2
Co-evolution | Marsavina et al. [122] | Google Scholar keywords: Maintain developer tests, in ‘cited by’ of “Aiding Software Developers to Maintain Developer Tests” on IEEE | RQ1
Co-evolution | Zaidman et al. [196] | Initial Paper Seed | RQ1
Production evolution | Eick et al. [66] | Referenced by: [111] | Discarded
Production evolution | Leung and Lui [111] | Initial Paper Seed | RQ3
Risk-driven testing | Atifi et al. [10] | In ‘cited by’ of “Risk-driven software testing and reliability” | RQ2, RQ3
Risk-driven testing | Hemmati and Sharifi [85] | In ‘cited by’ of “Test case analytics: Mining test case traces to improve risk-driven testing” | RQ3
Risk-driven testing | Noor and Hemmati [141] | Initial Paper Seed | RQ2, RQ3
Risk-driven testing | Schneidewind [167] | Scopus query: risk-driven testing | RQ3
Risk-driven testing | Vernotte et al. [189] | Scopus query: “risk-driven” AND testing | RQ2, RQ3
Test evolution | Bevan et al. [27] | Referenced by: [152] | Discarded
Test evolution | Mirzaaghaei et al. [135] | Google Scholar query: test-suite evolution | RQ2, RQ3
Test evolution | Pinto et al. [152] | Initial Paper Seed | RQ1
Test evolution | Pinto et al. [151] | Referenced by: [152] | RQ1
Test generation | Bowring and Hegler [35] | Springer: Reverse search on “Automatically generating maintainable regression unit tests for programs” | RQ2, RQ3
Test generation | Dulz [62] | Scopus query: “software development” AND Computer Science AND Software Engineering | RQ2
Test generation | Robinson et al. [162] | Referenced by: [135] | RQ2
Test generation | Shamshiri et al. [169] | Google Scholar query: Automatically generating unit tests | RQ3
Testing practices | Beller et al. [26] | In ‘cited by’ of “Understanding myths and realities of test-suite evolution” | RQ1
Testing practices | Beller et al. [22] | Initial Paper Seed | RQ1
Testing practices | Garousi and Zhi [71] | Google Scholar query: Survey software testing | RQ1
Testing practices | Moiz [136] | Springer query: software testing | RQ3

2.3 Results

In this section the research questions are answered. To answer them, information from the relevant papers is aggregated. The answer to each research question is summarized in the conclusion.

2.3.1 (RQ1) How do developers currently test?

To answer RQ1, “How do developers currently test?”, we first outline general testing practices, then discuss the co-evolution of test and production code, and finally look into the use of Test-Driven Development among developers.

2.3.1.1 How do we test?

Test coverage is a popular metric for code quality. Alternatives are, for example, acceptance tests, the number of defects in the last week, or defects per Line of Code (LOC) [71]. However, code coverage might not be the best indicator of how extensively code is tested. For example, according to Beller et al. [23], a code coverage of 75% can be reached while spending less than a tenth of the total development time on testing. Another concern with using test coverage as a metric is the pitfall of treating the metric [34], where developers try to inflate the code coverage value by hitting many lines with only a few test cases. Marsavina et al. [122] observed that test cases were rarely updated when changes related to attributes or methods were made in the production code. Possible explanations are that these changes were not significant, or that the tests were too simple and likely to pass anyway. This also fits the findings of Romano et al. [164], who claim that “[d]evelopers write quick-and-dirty production code to pass the tests, do not update their tests often, and ignore refactoring.”
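To make the contrast concrete, the following sketch (with hypothetical class and method names of our own, not taken from the cited studies) shows how a single assertion-free JUnit test can inflate line coverage without validating behavior, whereas a focused test contributes less coverage but actually checks a result:

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.assertEquals;

    // Hypothetical production class, kept minimal so the example compiles.
    class PriceCalculator {
        double totalPrice(int quantity, double unitPrice, String discountCode) {
            double total = quantity * unitPrice;
            if ("SUMMER10".equals(discountCode)) {
                total = total * 0.9;   // 10% discount
            }
            return total;
        }
    }

    class PriceCalculatorTest {

        // Inflates line coverage: it executes many lines of PriceCalculator,
        // but without assertions it can only detect crashes, not wrong results.
        @Test
        void executesEverythingWithoutCheckingAnything() {
            PriceCalculator calc = new PriceCalculator();
            calc.totalPrice(3, 9.99, "SUMMER10");  // covered, but unverified
            calc.totalPrice(0, 0.0, null);         // covered, but unverified
        }

        // Contributes less coverage per test, but pins down actual behavior.
        @Test
        void appliesTenPercentDiscountCode() {
            PriceCalculator calc = new PriceCalculator();
            assertEquals(26.973, calc.totalPrice(3, 9.99, "SUMMER10"), 0.001);
        }
    }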

Besides older tests rarely being updated for changed code, even new tests do not necessarily serve to validate new production code. Pinto et al. [152] observed that a significant number of newly added tests were not written to cover new code, but rather to exercise the changed parts of the code after the program was modified. This fits the observation of Marsavina et al. [122] that test cases are created or deleted to address modified branches whenever many condition-related changes are made in the production code base. Older production code lines may therefore remain untouched by any test case. Since any traditional code coverage tool indicates and signals uncovered lines, developers can be assumed to be aware that some lines of their production code are not covered by any test. Leaving older production code lines uncovered therefore appears to be a deliberate choice: these lines might be too hard to test, other lines may be easier to test, or developers may not see the relevance of testing these uncovered lines. Note, however, that the most commonly used coverage metrics are branch coverage and conditional coverage [71]. As both require exercising the different conditions of if-statements, it may well be that the absolute number of production code lines missed by tests is very low, while the number of missed conditions is higher.
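The following sketch (a hypothetical example of ours, not drawn from the cited studies) illustrates this: the two tests reach 100% line coverage of mayEdit, and even take the decision both ways, yet condition coverage still reports gaps because isAdmin and documentLocked never evaluate to true:

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.assertFalse;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    // Hypothetical production class, kept minimal so the example compiles.
    class AccessPolicy {
        boolean mayEdit(boolean isOwner, boolean isAdmin, boolean documentLocked) {
            if ((isOwner || isAdmin) && !documentLocked) {
                return true;   // reached by ownerMayEditUnlockedDocument
            }
            return false;      // reached by strangerMayNotEdit
        }
    }

    class AccessPolicyTest {
        private final AccessPolicy policy = new AccessPolicy();

        @Test
        void ownerMayEditUnlockedDocument() {
            assertTrue(policy.mayEdit(true, false, false));
        }

        @Test
        void strangerMayNotEdit() {
            assertFalse(policy.mayEdit(false, false, false));
        }

        // Every line above is now covered, but a condition coverage tool still
        // reports that isAdmin and documentLocked were never evaluated to true.
    }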

2.3.1.2 Co-evolution

In a case study conducted by Zaidman et al. [196], no evidence was found of increased testing activity before a release. The study did, however, detect periods of increased test writing activity, which typically followed longer periods of writing production code [196]. Given such long timespans without test writing, it can be concluded that in these cases production code and test code do not gracefully co-evolve [122, 196].
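As a minimal sketch of the kind of signal such repository mining extracts (the class, record and heuristic below are our own simplification, not the instrumentation used by Zaidman et al. [196]), one can bucket changed files per week and compare production versus test activity; long runs of weeks with production changes but no test changes point to non-graceful co-evolution:

    import java.time.LocalDate;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Sketch: per week, how many changed files were tests versus production code.
    class CoEvolutionSignal {

        record ChangedFile(LocalDate commitDate, String path) {}

        static boolean isTestFile(String path) {
            // Crude heuristic; real studies use more precise classification.
            return path.contains("/test/") || path.endsWith("Test.java");
        }

        static Map<Integer, int[]> weeklyActivity(List<ChangedFile> changes) {
            Map<Integer, int[]> perWeek = new TreeMap<>();  // week -> {prod, test}
            for (ChangedFile change : changes) {
                // Simplification: week index within a single year of history.
                int week = change.commitDate().getDayOfYear() / 7;
                int[] counts = perWeek.computeIfAbsent(week, w -> new int[2]);
                counts[isTestFile(change.path()) ? 1 : 0]++;
            }
            return perWeek;
        }
    }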

2.3.1.3 Test-Driven Development (TDD)

We found different definitions of TDD across the studies. Zaidman et al. [196] counted it as evidence of TDD when test code was committed alongside production code, which also covers cases where the production code was written before the corresponding test code. This is in contrast with the constraint originally proposed by Beck [20], where a line of production code may only be written after a failing automated test has been written for it.

The confusion about the definition of TDD is also supported by the finding of Beller et al. [26] that programmers who claim to practice TDD neither follow it strictly nor apply it to all of their modifications. A survey by Garousi and Zhi [71] among 196 respondents (including managers and developers) indicated that test-last development and test-driven development are used in a ratio of roughly 3:1. This ratio contrasts with the numbers found by Beller et al. [26]: only 1.7% of the observed developers followed the strict TDD definition, and most of them did so less than 20% of their time. It is important to note, however, that Garousi and Zhi [71] relied on self-reporting, so confusion about the definition of TDD may have played a major role in their results.
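For reference, the sketch below illustrates the strict cycle Beck describes; the Roman numeral example and all names are ours and purely illustrative. The failing test is written first, then just enough production code to make it pass, followed by refactoring:

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.assertEquals;

    // Step 1 (red): this test is written before RomanNumeral exists and fails first.
    class RomanNumeralTest {
        @Test
        void convertsFourToIV() {
            assertEquals("IV", RomanNumeral.of(4));
        }
    }

    // Step 2 (green): write just enough production code to make the test pass.
    class RomanNumeral {
        static String of(int value) {
            if (value == 4) {
                return "IV";   // deliberately minimal; generalized in later cycles
            }
            throw new IllegalArgumentException("not implemented yet: " + value);
        }
    }

    // Step 3 (refactor): clean up while keeping the test green, then repeat.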

2.3.2 (RQ2) What state of the art technologies are being used?

We will cover two research fields regarding testing analytics: test evolution and generation, and risk-driven testing.

2.3.2.1 Test Evolution and Generation

Pinto et al. [152] found that automated test repair is not a promising research avenue, as these techniques, despite their name, would require manual guidance and could end up resembling traditional refactoring tools. Nonetheless, more research has been performed in this field since then. Mirzaaghaei et al. [135] propose an approach for automatically repairing and generating test cases during software evolution. The approach uses information available in existing test cases, defines a set of heuristics to repair test cases invalidated by changes in the software, and generates new test cases for the evolved software. In their evaluation it properly repaired 90% of the compilation errors it addressed while covering a comparable number of instructions. The results show that the approach can effectively maintain evolving test suites and performs well compared to competing approaches.
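To illustrate the kind of repair such heuristics target, consider a hypothetical signature change (the Order example is ours, not taken from the paper’s evaluation): an added parameter breaks compilation of an existing test, and a repair in the spirit of Mirzaaghaei et al. [135] maps the old call onto the new signature with a default argument while keeping the original oracle:

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.assertEquals;

    // Evolved production code: addItem(String) became addItem(String, int),
    // which invalidated tests that still used the old single-argument call.
    class Order {
        private int items = 0;

        void addItem(String name, int quantity) {  // minimal stand-in
            items += 1;
        }

        int itemCount() {
            return items;
        }
    }

    class OrderTest {
        // Original test body was: order.addItem("coffee");
        // Repaired version: the added parameter is filled with a default value,
        // and the original assertion (the oracle) is left untouched.
        @Test
        void addingAnItemIncreasesTheItemCount() {
            Order order = new Order();
            order.addItem("coffee", 1);
            assertEquals(1, order.itemCount());
        }
    }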

While fully automated test suite generation cannot yet replace human testing entirely, Bowring and Hegler [35] introduced a tool that generates test templates: it guarantees compilation, supports exception handling and finds a suitable location for each test. Developers still need to supply the test oracles themselves, but the template is in place; the technique inspects the context to decide which template to use. Robinson et al. [162] created a tool for generating regression unit tests, consisting of a suite of techniques that enhance an existing unit test generation system. In experiments on an industrial system, the generated tests achieved good coverage and mutation kill scores, were readable by the product developers, and required few edits as the system under test evolved. Dulz [62] found that by directly adjusting specific probability values in the usage profile of a Markov chain usage model, it is relatively easy to generate abstract test suites for different user classes and test purposes in an automated way. With proper tools, such as the TestUS Testplayer, even less experienced test engineers can efficiently generate abstract test cases and graphically assess quality characteristics of different test suites. Finally, Hurdugaci and Zaidman [89] introduce TestNForce, a Visual Studio tool that helps developers identify the unit tests that need to be altered and executed after a code change.
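As an impression of what such template-based generation produces, the sketch below shows a hypothetical generated test skeleton (our own illustration, not the literal output of any of the tools discussed above): it compiles, handles the exception path, and leaves the oracle for the developer to supply:

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.fail;

    // Minimal stand-in for the class under test, so the example compiles.
    class ReportGenerator {
        String generateMonthlyReport(int year, int month) {
            if (month < 1 || month > 12) {
                throw new IllegalArgumentException("invalid month: " + month);
            }
            return "report-" + year + "-" + month;
        }
    }

    class ReportGeneratorGeneratedTest {

        // Generated template: exercises the method, wraps the declared exception
        // path so the test compiles, and marks the missing oracle explicitly.
        @Test
        void generateMonthlyReport_template() {
            ReportGenerator generator = new ReportGenerator();
            try {
                String result = generator.generateMonthlyReport(2023, 11);
                // TODO(developer): replace with a real assertion on the result.
                fail("Oracle not yet provided; observed result: " + result);
            } catch (IllegalArgumentException expected) {
                // Developer decides whether this exception is the expected outcome.
            }
        }
    }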

2.3.2.2 Risk-driven Testing

Vernotte et al. [189] introduce and report on an original tool-supported, risk-driven security testing process called Pattern-driven and Model-based Vulnerability Testing. This fully automated process, relying on risk-driven strategies and Model-Based Testing (MBT) techniques, aims to improve the detection of various Web application vulnerabilities, in particular SQL injection, Cross-Site Scripting, and Cross-Site Request Forgery. An empirical evaluation shows that the process is suitable for automatically generating and executing risk-driven vulnerability test cases and shows promise for deployment on large-scale Web applications.

Noor and Hemmati [141] define a new risk measure that assigns a high risk factor to a test case if it is similar to test cases that failed in the past. This measure is far more effective at identifying failing test cases than the traditional risk measure. Test cases identified as high-risk can, for example, be run in the background while code is being developed, so that faults are found earlier. Furthermore, prioritizing risky tests when running the entire test suite could make the suite detect failing tests earlier, allowing the developer to start fixing the faulty code right away.
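A minimal sketch of the idea behind such a history-based risk measure is given below; the trace representation (the set of covered methods), the Jaccard similarity and the aggregation are our simplifications, not the exact measure of Noor and Hemmati [141]:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of a history-based risk measure: a test case is considered riskier
    // the more it resembles test cases that failed in past releases. A test is
    // represented by the set of methods its execution trace covers.
    class TestCaseRisk {

        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }

        // Risk = highest similarity to any historically failing test case.
        static double risk(Set<String> candidateTrace, List<Set<String>> failingTraces) {
            return failingTraces.stream()
                    .mapToDouble(failing -> jaccard(candidateTrace, failing))
                    .max()
                    .orElse(0.0);
        }
    }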

2.3.3 (RQ3) What Future Developments Can Be Expected?

This section elaborates on the future developments that can be expected in the field of testing analytics.

2.3.3.1 Co-Evolution and Test Generation

Several studies have been conducted to understand how test and production code co-evolve and how tests can be generated to support developers [122, 152, 196]. Additionally, a tool has been built to analyze and, consequently, better understand test-suite evolution [151]. For the time being, the practical implications of this subtopic have mainly been sought in the repair and generation of tests.

According to Pinto et al. [152], test repairs occur often enough to justify research into and development of automated repair techniques. Mirzaaghaei et al. [135] argue that evolving test cases is an expensive and time-consuming activity, for which automated approaches reduce the pressure on developers. Shamshiri et al. [169] argue that automated unit test generation does not produce realistic tests and that developers using automatically generated tests are no more effective than developers writing tests manually; they therefore call for the generation of more realistic tests. This suggests that automated test generation remains a topic of future interest, and research will likely continue to focus on ways to generate realistic tests.

2.3.3.2 Risk-driven Testing

Risk-driven testing is an area of recent attention. Researchers have been looking for methods that can either detect potential risks within the same project [85, 141, 189] or detect risks based on models carried over from one project to another [10, 111]. These techniques implement history-based prediction approaches.

In the future, we can expect more interest in and research on risk-driven testing, as allocating testing activities effectively will remain important given that testing effort and developer time are expensive. The area will likely stay in its research phase for the next couple of years, as effective measures for risk prediction, both within the same project and across projects, are still being investigated. Given that the techniques currently under study rely on history-based implementations, it is likely that these will be refined in further research.

2.3.3.3 Testing Practices

Several studies [22, 26, 71] have indicated that testing of any form is not as widely practiced as the status quo suggests. How the current state of the practice will change depends on various developments within the field. Tools will be created to assist developers in writing quality code and tests, such as TestEvoHound as suggested by Greiler et al. [76]. As automated test generation becomes more effective, it may reduce the need for developers to spend a lot of time writing and maintaining tests. With the development of risk-driven testing, developers may also be able to focus on the parts that are most important to address, which could lead to better time allocation. Expectations about how much time should be spent on testing may also change, provided automated test repair and generation techniques become effective and accessible.

2.4 Conclusion

In this chapter, three different research questions about software testing analytics were answered. (RQ1) How do developers currently test? (RQ2) What state of the art technologies are being used? (RQ3) What future developments can be expected?

Regarding the current testing practices of developers (RQ1), we found that developers do not seem to update their tests very often, and when they do, it is usually because of a changed condition in the production code. Furthermore, older uncovered production code lines are not likely to ever be covered. Developers thus seem to ignore the indications of their code coverage tools, or do not use a code coverage tool at all. They also do not seem to put much effort into making the co-evolution of their production and test code graceful, although they do make sure their test code compiles when production code classes have been removed. Testing is mostly done in concentrated periods of increased testing activity. TDD also seems to be a confusing term for developers, as there is not enough clear guidance on how to implement it. The actual ratio of test-last development (TLD) to TDD is therefore unknown, but TDD is almost certainly practiced far less than TLD.

The current state of the art in testing analytics (RQ2) consists of research on the co-evolution and generation of tests, and on risk-driven testing. Approaches have been proposed for automatically repairing and generating test cases during software evolution. While fully automated test suite generation is not there yet, a tool has been introduced that generates test templates, guaranteeing compilation, supporting exception handling and finding a suitable location for each test. In the field of risk-driven testing, new risk measures have been defined that make it possible to prioritize high-risk tests when running the entire test suite, which could make the suite detect failing tests earlier.

For future developments (RQ3), further research can be expected on automated test generation. Despite some debate about the effectiveness of test generation, the field currently agrees that research into realistic ways of generating tests is worthwhile. We also found that risk-driven testing has recently received more research attention; this subtopic is still in its research phase, and research on history-based risk prediction methods can be expected to continue.

References

[152] Pinto, L.S. et al. 2012. Understanding myths and realities of test-suite evolution. Proceedings of the acm sigsoft 20th international symposium on the foundations of software engineering (2012), 33.

[104] Kitchenham, B. 2004. Procedures for performing systematic reviews. Keele, UK, Keele University. 33, 2004 (2004), 1–26.

[76] Greiler, M. et al. 2013. Strategies for avoiding test fixture smells during software evolution. IEEE international working conference on mining software repositories (2013), 387–396.

[89] Hurdugaci, V. and Zaidman, A. 2012. Aiding software developers to maintain developer tests. 2012 16th european conference on software maintenance and reengineering (March 2012), 11–20.

[122] Marsavina, C. et al. 2014. Studying fine-grained co-evolution patterns of production and test code. 2014 ieee 14th international working conference on source code analysis and manipulation (Sept 2014), 195–204.

[196] Zaidman, A. et al. 2011. Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining. Empirical Software Engineering. 16, 3 (2011), 325–364.

[66] Eick, S.G. et al. 2001. Does code decay? Assessing the evidence from change management data. IEEE Transactions on Software Engineering. 27, 1 (Jan. 2001), 1–12.

[111] Leung, H.K. and Lui, K.M. 2015. Testing analytics on software variability. Software analytics (swan), 2015 ieee 1st international workshop on (2015), 17–20.

[10] Atifi, M. et al. 2017. A comparative study of software testing techniques.

[85] Hemmati, H. and Sharifi, F. 2018. Investigating nlp-based approaches for predicting manual test case failure. Proceedings - 2018 ieee 11th international conference on software testing, verification and validation, icst 2018 (2018), 309–319.

[141] Noor, T.B. and Hemmati, H. 2015. Test case analytics: Mining test case traces to improve risk-driven testing. Software analytics (swan), 2015 ieee 1st international workshop on (2015), 13–16.

[167] Schneidewind, N.F. 2007. Risk-driven software testing and reliability. International Journal of Reliability, Quality and Safety Engineering. 14, 2 (2007), 99–132.

[189] Vernotte, A. et al. 2015. Risk-driven vulnerability testing: Results from eHealth experiments using patterns and model-based approach.

[27] Bevan, J. et al. 2005. Facilitating software evolution research with kenyon. ESEC/fse’05 - proceedings of the joint 10th european software engineering conference (esec) and 13th acm sigsoft symposium on the foundations of software engineering (fse-13) (2005), 177–186.

[135] Mirzaaghaei, M. et al. 2012. Supporting test suite evolution through test case adaptation. 2012 ieee fifth international conference on software testing, verification and validation (April 2012), 231–240.

[151] Pinto, L.S. et al. 2013. TestEvol: A tool for analyzing test-suite evolution. Proceedings - international conference on software engineering (2013), 1303–1306.

[35] Bowring, J. and Hegler, H. 2014. Obsidian: Pattern-based unit test implementations. Journal of Software Engineering and Applications. 7, 02 (2014), 94.

[62] Dulz, W. 2013. Model-based strategies for reducing the complexity of statistically generated test suites. International conference on software quality (2013), 89–103.

[162] Robinson, B. et al. 2011. Scaling up automated test generation: Automatically generating maintainable regression unit tests for programs. 2011 26th ieee/acm international conference on automated software engineering (ase 2011) (Nov. 2011), 23–32.

[169] Shamshiri, S. et al. 2018. How do automatically generated unit tests influence software maintenance? Software testing, verification and validation (icst), 2018 ieee 11th international conference on (2018), 250–261.

[26] Beller, M. et al. 2015. When, how, and why developers (do not) test in their ides. 2015 10th joint meeting of the european software engineering conference and the acm sigsoft symposium on the foundations of software engineering, esec/fse 2015 - proceedings (2015), 179–190.

[22] Beller, M. et al. 2017. Developer testing in the ide: Patterns, beliefs, and behavior. IEEE Transactions on Software Engineering. 1 (2017), 1–1.

[71] Garousi, V. and Zhi, J. 2013. A survey of software testing practices in canada. Journal of Systems and Software. 86, 5 (2013), 1354–1376.

[136] Moiz, S.A. 2017. Uncertainty in software testing. Trends in software testing. Springer. 67–87.

[23] Beller, M. et al. 2015. How (much) do developers test? Proceedings of the 37th international conference on software engineering - volume 2 (Piscataway, NJ, USA, 2015), 559–562.

[34] Bouwers, E. et al. 2012. Getting what you measure. Commun. ACM. 55, 7 (Jul. 2012), 54–59.

[164] Romano, S. et al. 2017. Findings from a multi-method study on test-driven development. Information and Software Technology. 89, (2017), 64–77.

[20] Beck, K. 2003. Test-driven development: By example. Addison-Wesley Professional.