Machine learning can be used to predict how tests behave on changes in the code. These predictions reduce the feedback time to developers by providing information at check-in time. Marco Achtziger & Dr. Gregor Endler presented how they are using machine learning to learn from failures at OOP 2020.
Determining the real impact of a software change can be hard. Machine learning is a good fit for the problem of identifying the impact that code changes have on test results, as Endler explained:
There’s a large amount of data that automatically arises from the software development and testing process. What’s more, we don’t even need to manually annotate the training data, since what we are trying to predict (the results of test cases) is present in the historical data anyway.
They use data from the software development project to shed light on non-obvious dependencies in the software and to warn about possibly impacted test cases ahead of time. The feature vectors are formed from the metadata about commits, taken from the source control system. The class to predict is the result of a test, taken from the test execution data. The data is bundled by test case, yielding one model per test case and ML algorithm.
How do developers feel about machine learning predicting if their code will make tests fail? Achtziger mentioned the feedback that he got from testers and developers using the system:
There are quite a few people who see the benefit of this approach and use the provided information. However, there are of course people who do not like a system telling them that their changes are not fully correct. But they can be convinced by showing them the benefits of the additional input, and that it is not about them making mistakes but about learning from the information we already have.
InfoQ interviewed Marco Achtziger and Dr. Gregor Endler after their talk at OOP 2020.
InfoQ: What made you decide to use machine learning for determining the impact of software changes?
Marco Achtziger: In the project I am working on, we maintain and evolve a huge software system which is composed of legacy components and newly developed parts. Of course, we also identify the commodity parts (areas in our system we do not want to change without good reason) to keep our developers from spending too much effort in areas that are not part of the innovation we need for our customers. But we wanted to make sure that coding does not produce unwanted side effects when a developer has to work in the commodity area or in a legacy part of the system (where code typically changes less often and impact can be harder to predict).
Having said that, we saw a need to make the test impact prediction more efficient. Since we gather the data of our software development anyway, we thought about utilizing this. As we are talking about tons of data, it is simply no longer possible to do this manually, and that is how techniques like machine learning came into the picture.
Gregor Endler: As Marco already mentioned, the problem is difficult for humans to solve – if test results were always clear, you wouldn’t need to execute the test cases anymore.
InfoQ: What different approaches did you take and how did that work out?
Achtziger: First we tried it very naively by simply throwing all of our data at the algorithms. Of course, we put quite some effort into structuring the data and making the dependencies between the data sources transparent. But in the end, it turned out that this was not really hitting the point. So we had to start over and first think about what question we wanted answered – basically, how tests behave on changes in the code. That caused us to revisit our data and decide what was needed. In the end, we even dropped some data sources and kept only two of the ones we originally thought would be needed. Doing it like this, we arrived at a much simpler interconnection of the data that could answer our question.
Endler: The first approach tried, among other things, to take into account architectural dependencies between pieces of software by building a graph of connected architectural units and associated test results and code changes. Graph algorithms and Bayesian reasoning were then employed to predict test results. While this delivered an interesting view of the application under scrutiny, unfortunately, the predictive power was lacking – the majority of results were false positives. Nevertheless, we still believed in the fundamental approach – so we had another look at the data collection and prediction process and decided on a “more traditional” stab at the problem: using supervised machine learning for a classification task. To evaluate how well the task of predicting test results could be done, we experimented with different data scenarios and various algorithms, e.g. decision trees, random forests, neural networks, and the like.
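The classification setup they describe can be illustrated with a toy example. The sketch below is not the authors' code – the feature, labels, and threshold logic are invented – but it shows the shape of the task by fitting a single-feature decision stump, the simplest relative of the decision trees and random forests mentioned above, to predict a test result from commit size:

```python
# Toy supervised classification for test-result prediction.
# Data is invented; a real setup would evaluate richer features
# with decision trees, random forests, neural networks, etc.
train = [
    # (files_changed, test_passed)
    (1, True), (2, True), (3, True),
    (15, False), (20, False), (25, False),
]

def fit_stump(data):
    """Pick the files_changed threshold that best separates pass from fail."""
    best_thr, best_acc = None, -1.0
    for thr in sorted({x for x, _ in data}):
        acc = sum((x <= thr) == y for x, y in data) / len(data)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr

threshold = fit_stump(train)

def predict(files_changed):
    # True means "this commit is predicted not to break the test"
    return files_changed <= threshold

print(predict(2), predict(30))  # → True False
```

A real pipeline would compare several such models per test case and keep whichever generalizes best.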
InfoQ: What data do you use to predict what test cases might fail?
Achtziger: In the end, we came to the conclusion that we only need the test execution data (what outcome a specific test showed on a specific source code revision) and the source control metadata itself. Something that might seem like a natural fit, such as defect/change management data, is currently not used at all, for example. The benefit is that extracting the needed information from these two data sources is quite simple.
Endler: From a machine learning perspective, the feature vectors necessary for supervised learning are formed of the metadata about commits (e.g. number of files or the average age of files, to name just two). This is taken from the source control system. The class we want to predict is the result of a test, taken from the test execution data. We bundle these data by test case, meaning we get one model for each test case and ML algorithm we evaluate.
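The bundling step can be sketched as follows. This is a minimal pure-Python illustration with made-up record fields (`files_changed`, `avg_file_age_days`), not the authors' actual pipeline:

```python
from collections import defaultdict

# Each historical record joins commit metadata (the features) with one
# test's outcome on that revision (the class label). Field names and
# values are illustrative, not from the actual system.
history = [
    {"test": "test_login",  "files_changed": 12, "avg_file_age_days": 340, "passed": False},
    {"test": "test_login",  "files_changed": 2,  "avg_file_age_days": 15,  "passed": True},
    {"test": "test_export", "files_changed": 12, "avg_file_age_days": 340, "passed": True},
    {"test": "test_export", "files_changed": 7,  "avg_file_age_days": 90,  "passed": True},
]

def bundle_by_test_case(records):
    """Group (feature vector, label) pairs per test case, so each test
    case gets its own training set and hence its own model."""
    bundles = defaultdict(list)
    for r in records:
        features = (r["files_changed"], r["avg_file_age_days"])
        bundles[r["test"]].append((features, r["passed"]))
    return dict(bundles)

bundles = bundle_by_test_case(history)
print(sorted(bundles))             # → ['test_export', 'test_login']
print(len(bundles["test_login"]))  # → 2
```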
Features arise from domain knowledge and intuition, e.g. “larger code changes can break more things”. At this point, the features are just descriptions of a commit, though. Whether they can explain a certain test case’s results depends on the other features present, the total training data, and the algorithms used for learning, and is reviewed by evaluating the machine learning system with separate test and training data.
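Evaluating with separate test and training data can be sketched as a hold-out split. The snippet below uses invented labels and a trivial majority-class baseline; a real evaluation would compare the trained per-test-case models against such a baseline:

```python
# Hold-out evaluation: train on older commits, evaluate on newer ones.
# Labels (True = test passed) are invented for illustration.
labels = [True, True, False, True, True, True, False, True, True, True]

split = int(len(labels) * 0.7)           # chronological 70/30 split
train, test = labels[:split], labels[split:]

# Majority-class baseline: always predict the most common training label.
majority = max(set(train), key=train.count)
accuracy = sum(y == majority for y in test) / len(test)
print(f"baseline predicts {majority}, hold-out accuracy {accuracy:.2f}")
```

A chronological split matters here: shuffling would leak future test behaviour into the training set.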
InfoQ: What do developers and testers think about these predictions?
Endler: One interesting thing to note is the different questions we get when presenting the system at conferences. At a tester-focused conference, questions were (as expected) test case focused, e.g. “Can we judge a test case’s quality by whether it is easily predictable or not?” Developers, on the other hand, tend to ask about the times of feedback cycles and more implementation-focused things like data integration. Overall, the feedback we get after talks tends to be very positive.
InfoQ: What benefits do you get from predicting failing test cases before check-in?
Achtziger: Our main focus currently is the reduction of feedback time to developers regarding their changes. Unfortunately, we have a system where getting feedback from test execution for a particular change can take three to four days (or even more). So there is an obvious benefit in providing this information already at check-in time.
Endler: Even if test execution “only” takes hours, the system can help reduce the time spent re-familiarizing oneself with one’s own code by getting information to developers much quicker. Getting predictions from the finalized models only takes a couple of seconds.
Training the system of course takes its time, but this is done offline and can, for example, run overnight. This also makes it possible to retrain frequently, which is necessary since the behaviour of the tests may change. For example, a test may stabilize since the code region it pertains to gets changed less over time.
There are also some interesting “secondary” benefits to the data integration process necessary for our system, as it delivers quite a lot of information about the way tests and commits interact in a project, e.g. about flaky tests.
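Spotting flaky tests from the integrated data can be as simple as finding tests with conflicting outcomes on the same source revision. A minimal sketch with invented records (the real system's detection may well be more sophisticated):

```python
from collections import defaultdict

# (test name, source revision, passed) tuples; invented for illustration.
executions = [
    ("test_login",  "rev1", True),
    ("test_login",  "rev1", False),  # same revision, different outcome
    ("test_export", "rev1", True),
    ("test_export", "rev2", True),
]

def find_flaky(records):
    """Flag a test as flaky if any single revision shows both outcomes."""
    outcomes = defaultdict(set)
    for test, rev, passed in records:
        outcomes[(test, rev)].add(passed)
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) > 1})

print(find_flaky(executions))  # → ['test_login']
```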