An Industrial Application of Mutation Testing

An Industrial Application of Mutation Testing: Lessons, Challenges, and Research Directions, Ivankovic et al., 2018.

Mutation testing is a technique used to assess test suite effectiveness. The idea is to insert small faults into a program and measure whether the test suite detects the changes. The more inserted faults a test suite detects, the more effective it is.

The paper outlines lessons learned from applying large-scale mutation testing at Google. It is an interesting read to learn how Google has integrated mutation testing into its code review workflow.

I would love to see a mutation testing framework integrated into the code review process of open source projects. Imagine an automated system suggesting a test you should add to improve the test suite for the code in your next pull request on GitHub.

Productive Mutants

A program that has been modified to yield a new program with a fault is called a mutant. If any test in the test suite fails for the mutant, the test suite has detected the fault and the mutant is said to have been killed.

A mutant that isn't detected by the test suite is said to have survived. If a surviving mutant is killable, i.e. not equivalent to the original program, a test which exercises the fault can be added to the test suite to improve its effectiveness.
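These ideas can be sketched in a few lines of Python. This is a hypothetical example, not code from the paper: a mutant survives a test suite that lacks a boundary test, and adding that test kills it.

```python
# Hypothetical example: a surviving mutant, then a test that kills it.

def is_adult(age):
    """Original program."""
    return age >= 18

def is_adult_mutant(age):
    """Mutant: the operator '>=' has been replaced with '>'."""
    return age > 18

def weak_suite(impl):
    """A test suite without a boundary test: passes for both versions."""
    return impl(30) is True and impl(10) is False

def strong_suite(impl):
    """Adding the boundary case age == 18 kills the mutant."""
    return weak_suite(impl) and impl(18) is True

# The weak suite cannot tell the mutant apart, so the mutant survives.
assert weak_suite(is_adult) and weak_suite(is_adult_mutant)

# The strong suite fails for the mutant: the mutant is killed, and the
# suite has become more effective in the process.
assert strong_suite(is_adult) and not strong_suite(is_adult_mutant)
```

The surviving mutant pointed directly at the missing boundary test, which is exactly the feedback loop mutation testing is meant to provide.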

However, a mutant that isn't detected by the test suite doesn't necessarily correspond to a productive test.

Since mutation analysis generates a large number of mutants, it is important to suppress unproductive mutants to keep mutation testing a viable and effective technique.

An example of an unproductive mutant described in the paper is mutating the string message of a logging statement. Replacing the message with an empty string usually yields a mutant that isn't detected by the test suite, but a test which exercises this fault wouldn't be considered effective.
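A minimal sketch of that kind of unproductive mutant, with hypothetical code (the function and logger names are made up for illustration):

```python
# Hypothetical example of an unproductive mutant: mutating a log message.
import logging

log = logging.getLogger("payments")  # assumed module logger

def transfer(amount):
    """Original program."""
    log.info("transferring %d", amount)
    return amount

def transfer_mutant(amount):
    """Mutant: the log message is replaced with the empty string."""
    log.info("")
    return amount

# A test that checks only the return value passes for both versions, so
# the mutant survives -- yet a test asserting on the exact log string
# would not make the suite better at detecting real faults.
assert transfer(100) == 100 and transfer_mutant(100) == 100
```

The mutant survives, but the only way to kill it is to pin the exact log text, which is the kind of brittle, low-value test the paper wants to avoid eliciting.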

The paper proposes a notion of productive mutants:

A mutant is productive if 1) the mutant is killable and elicits an effective test, or 2) the mutant is equivalent but its analysis advances knowledge and code quality.

The first lesson is simply that some mutants should be ignored.

Mutation Testing in Practice

According to the paper, mutation testing is supported for nine languages at Google, including C++, Go, Java, and Python.

Mutation testing at Google is integrated into the code review process, where submitting a change requires approval from reviewers with ownership of the code and expertise in the language.

The code review system surfaces mutants that weren't detected by the test suite to the author and the reviewers.

The author may choose to kill a mutant by adding a test case, to change the code, or to ignore the mutant.

The author and the reviewers can provide feedback on whether a mutant was useful. This feedback is used to determine whether similar mutants are considered productive in future source code changes.

The code review system limits the number of mutants surfaced per added or changed line in a commit. There's also an upper bound on the total number of mutants presented per commit. The rationale is that the surfaced mutants shouldn't clutter or add too much friction to the code review workflow.
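The limiting policy described above can be sketched as follows. The exact limits and selection logic used at Google are not given here; the constants and names below are assumptions for illustration.

```python
# Hedged sketch of surfacing limits: at most one mutant per changed line
# and a cap per commit (the real limits and policy are assumptions here).

MAX_PER_LINE = 1     # assumed per-line limit
MAX_PER_COMMIT = 7   # assumed per-commit cap

def select_mutants(surviving_mutants):
    """surviving_mutants: list of (line_number, mutant_id) pairs,
    in some priority order. Returns the subset to surface in review."""
    per_line = {}
    selected = []
    for line, mutant in surviving_mutants:
        if len(selected) >= MAX_PER_COMMIT:
            break
        if per_line.get(line, 0) < MAX_PER_LINE:
            per_line[line] = per_line.get(line, 0) + 1
            selected.append((line, mutant))
    return selected

# Three mutants on line 10 and one on line 12: only one per line is shown.
assert select_mutants([(10, "m1"), (10, "m2"), (12, "m3"), (10, "m4")]) == [
    (10, "m1"),
    (12, "m3"),
]
```

The interesting design question, which the paper's feedback loop addresses, is how to order the candidates so that the few surfaced mutants are the productive ones.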

The combination of commit-level, selective mutation and unproductive mutant suppression make mutation testing practical in a large-scale software development environment like Google, [...]

These kinds of trade-offs are important and necessary to make mutation testing feasible at Google's scale. The opportunity lies in selecting which mutants to present to the author and the reviewers.

Considering a code repository of 2 billion lines of code and 40,000 commits every day [8], we argue that aiming at mutation adequacy is hopeless.

Based on a study of 1.9 million commits written in four programming languages by more than 30,000 programmers, the paper states that:

Commit-level mutation testing does not add a significant overhead compared to using code coverage analysis in a commit-oriented code-review process. This observation holds for all studied programming languages.

The second lesson is that the unit of work matters: mutation testing must not add overhead to the code review workflow if it is to be productive.

The Costs of Unproductive Mutants

To understand the cost of killing all mutants reported for a program, the paper describes a case study analyzing three real faults from an open source project whose test suite had at least 95% statement coverage.

Achieving a mutation-adequate test set, however, was not a trivial task. Manual review of the mutants not killed by developer or generated tests required an average of 4.6 minutes per mutant, for a total time of 32.8 hours.

Overall, our results are consistent with prior work: the average time to examine each of the first 10 equivalent mutants from each subject (when familiarity with the code was not yet established) was 11.7 minutes.

It's hard work. The third lesson is that killing all mutants is too expensive.