Mutation Testing

Mutation testing is an advanced form of software testing that is used to interrogate the sensitivity of a test suite. The core idea behind mutation testing is that if a test suite is sensitive to errors, it should be able to detect intentionally introduced changes to the source code under test.

Mutation operators are used to create mutant versions of the code under test. Mutants tend to be simple: for example, changing a bounds check (e.g., a >= arr.length to a > arr.length) or negating a boolean condition (e.g., if (isReady) {} to if (!isReady) {}). Mutation testing frameworks can generate tens of thousands of mutants for any program; the main cost in mutation testing is therefore not generating mutants but evaluating them, because each mutant must be evaluated independently against the test suite. The implication of this is that if you have 1,000 mutants and 1,000 test cases, you must perform up to 1,000,000 test executions to evaluate every mutant against every test case.
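
To make the bounds-check example concrete, here is a minimal sketch in Python: a correct linear search, and a hypothetical mutant in which the loop's bounds check < has been mutated to <=. The function names and test values are illustrative, not from any particular framework.

```python
def contains(arr, target):
    """Original: linear search with a correct bounds check."""
    i = 0
    while i < len(arr):  # original bounds check
        if arr[i] == target:
            return True
        i += 1
    return False


def contains_mutant(arr, target):
    """Mutant: the bounds check < has been changed to <=."""
    i = 0
    while i <= len(arr):  # mutated: reads one element past the end
        if arr[i] == target:
            return True
        i += 1
    return False


# A test that exercises the boundary kills this mutant: the original
# returns False, while the mutant raises an IndexError.
assert contains([1, 2, 3], 4) is False
try:
    contains_mutant([1, 2, 3], 4)
    killed = False
except IndexError:
    killed = True  # the failing test detects (kills) the mutant
print("mutant killed:", killed)
```

Note that a test suite that never searches for an absent element would pass on both versions, letting this mutant survive.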

The key assumption behind mutation testing is known as the coupling hypothesis, which posits that human-introduced faults are coupled to (similar to) the kinds of mutants that can be programmatically generated by a mutation testing tool. While it may feel unsatisfying that the complex faults a person could introduce can be approximated by simple mutants, empirical studies have largely supported this hypothesis.

The main process of mutation testing involves:

  1. Run the test suite and ensure all tests pass on an un-mutated version of the system.
  2. Perform the following loops for many mutants:
    1. Introduce a single mutation into the program.
    2. Run the test suite again and see if any tests fail. If a test fails, the mutant is said to be killed. If no test fails, the mutant is said to survive.
  3. The mutation kill score of a test suite is expressed as the fraction of mutants that are killed: kill score = # mutants killed / (# mutants killed + # mutants survived)
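
The process above can be sketched as a small driver loop. This is a toy, assuming the "program" is a single function and each mutant is a hand-written alternative implementation; a real mutation testing framework would generate mutants by rewriting the source code instead.

```python
# Program under test.
def original(a, b):
    return max(a, b)

# Hypothetical mutants: each changes one small detail of the original.
mutants = [
    lambda a, b: min(a, b),      # operator replaced: max -> min
    lambda a, b: max(a, b) + 1,  # constant perturbed: off by one
    lambda a, b: max(a, b, 0),   # constant inserted: wrong for negatives
]

# The test suite: each test takes the function under test and
# returns True on pass, False on failure.
tests = [
    lambda f: f(1, 2) == 2,
    lambda f: f(5, 3) == 5,
]

# Step 1: all tests must pass on the un-mutated program.
assert all(t(original) for t in tests)

# Step 2: evaluate each mutant independently against every test.
killed = survived = 0
for mutant in mutants:
    if any(not t(mutant) for t in tests):
        killed += 1    # some test failed: the mutant is killed
    else:
        survived += 1  # no test failed: the mutant survived

# Step 3: compute the kill score.
kill_score = killed / (killed + survived)
print(f"killed={killed} survived={survived} score={kill_score:.2f}")
```

Here the first two mutants are killed, but the third survives because no test uses negative inputs, giving a kill score of 2/3. A surviving mutant like this points at a concrete gap in the test suite.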

The main goal of mutation testing is to provide insight into the quality of a test suite. This can be handy for understanding the fault-detection ability of a manually-created suite, but also for comparing different automatically-generated test suites against one another.