"How useful is code coverage" - gamification metrics and code quality

Anyone who has ever written tests in JUnit, PHPUnit, PyUnit or Karma knows the joy of hitting that 100% code coverage mark, reflected in reports presented on their CI tool, or generated directly by Clover or Cobertura. It's a rewarding feeling, to be sure...but the terrifying question is just how valuable this 100% code coverage figure is to the quality of our software product.

To objectively answer that question, we must consider:

  1. What does code coverage do (what does it measure)? 
  2. What are we trying to achieve through testing? 
  3. How do we achieve this? 
  4. How does code coverage help us measure our success?
  5. Are there other methods/metrics which give us a more precise measure (either independently, or in concert with code coverage)?

Code coverage, in short, is the proportion of the code under test that was executed during testing; in its most common, line-based form, that is the number of executed lines divided by the total number of executable lines (so tests that execute 42 of 60 executable lines yield 70% coverage). Consequently, in order to achieve 100% coverage, we need to ensure that all of our source code under test is executed during testing.

That seems a natural consequence of writing tests for all of our code, surely? Let's look at an example: a PHP class, whose coverage we would inspect in a Clover HTML report, and its associated unit tests.
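First, a minimal sketch of the class under test, inferred from the tests that follow (the exact implementation is an assumption for illustration):

```php
<?php

// Hypothetical implementation of the class under test, reconstructed
// from the behaviour the tests below exercise; the real code may differ.
class Testable
{
    public function doSomething($value)
    {
        if (is_int($value)) {
            return 1;                // integers always map to 1
        }

        return 'not an integer';     // anything else yields a string
    }
}
```

And the tests themselves: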

```php
<?php

class TestableTest extends PHPUnit\Framework\TestCase
{
    public function testItReturns1WhenAnIntegerIsGiven()
    {
        // Executes the integer branch AND validates its result.
        $this->assertSame(1, (new Testable())->doSomething(5));
    }

    public function testItReturnsAStringWhenANonIntegerValueIsGiven()
    {
        // Executes the non-integer branch but asserts nothing:
        // the code is "covered" without being validated.
        (new Testable())->doSomething('arbitrary string');
    }
}
```

Notice that the second test simply executes the method under test without making any assertions, in contrast to the first. Yet both tests count equally towards the metric: every line of doSomething() is executed, so the report shows 100% coverage.
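You can reproduce this figure yourself; PHPUnit will generate the Clover XML or HTML report directly, assuming a coverage driver such as Xdebug or PCOV is installed (the output paths below are arbitrary):

```sh
vendor/bin/phpunit --coverage-html build/coverage --coverage-clover build/clover.xml
```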

This brings us to the question of what we aim to achieve. Fundamentally, our automated unit and lower-level integration tests aim to validate the logic of our software against the corresponding requirements. This is certainly true of the first test...but the second exposes a deep disconnect between the code coverage metric and our goal: coverage confirms execution, not validation.

This phenomenon exemplifies the mantra that *what you test is more important than how much you test*. We've already established that code coverage measures the "how much", but not necessarily the "what". That said, a lack of coverage on any code under test does indicate that the code was never executed under test, and therefore cannot have been validated against any requirement. In that sense, code coverage is most useful as a negative indicator: low coverage reliably flags unvalidated code, while high coverage guarantees nothing.
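The remedy for the second test is to assert the behaviour its name already promises. A minimal fix might look like this (assuming PHPUnit 7.5 or later, where assertIsString() is available):

```php
public function testItReturnsAStringWhenANonIntegerValueIsGiven()
{
    // The non-integer branch is now both covered AND validated.
    $this->assertIsString((new Testable())->doSomething('arbitrary string'));
}
```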

In the discipline of software testing, coverage-driven testing is considered "weak testing", in contrast to path coverage, which aims to test every conceivable path through the software ("strong testing"; bear in mind that full path coverage subsumes code coverage). At the level of unit testing, following the TDD methodology will naturally drive towards path coverage, since each behaviour is specified by a failing test before it is implemented; but as we progress through the degrees of integration testing, achieving path coverage becomes infeasible, as the number of distinct paths grows combinatorially with every component we integrate.
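To make the weak/strong distinction concrete, here is an illustrative sketch (the class and its tests are invented for this post): two tests that achieve 100% line coverage while exercising only two of the four possible paths.

```php
<?php

class Discounter
{
    public function price(float $base, bool $member, bool $sale): float
    {
        $price = $base;
        if ($member) {
            $price -= 5.0;    // flat member discount
        }
        if ($sale) {
            $price *= 0.9;    // 10% sale reduction
        }
        return $price;
    }
}

class DiscounterTest extends PHPUnit\Framework\TestCase
{
    public function testMemberDiscountIsApplied()
    {
        $this->assertSame(95.0, (new Discounter())->price(100.0, true, false));
    }

    public function testSaleDiscountIsApplied()
    {
        $this->assertSame(90.0, (new Discounter())->price(100.0, false, true));
    }

    // Every line of price() is now executed (100% line coverage), yet the
    // path where both discounts compound (member AND sale) and the path
    // where neither applies were never tested: only 2 of 4 paths covered.
}
```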

This leads to a final question: what is the appeal of achieving 100% coverage, versus writing concise, multiple-path-covering assertions that focus on the "what" rather than the "how much"?

The answer is simple. Code coverage - being very easily quantifiable - is an easy way to gamify the writing of tests, and gamification is a proven mechanism for challenging people to complete things; much in the same way, the "collectathon" genre of video games - the later instalments of Assassin's Creed, for example - encourages players to complete a set of something that would otherwise seem meaningless.

This is a powerful tool with which software engineers can be motivated, but just as collecting feathers in the aforementioned games is more of a "side quest" alongside completing the campaign story, achieving high code coverage measures part of your overall progress, not the whole picture. Concise assertions, the appropriate application of unit and integration tests (of varying degrees) and - ultimately - confidence in the software you are building can be just as satisfying. What you test is more important than how much you test.