The Great Code Coverage Holy Wars of the 21st Century
Lessons learned over 25 years of testing software at Microsoft, Amazon and Google
Back at Amazon (where I worked from 2009 to 2020), occasionally a poor unsuspecting soul would mention the dreaded words "code coverage" in one of the large internal discussion groups. Every single time, these two seemingly innocent words would start a holy war, with passions flaring in a fashion comparable to "vi versus emacs." It didn't matter whether the original comment was pro- or anti-coverage. The ensuing centithread would get ratholed into dozens of generally useless, dogmatic emails.
I had been a voice in the code coverage space since 2012, when I created a company-wide tool that helped Amazon be more disciplined about measuring coverage. The tool started to shift some of the culture around unit tests and code coverage, and it was recognized by Jeff Bezos himself with a Just Do It Award. Then, for my last six years at Amazon, I was a Principal Engineer in Builder Tools, the organization that owned all the internal, company-wide Engineering Productivity tools. One of these augmented the code review tool with code coverage information, so that you could see which lines were covered as you reviewed your peers' code changes. In all this time owning tooling and driving culture, I saw the good, the bad and the ugly of code coverage.
So for a while, I attempted to mediate some of these strong opinions in the group discussions and steer the conversation away from the academic and toward more pragmatic, actionable talking points. Eventually, I realized that I was saying pretty much the same things over and over again, so I started keeping track of these talking points. They were generally an attempt to bridge the gap between the two camps: acknowledging some of the concerns of the anti-coverage crowd as legitimate, while offering the pro-coverage crowd practical, useful ways to move forward.
About two years ago I left Amazon and joined Google. During one of my first 1:1s with my Director, I described these talking points. She smiled and said, "We have the same arguments here; you should write down your thoughts in a whitepaper. In fact, maybe you could even publish it on our external blog!" I knew the Google Testing Blog well, because I had been following it for a decade. I was excited that, now that I worked here, I had the opportunity to publish my thoughts in it, as it's read by tens of thousands of engineers in the industry.
So I jotted down my views, and shared the whitepaper pretty broadly within Google. As expected, I received an enormous amount of feedback, some useful, some not. At first, I was a bit overwhelmed by the comments, and unsure I could ever reconcile enough of them to publish the paper with broad support.
Two other Senior Staff Engineers (Adam and Marko) responded with a whole bunch of comments on the doc. The three of us disagreed on plenty of details, but I could see there was room for common ground because we shared a lot of fundamental views and desired outcomes. I was brand new to Google, working remotely, lonely in the middle of a pandemic, and excited to meet Googlers, so I scheduled some 1:1 time with both of them to chat about the comments. Clearly they had the same passion; they were smart and knowledgeable, and, like me, they had been thinking deeply about the space for a long time.
Which leads me to one of the takeaways from my blog today. While it seems silly to write a blog about writing a blog, I do think there are some takeaways in the "making-of" story. So here goes. I strongly believe that if you want to create a better anything (whitepaper, code, design, policy, …) you need to seek dissenting opinions, because [a] it forces you to articulate your case better, and [b] sometimes these dissenting views end up changing your own. At the opposite end of the spectrum, time and time again you see terrible politicians surrounding themselves with yes-men, because life is easier that way. The great ones surround themselves with people who will challenge them.
So as overwhelmed as I was by the prospect of dealing with all the comments I had received, I knew I would end up with a much stronger writeup if I chose the harder path of seeking dissent and driving consensus. Adam, Marko and I continued to meet for a few weeks, and continued to exchange many comments over email and Google Docs. They generously donated their time, and in the process we became friends, comfortable trusting each other and learning from each other. The final doc was a result of our collaboration more than my initial individual work, so I wanted them to be my co-authors. We published it in August 2020.
I thought that was the end of my story, but I recently learned the blog was mentioned in the book Effective Software Testing: A Developer's Guide by Maurício Aniche. It's a great book, by the way, and I really enjoyed seeing my name mentioned in a book for posterity!
Without further ado, here’s the whitepaper in all its glory:
Code Coverage Best Practices
Originally published Friday, August 07, 2020 on the Google Testing Blog
By Carlos Arguelles, Marko Ivanković, and Adam Bender
We have spent several decades driving software testing initiatives at various very large software companies. One of the areas we have consistently advocated for is the use of code coverage data to assess risk and identify gaps in testing. However, the value of code coverage is a highly debated subject with strong opinions, and a surprisingly polarizing topic. Every time code coverage is mentioned in any large group of people, seemingly endless arguments ensue. These tend to lead the conversation away from any productive progress, as people bunker down in their respective camps. The purpose of this document is to give you tools to steer people at all ends of the spectrum toward common ground, so that you can move forward and use coverage information pragmatically. We put forth best practices in the domain of code coverage to work effectively with code health.
Code coverage provides significant benefits to the developer workflow. It is not a perfect measure of test quality, but it does offer a reasonable, objective, industry-standard metric with actionable data. It does not require significant human interaction, it applies universally to all products, and there are ample tools available in the industry for most languages. You must treat it with the understanding that it's a lossy and indirect metric that compresses a lot of information into a single number, so it should not be your only source of truth. Instead, use it in conjunction with other techniques to create a more holistic assessment of your testing efforts.
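To illustrate how cheap the metric is to collect, here is a minimal sketch using Python's coverage.py, one of the many tools available; the `pricing` module and its function are hypothetical stand-ins for your code under test:

```python
# Minimal sketch: collect and report statement coverage with coverage.py
# (pip install coverage). The "pricing" module is a hypothetical example.
import coverage

cov = coverage.Coverage()
cov.start()

# Exercise the code under test -- normally your whole test suite runs here.
import pricing
pricing.discounted_price(100.0, is_member=True)

cov.stop()
cov.save()
cov.report(show_missing=True)  # per-file coverage plus the missed line numbers
```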
It is an open research question whether code coverage alone reduces defects, but our experience shows that efforts to increase code coverage can often lead to culture changes in engineering excellence that in the long run reduce defects. For example, teams that give code coverage priority tend to treat testing as a first-class citizen, and tend to bake stronger testability into their product design so that they can achieve their testing goals with less effort. All of this in turn leads to writing higher-quality code to begin with (more modular designs, cleaner contracts in APIs, more manageable code reviews, etc.). They also start caring more about their overall code health, and about engineering and operational excellence.
A high code coverage percentage does not guarantee high quality in the test coverage. Focusing on getting the number as close as possible to 100% leads to a false sense of security. It could also be wasteful, burning machine cycles and creating technical debt from low-value tests that now need to be maintained. Bad code being pushed to production due to missing tests could happen either because (a) your tests did not cover a specific path of code, a test gap that is easy to identify with code coverage analysis, or (b) your tests did not cover a specific edge case in an area that did have code coverage, which is difficult or impossible to catch with code coverage analysis. Code coverage does not guarantee that the covered lines or branches have been tested correctly; it only guarantees that they have been executed by a test. Be mindful of copy/pasting tests just for the sake of increasing coverage, or adding tests with little actual value, to comply with the number. A better technique to assess whether you're adequately exercising the lines your tests cover, and adequately asserting on failures, is mutation testing.
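To make that distinction concrete, consider this small sketch (function and test names are hypothetical). Both tests execute every line of `apply_discount`, so both yield 100% coverage, but only the second would ever catch a bug. Mutation testing exposes the first by planting a small deliberate bug (a "mutant") and noticing that no test fails:

```python
# Hypothetical function under test.
def apply_discount(price: float, rate: float) -> float:
    return price - price * rate

# Executes every line (100% coverage) but asserts nothing, so it can
# never fail: a mutant that flips '-' to '+' would survive this test.
def test_covers_but_verifies_nothing():
    apply_discount(100.0, 0.25)

# Kills that mutant: with '+' the result would be 125.0 and this fails.
def test_covers_and_verifies():
    assert apply_discount(100.0, 0.25) == 75.0
```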
But a low code coverage number does guarantee that large areas of the product are going completely untested by automation on every single deployment. This increases the risk of pushing bad code to production, so it should receive attention. In fact, much of the value of code coverage data lies in highlighting not what's covered, but what's not.
There is no “ideal code coverage number” that universally applies to all products. The level of testing you want/need for a set of code should be a function of (a) business impact/criticality of the code; (b) how often you will need to touch/change the code; (c) how much longer you expect the code to live, its complexity, and domain variables. We cannot mandate that every single team hit x% code coverage; this is a business decision best made by the owners of the product with domain-specific knowledge. Any mandate to reach x% code coverage should be accompanied by infrastructure investments to make testing easy, such as integrating tools into the developer workflow. Be mindful that engineers may start treating your target like a checkbox and avoid increasing coverage beyond the target, even if doing so would be prudent.
In general, code coverage for many products is below the bar; we should aim to significantly improve code coverage across the board. Although there is no “ideal code coverage number,” at Google we offer the general guidelines of 60% as “acceptable,” 75% as “commendable” and 90% as “exemplary.” However, we like to stay away from broad top-down mandates and encourage every team to select the value that makes sense for their business needs.
We should not obsess over how to get from 90% code coverage to 95%. The gains of increasing code coverage beyond a certain point are logarithmic. But we should take concrete steps to get from 30% to 70%, and always make sure new code meets our desired threshold.
More important than the percentage of lines covered is human judgment about the actual lines of code (and behaviors) that aren't being covered (analyzing the gaps in testing) and whether that risk is acceptable. What's not covered is more meaningful than what is covered. Pragmatic discussions over specific uncovered lines of code during the code review process are more valuable than over-indexing on an arbitrary target number. We have found that embedding code coverage into your code review process makes code reviews faster and easier. Not all code is equally important; for example, testing debug log lines is often less important, so when developers can see not just the coverage number but each covered line highlighted as part of the code review, they will make sure that the most important code is covered.
Just because your product has low code coverage doesn’t mean you can’t take concrete, incremental steps to improve it over time. Inheriting a legacy system with poor testing and poor testability can be daunting, and you may not feel empowered to turn it around, or even know where to start. But at the very least, you can adopt the ‘boy-scout rule’ (leave the campground cleaner than you found it). Over time, and incrementally, you will get to a healthy location.
Make sure that frequently changing code is covered. While project-wide goals above 90% are most likely not worth it, per-commit coverage goals of 99% are reasonable, and 90% is a good lower threshold. We need to ensure that our tests are not getting worse over time.
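As an illustration of a per-commit (new-code-only) check, here is a hedged sketch built on coverage.py's data API. The changed-line input would come from parsing your diff; the file path and threshold are hypothetical, and off-the-shelf tools such as diff-cover implement the same idea:

```python
# Sketch of a per-commit coverage gate: intersect the lines a commit
# touched with the lines the test run actually executed. Assumes the
# test run already produced a .coverage data file via coverage.py.
import coverage

THRESHOLD = 0.90  # lower threshold for changed lines, per the guideline above

def changed_line_coverage(changed_lines: dict[str, set[int]]) -> float:
    data = coverage.CoverageData()
    data.read()  # load the .coverage file from the test run
    touched = covered = 0
    for filename, lines in changed_lines.items():
        executed = set(data.lines(filename) or ())
        touched += len(lines)
        covered += len(lines & executed)
    return covered / touched if touched else 1.0

# Hypothetical commit touching lines 10-14 of pricing.py.
ratio = changed_line_coverage({"/src/pricing.py": {10, 11, 12, 13, 14}})
if ratio < THRESHOLD:
    raise SystemExit(f"Changed-line coverage {ratio:.0%} is below {THRESHOLD:.0%}")
```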
Unit test code coverage is only a piece of the puzzle. Integration/system test code coverage is important too, and the aggregate view of the coverage from all sources in your pipeline (unit and integration) is paramount, as it gives you the bigger picture of how much of your code is not exercised by your test automation on its way to a production environment. One thing to be aware of is that while unit tests have a high correlation between executed and evaluated code, some of the coverage from integration tests and end-to-end tests is incidental rather than deliberate. Still, incorporating code coverage from integration tests can help you avoid a false sense of security that code left uncovered by your unit tests is nonetheless covered by your integration tests.
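As an illustration, many coverage tools can merge data from separate runs into one aggregate report. Here is a sketch with coverage.py, assuming each suite ran under coverage separately (the data file names are hypothetical):

```python
# Sketch: merge coverage data files from separate unit and integration
# test runs into a single aggregate, pipeline-wide report.
import coverage

cov = coverage.Coverage()
# combine() merges the named data files into one data set.
cov.combine([".coverage.unit", ".coverage.integration"])
cov.save()
cov.report(show_missing=True)  # the aggregate view across both suites
```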
We should gate deployments that do not meet our code coverage standards. Teams should debate and decide which gating mechanism makes sense for them. You should, however, be careful that the gate doesn't turn into a checkbox that must be ticked, as that can backfire (pressure to 'hit the metric' almost never yields the desired outcome). There are many mechanisms available: gating on coverage for all code vs. new code only; gating on a specific hard-coded coverage number vs. the delta from the prior version; ignoring or focusing on specific parts of the code. Whichever you choose, commit to upholding it as a team. Drops in code coverage that violate the gate should prevent the code from being checked in and reaching production.
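As one example, here is a hedged sketch of the "delta from prior version" mechanism; storing the baseline as a plain text file, along with the names and numbers, is a hypothetical simplification:

```python
# Sketch of a no-regression coverage gate: fail the deployment if overall
# coverage dropped versus the previous accepted build, beyond a tolerance.
import pathlib

TOLERANCE = 0.5  # percentage points of allowed fluctuation

def gate(current_pct: float, baseline_file: str = "coverage_baseline.txt") -> None:
    path = pathlib.Path(baseline_file)
    baseline_pct = float(path.read_text()) if path.exists() else 0.0
    if current_pct + TOLERANCE < baseline_pct:
        raise SystemExit(
            f"Coverage fell from {baseline_pct:.1f}% to {current_pct:.1f}%; blocking release."
        )
    path.write_text(f"{current_pct:.1f}")  # promote this build's number to baseline

gate(current_pct=72.4)  # in practice, read this from your coverage tool's report
```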