When Cats Look at Themselves in a Mirror they See Lions

How I changed the way we treat Code Coverage at Amazon, and got to meet Jeff Bezos!

Carlos Arguelles
Geek Culture


I’ve spent a good chunk of my professional life at Google, Amazon and Microsoft focusing on engineering productivity tooling and processes. I’m fascinated by how a little bit of toil incurred here and there by each engineer aggregates to millions of dollars of productivity lost for a large software company. And also, I’m fascinated by a simple tool’s ability to change culture at large. This is a story of how a crappy little tool I wrote ended up changing a lot of the culture around unit testing and particularly test coverage at Amazon.

Before I go too much further: the value (or lack thereof) of code coverage is a highly debated topic, with passion on both sides. I find that most discussions end up being philosophical, dogmatic and pedantic, and I’m a pragmatic guy. I believe coverage is a valuable tool when used with caveats and common sense. About a year ago I wrote a blog post for the official Google Test community that articulates my thoughts, and I’d love for you to read it.

Around 2011, getting busy Amazon developers to write proper unit tests was like pulling teeth. People wrote unit tests, but Amazon didn’t have much in terms of metrics or gates to enforce discipline. You wrote your code, you wrote some unit tests that more or less looked good, you sent the code review, your reviewer eyeballed the unit tests by squinting real hard, and that was about it.

Fast forward to 2021, and the culture around unit tests and code coverage at Amazon has changed dramatically. Code coverage is ubiquitous.

  • Pre-submit. The build file for each project has a default target that builds the code, runs the unit tests, and gets code coverage. You can add rules to fail the build if code coverage drops below a certain hard number.
  • Code Review. As a reviewer you can see which lines in the proposed changes are covered by tests and which aren’t. Rather than getting hung up on a dogmatic “hey your code coverage is below x%” you can have a more meaningful, pragmatic and actionable conversation: “hey I noticed this specific line of code isn’t covered by your tests, I think it’s important enough you should cover it because…”
  • Post-submit. There is also a way to add gating around code coverage for an entire service in its CI/CD setup. This gate focuses more on deltas than on hard-coded numbers (so, you just checked in: did you make things better or worse? Boy Scout rule). [ Side note: If you want to learn more about the way Amazon does CI/CD, here are two great blog posts by good friends: Continuous improvement and software automation and Automating safe, hands-off deployments. ]
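To make the three checkpoints above concrete, here is a minimal Python sketch. It is not Amazon's build or CI/CD tooling: `CoverageReport`, `passes_threshold_gate`, `passes_delta_gate` and `uncovered_changed_lines` are hypothetical names, and plain line coverage is assumed as the metric.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CoverageReport:
    """Line coverage for one build (hypothetical shape, for illustration)."""
    covered_lines: int
    total_lines: int

    @property
    def percent(self) -> float:
        # Treat an empty project as fully covered to avoid dividing by zero.
        if self.total_lines == 0:
            return 100.0
        return 100.0 * self.covered_lines / self.total_lines


def passes_threshold_gate(report: CoverageReport, minimum_percent: float) -> bool:
    """Pre-submit style gate: fail when coverage is below a hard number."""
    return report.percent >= minimum_percent


def passes_delta_gate(before: CoverageReport, after: CoverageReport) -> bool:
    """Post-submit style gate: fail only if the check-in made coverage worse."""
    return after.percent >= before.percent


def uncovered_changed_lines(changed: set[int], covered: set[int]) -> list[int]:
    """Code-review style view: changed lines that no test exercised."""
    return sorted(changed - covered)
```

With made-up numbers, a change that moves a project from 70 of 100 lines covered to 72 of 104 would still pass a hard 65% threshold, but it would fail the delta gate because coverage slipped from 70% to roughly 69.2%. That is the "did you make things better or worse?" question in code form.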

So, how did Amazon culture change so much in a decade? A lot of factors. The industry started paying more attention to unit tests, and code coverage as a way to be more disciplined about unit testing. Tooling got better. And many people like me advocated relentlessly.

My code coverage story starts in 2011. I was trying to help a team at Amazon identify and understand test gaps. The team owned about a dozen services, with code distributed among hundreds of Java packages, so it was hard to get the “big picture” of testedness and identify testing gaps. I wrote a little crawler that, given a list of packages, fetched the official code coverage information from unit tests and made an html report. It was rudimentary, but it gave us a nice “this is your entire world” view. I started showing it regularly at team meetings, and developers around me seemed to like it and started paying attention to it.
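The original crawler isn't shown here, so the following is only a rough Python sketch of the idea: take a list of packages, look up coverage for each, and roll everything into one HTML table. `build_report`, the `fetch_coverage` callback and the `(covered_lines, total_lines)` shape are all assumptions made for illustration.

```python
from html import escape
from typing import Callable, Iterable, Tuple

# (covered_lines, total_lines) for one package; a hypothetical shape.
Coverage = Tuple[int, int]


def _row(name: str, covered: int, lines: int) -> str:
    """Render one table row, guarding against packages with no lines."""
    percent = 100.0 * covered / lines if lines else 0.0
    return (f"<tr><td>{escape(name)}</td><td>{covered}</td>"
            f"<td>{lines}</td><td>{percent:.1f}%</td></tr>")


def build_report(packages: Iterable[str],
                 fetch_coverage: Callable[[str], Coverage]) -> str:
    """Fetch coverage for each package and render a single HTML table."""
    rows = []
    total_covered = total_lines = 0
    for name in sorted(packages):
        covered, lines = fetch_coverage(name)
        total_covered += covered
        total_lines += lines
        rows.append(_row(name, covered, lines))
    # The last row is the "entire world" view: a line-weighted total.
    rows.append(_row("ALL PACKAGES", total_covered, total_lines))
    header = ("<tr><th>Package</th><th>Covered</th>"
              "<th>Lines</th><th>Coverage</th></tr>")
    return "<table>\n" + header + "\n" + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    # Made-up numbers standing in for a real coverage service.
    fake = {"OrderService": (450, 600), "OrderClient": (30, 200)}
    print(build_report(fake, fake.__getitem__))
```

The total is weighted by line count rather than averaged per package, so one tiny, well-tested package can't mask a large, untested one.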

Since the reception had been positive, I thought, maybe I should advertise it more broadly to other teams around me. My first script had a hard-coded list of packages, so I had to clean it up a bit before going broader. Over the next few months, I spent a significant amount of my free time working on little improvements. I modified it to call the code ownership service to dynamically fetch the packages that a team owned, so that newly created packages magically showed up, and added wildcards to include or exclude packages. I added the ability to send an email with the html report. I socialized it with teams under my VP, and waited. And waited. And waited. Nobody onboarded! It was a total and utter failure. I felt I had wasted my time making my tool more feature-rich. I believed people should care about measuring their unit tests with data and discipline, and I was sad when the appetite didn’t seem to be there.
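As an illustration of the include and exclude wildcards mentioned above, here is a small Python sketch using shell-style patterns from the standard `fnmatch` module. The function name `select_packages`, the pattern syntax and the package names are assumptions; the real tool's matching rules aren't described.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_packages(owned: Iterable[str],
                    include: Iterable[str] = ("*",),
                    exclude: Iterable[str] = ()) -> List[str]:
    """Keep packages matching any include pattern and no exclude pattern."""
    include = list(include)
    exclude = list(exclude)
    return sorted(
        name for name in owned
        if any(fnmatchcase(name, pattern) for pattern in include)
        and not any(fnmatchcase(name, pattern) for pattern in exclude)
    )


if __name__ == "__main__":
    # Hypothetical result of asking an ownership service what a team owns.
    owned = ["OrderService", "OrderServiceTests", "OrderClient", "LegacyTool"]
    print(select_packages(owned, include=["Order*"], exclude=["*Tests"]))
```

Because the owned list is fetched fresh on every run, a newly created package that matches an include pattern would show up in the next report without anyone editing a config file.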

My first attempt at socializing my tool was a failure…

So my tool sat there, gathering dust and spiderwebs for a couple of years.

What happened next was a simple twist of fate. My boss Claire went on maternity leave for six months and she asked me to hold down the fort while she was out. I had zero interest in being a manager, but I figured this was a finite commitment, with an exact start and end date. I would learn a thing or two about management at Amazon, and then go back to being an individual contributor when she came back. Claire had a tough and thankless job that I had now inherited. I was the QA Manager for a VP with an organization of about 40 teams, and I had 20 engineering productivity engineers (SDETs) under me. Essentially, my first order of business was matching 20 engineers to 40 teams. Some teams were going to be without a dedicated SDET, or have a partial one. But which teams? Every team felt they were important enough to get one.

I went on an information gathering spree. I had coffee with the manager of each one of these teams, and asked lots of questions to assess the team’s maturity around test practices, to figure out where I would get the best ROI for my 20 engineers. When I finished, I realized that each manager had told me roughly the same thing: everything was great. I had committed the cardinal sin of getting a subjective assessment. I needed an objective assessment, with facts and data, not opinions and feelings.

One of the things I needed was to understand test gaps in the entire org. That’s when I remembered that old code coverage aggregator tool I had written. I dusted off the code, pointed it at the entire org, and I got a pretty html report that showed me just how dire the situation was.

What we had here was a case of a cat looking at himself in the mirror, and seeing a lion.

Some teams (including some of the teams that had insisted their testing was great) had very low code coverage in their unit tests! It was concerning. I showed the data to the same managers that had told me everything was great, and they were shocked! They had no idea. I could prove with data that large chunks of critical code were going to production without being exercised by any tests. Nobody had even thought about taking a methodical, data-driven approach to measuring coverage across an entire org like that. I could even show historical trends to show them that their coverage had been getting progressively worse as engineers added code!

It wasn’t malicious. Engineers wanted to do the right thing. They simply lacked the data to surface the fact that they weren’t doing the right thing. Once I started publishing the reports to VP and managers, there was a new emphasis on improving test coverage.

That was the spark I needed to (re)light the fuse. My thing was useful! I should try again to socialize it with a broader audience. My main miss the first time around was that the onboarding story was awful. I cringed as I re-read the instructions I had written two years ago, now having the benefit of some time elapsed. The instructions were so bad, I wouldn’t even onboard onto my own tool! So I focused on making the onboarding story ridiculously simple.

I made three simple technical decisions that would come back to haunt me in the most spectacular way in a few months. I had to make it faster and more reliable. One of the things I was doing was calling an RPC service to fetch the official code coverage numbers from the last successful checked-in build at the head of the main branch. I noticed the call was pretty slow, and running the tool against hundreds of services under my VP took a couple of hours (and sometimes failed halfway through). “Aha! Multithreading to the rescue!” I thought, proud of myself. So I threw the problem at a threadpool with 16 threads. Never mind that the calls were slow: I was making 16 of them concurrently, so it was much faster. I was also annoyed by the fact that these RPC calls sometimes failed. “Aha! Retries to the rescue!” I thought, proud of myself again. So I added 3 retries with exponential backoff. I now had a script that was fast(ish) and reliable(ish). Lastly, how could I make sure my tool ran automatically on a cadence? I added a default to run every day as a cronjob at midnight, Seattle time.

(a) Running in a threadpool and (b) adding retries… seemed like good ideas…
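For readers who haven't combined these two patterns before, here is a minimal Python sketch of a 16-thread pool wrapped around a call that retries with exponential backoff. It is not the original script: `fetch_with_retries`, `fetch_all` and the one-second base delay are hypothetical, and "3 retries" is read here as three attempts in total, which is an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple

# (covered_lines, total_lines) for one package; a hypothetical shape.
Coverage = Tuple[int, int]


def fetch_with_retries(fetch: Callable[[str], Coverage], package: str,
                       max_attempts: int = 3, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Coverage:
    """Try the call up to max_attempts times, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(package)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, then 2s, then 4s, ...
    raise ValueError("max_attempts must be at least 1")


def fetch_all(fetch: Callable[[str], Coverage], packages: Iterable[str],
              workers: int = 16,
              sleep: Callable[[float], None] = time.sleep) -> Dict[str, Coverage]:
    """Fetch every package concurrently on a fixed-size thread pool."""
    names = list(packages)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda name: fetch_with_retries(fetch, name, sleep=sleep), names)
        return dict(zip(names, results))
```

Each piece is reasonable on its own. The catch is that both multiply the number of requests a single client can send, and neither knows anything about how many other clients are doing the same thing at the same moment.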

The changes I made to simplify the onboarding experience really tipped the scale. I don’t know what it was about my crappy little tool, but it started spreading like wildfire. I think it just filled a need people had; it was the serendipity of the right tool at the right time. The tool itself was unremarkable. But it really created awareness around gaps that people didn’t have before, and it gave engineers actionable data. It built enthusiasm. It really did change the way people thought about code coverage and unit tests. The cultural changes the tool drove were fascinating. It was adopted by hundreds of teams in the first few months alone, and these teams significantly improved their code coverage numbers in those months.

The most unexpected person noticed these cultural changes. Jeff Bezos! Amazon has a coveted internal award called the Just Do It Award. Every year Jeff Bezos himself picked an employee or two who exemplified the core values of innovation and bias for action, by creating something impactful outside their day job. Jeff decided to give the Just Do It Award of 2013 to my crappy little tool. Here’s the story of that day… meeting Jeff was one of the highlights of my professional life.

Jeff and I, Just Do It Award, 2013
Brian Valentine on stage announcing the Just Do It Award 2013!
2013 Amazon Company Meeting… stadium filling up!

I told you some simple technical decisions would come back to haunt me in the most spectacular way. That night was the night!

The Just Do It Award is the biggest spotlight you can possibly shine on a thing at Amazon. There were thousands of people in that stadium as I went up on stage, and it was being broadcast live worldwide to all satellite offices. A fair number of them were intrigued, and decided to onboard to see what the report looked like for their teams. Onboarding was so simple, right? You just typed a command, and the thing installed itself as a cronjob on your machine. Unknowingly, I had created a ticking time bomb. At exactly midnight PST, thousands of newly created cronjobs woke up all over Amazon and started running my script. Yes, all at the same time. Not just that, but each and every one of those scripts started running its 16 concurrent threads, so the problem was 16 times worse. When the RPC service that returned the official code coverage metrics got this sudden spike in traffic, some of these calls started failing. I had built just the thing to help with failures: retries! So between multithreading and retries, the problem wasn’t 16 times worse, it was up to 48 times worse. All in all, that poor unsuspecting RPC service got 1000x normal traffic. It bent over and died an unceremonious death, paging all kinds of people who were sleeping at midnight in Seattle. Troubleshooting was a nightmare, because as far as they were concerned, there were thousands of unrelated machines suddenly calling an obscure API that nobody had really called before. It wasn’t just one poorly behaved client, it was thousands, all over Amazon! And just like that, they stopped, leaving the poor oncall puzzled but relieved that the weird denial of service attack had stopped.

The next day, at midnight PST, the exact same thing happened. I brought down the service a second time.

How those simple technical choices put 1000x the normal load onto my dependency (ooops)
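The multiplication behind that picture is simple enough to write down. The Python sketch below uses the 16 threads and 3 attempts per call from the story; the function names and the figure of 2,000 cronjobs are made up for illustration, since the post only says "thousands". If the first call were counted separately from the three retries, the per-client figure would be 64 rather than 48.

```python
def per_client_multiplier(threads: int, attempts_per_call: int) -> int:
    """Worst-case load from one client versus a single-threaded, no-retry client."""
    return threads * attempts_per_call


def worst_case_attempts(cronjobs: int, threads: int, attempts_per_call: int) -> int:
    """RPC attempts issued if every thread on every machine exhausts its attempts."""
    return cronjobs * per_client_multiplier(threads, attempts_per_call)


if __name__ == "__main__":
    print(per_client_multiplier(threads=16, attempts_per_call=3))
    # 2,000 cronjobs is a hypothetical count, not a number from the story.
    print(worst_case_attempts(cronjobs=2000, threads=16, attempts_per_call=3))
```

None of the three factors looks alarming by itself. The trouble comes from multiplying them together and then firing every client at the same instant.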

I think it took 3 days to finally track me down. Their Principal Engineer gave me a stern lecture (we would after that become really good friends!). The technical decisions I had made weren’t terrible… multithreading and retries are common. But I had failed to think about how some of these were going to behave at scale. Most importantly, I had never bothered to grab time with the team that owned that unsuspecting RPC service and tell them I was going to be using them extensively. As it turns out, that RPC call was extremely expensive, and had I bothered to talk to them, I would have learned there was a much simpler, faster, cheaper RPC call to achieve the same. Always talk to your dependencies, they are your partners for life!

The tool I created is long defunct. And that’s ok: it was the right tool for the times, but other (better) tools followed. Other engineers picked up the torch. Having had Jeff Bezos shine the spotlight on it lent a lot of credibility to the space. My tool grandfathered a slew of innovation, and that’s how today code coverage is embedded into the code review tool (“Coverlay”), and the CI/CD tool (“Code Coverage Police”), at Amazon. Some of the code I wrote for my crappy tool lives on in those other tools today. Code coverage went from being a second-class citizen to being something a lot of people cared about in the company.

A tool is never just a tool. Think long and hard about its ability to change culture, refocus attention, and create enthusiasm around you.

Credit here. Goodhart’s law is a reason Code Coverage conversations can get ratholed sometimes! But I figured engineers are smart and given the right data will do the right thing.




Hi! I'm a Senior Principal Engineer (L8) at Amazon. In the last 26 years, I've worked at Google and Microsoft as well.