How could teams with high code coverage have high operational load???
Stories about succeeding and failing to measure the right thing
All 3 companies that I’ve worked for in my 25 years in the industry (Google, Amazon, and Microsoft) are disciplined in gathering data and data-driven in their business decisions. Today, I wanted to combine three little vignettes from my life that cover interesting aspects of succeeding and failing to measure the right thing.
 I learned what weasel words were while writing my own promo doc
In 2011, I told my boss I believed I was ready to be promoted to Senior Engineer. “OK, write your promo justification!” he said. So I fired up Microsoft Word, wrote up a couple of pages and emailed it to him. The next day, he emailed me back the document, with probably 40 or 50 highlighted sentences.
The only comment was “weasel words!”
I had no idea what that meant, so I stopped by his office the next morning and got myself a little lesson. “Weasel words” are words that add little to no value to a sentence. The origin of the term is unclear, one explanation being that weasels suck bird eggs through a small hole and leave the empty shell (a weasel word is basically that shell: looks good on the surface, but has no actual value inside). The term ‘weasel’ does have negative connotations in informal English to describe people being shifty, sneaky, sly or not trustworthy, but weasel words have nothing to do with that. If somebody is pointing out a weasel word in what you wrote, it doesn’t mean they are calling you sneaky or sly, it just means they want you to replace fluff with data and facts.
That was not quite the auspicious start I was hoping for. I was no stranger to promotion documents: I had been promoted a few times at Microsoft, and I had promoted four of my engineers during my stints as a manager there as well. Nobody had ever complained about the promo docs I had written before. But Amazon was a different company, and Amazonians have a particularly visceral reaction to fuzziness.
So I went back and tried to replace as many weasel words as I could with hard facts and data, removing personal opinions, subjective statements, fluff and words that didn’t add a lot of value. I sent him revision #2. It too came back the following day, but this time with about 20 highlighted sections.
The only comment again was “weasel words!”
The document went back and forth about four more times until there were no weasel words. With every revision, I got more and more annoyed that he kept sending it back. But with every revision, I learned more about myself, who I was, and how to tell my story. My manager wasn’t being a jerk, but he knew perfectly well my current writeup was never going to hold water in front of the promotion committee.
That was my trial-by-fire introduction to the way Amazon handles weasel words. Ever since, I’ve been very passionate about removing fluff in technical documents, and replacing subjective statements with facts and data whenever possible.
 How could teams with high code coverage have high operational load???
Fast forward a year. In 2012, getting busy Amazon developers to write proper unit tests was like pulling teeth. Today, it’s an industry-standard practice, but ten, fifteen years ago it wasn’t. I created a tool that measured code coverage in some novel ways, which helped highlight gaps in unit tests. I was excited that it was shining the spotlight on a problem (lack of proper test coverage) and it was changing behavior and culture (turning unit tests into something that you always did). My team liked it, so I wanted to spread it to other teams. I worked on it during weekends and nights as a side project to generalize it, and put it out there for anybody in the company to use. To my disappointment, it didn’t initially take off.
My grassroots, bottom-up approach for adoption had not worked. Could a top-down approach work? I wondered. So I grabbed some time with my VP, and explained my idea to him. I wanted his official endorsement for my little tool, so that his managers would give it a try. He was open to the idea, but wanted data to prove it was worth it.
My thesis was simple:
- My tool highlighted gaps in code coverage so that developers could address them. It would drive higher code coverage.
- Higher code coverage meant fewer areas of the product going untested.
- Fewer areas of the product going untested meant fewer defects escaping to production.
- Fewer defects escaping to production meant less operational pain for the team.
So I spent a week enabling my tool for every single team in my org, grabbing code coverage metrics and ticket count for each team, running statistical analysis over the last couple of years and plotting it all in a fancy Excel spreadsheet. I did this for about 45 teams.
I found out that teams with high code coverage had high operational load.
Wait, what ????
This was the opposite of what I expected. It didn’t make any sense. Intuitively, teams that had giant gaps in test coverage should be more likely to be letting issues escape to production, no?
I triple-checked my data and my math. It was all correct. I was extremely annoyed with my findings. I had a very uncomfortable 1:1 with my VP when I had to tell him my data was disproving my thesis. He just smirked and said, “Well, go figure out why!”
I spent the next couple of weeks diving deep into this correlation. Eventually, the culprit dawned on me.
Teams that had higher code coverage were generally teams with a stronger emphasis on operational and engineering excellence. With that emphasis on operational excellence came excellent prod telemetry and very tight thresholds for alarming, because these teams cared deeply about knowing when their systems were operating in a degraded state, and about intervening and mitigating early. They had lots of alarms, which in turn led to a pretty high number of auto-cut tickets from the monitoring system, thus high operational pain.
The teams with low code coverage generally had not put a great deal of effort or thought into their metrics and alarms, so they didn’t have a lot of auto-cut tickets. Issues were escaping to production, but they just weren’t being caught. Didn’t their customers complain? I wondered. Actually, no, because they had gotten used to those systems being unreliable so they had wrapped calls in retries, which masked flakiness. Some had even built entire mitigation mechanisms like caching to withstand a flaky dependency being down for a little while. They didn’t have great systems, they had crappy systems, but they didn’t know how crappy they were!
The foundation of my thesis was flawed. In fact, my thesis was just too simplistic. As I started thinking deeper about the problem, I realized that code coverage was one of dozens of dimensions that could impact operational load. More alarms could impact operational load, as could higher code churn, using more unproven cutting-edge technology, seniority of the engineers in the team, product complexity, and many more.
Even if I could have proven causation or correlation between higher code coverage and low operational load, I had also just realized that operational load was a poor proxy for product quality anyway.
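To make the trap concrete, here is a minimal sketch of how a hidden third factor (how many alarms a team has configured) can flip the sign of the correlation. Every number in it is made up for illustration and is not the data from my analysis; the `tickets_seen` model, the `TICKETS_PER_ALARM` constant and the "detection improves with alarms" rule are all assumptions I am introducing just to show the effect.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical teams: (coverage %, alarms configured, defects escaped to prod).
# Invented values: teams with more coverage also have more alarms.
teams = [
    (30, 2, 14),
    (40, 3, 15),
    (45, 5, 11),
    (60, 12, 9),
    (70, 20, 8),
    (80, 25, 5),
    (90, 40, 4),
]

TICKETS_PER_ALARM = 3  # assumption: each alarm auto-cuts about 3 tickets per period


def tickets_seen(alarms, escaped_defects):
    """Tickets a team actually sees (assumed model, not a real formula)."""
    # A defect only becomes a ticket if an alarm catches it; we assume the
    # detection rate grows with the number of alarms, capped at 100%.
    detection_rate = min(1.0, alarms / 20)
    return alarms * TICKETS_PER_ALARM + escaped_defects * detection_rate


coverage = [c for c, _, _ in teams]
defects = [d for _, _, d in teams]
tickets = [tickets_seen(a, d) for _, a, d in teams]

r_tickets = pearson(coverage, tickets)
r_defects = pearson(coverage, defects)
print(f"coverage vs. tickets seen:    r = {r_tickets:+.2f}")
print(f"coverage vs. escaped defects: r = {r_defects:+.2f}")
```

In this toy data, coverage correlates positively with ticket count and negatively with escaped defects at the same time. The ticket count is mostly measuring how many alarms a team has, not how good its product is.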
I never did convince my VP to mandate his teams use my tool, but the story does have a happy ending. If you are curious about the rest, you can read about it in my story “When Cats Look at Themselves in a Mirror they See Lions — How I changed the way we treat Code Coverage at Amazon, and got to meet Jeff Bezos!”
[And while I have your attention on the subject of code coverage in general, I co-wrote a whitepaper about Code Coverage Best Practices at Google with a couple of great Senior Staff Engineers that you might find interesting!]
 Getting data in Engineering Productivity improvements is hard
I work in Engineering Productivity. I’m the Tech Lead for Integration Testing Products at Google. My ultimate goal and passion is to reduce toil in testing processes for a hundred thousand or so Google engineers, so that they have more time to create the customer-facing features that you all use and love. Before Google, I worked at Amazon’s Builder Tools, also with the charter of company-wide developer tools for editing, reviewing, building, testing and releasing code.
A customer-facing product generally makes money, so it’s pretty easy to quantify the impact of what you do in dollar signs. But how do you quantify your impact when your job is to make others more efficient?
One dimension in which I think about the impact of anything I do is that I do it to help others save resources. That often translates to developer time (which is a very expensive resource), but sometimes can be hardware (required to run the tests, or to operate the production fleets).
But saving resources is only one dimension. The reason people test is to prevent bugs from reaching production environments and impacting real customers. So, I can also translate my impact to fewer defects affecting the people who use our products.
How much time you save people, and how many bugs you prevent, are often very hard to quantify. Sometimes, there’s just no data. Other times, the data is indirect, so you can only approximate it with proxy data.
Back in 2013, I created the load and performance testing platform (“TPSGenerator”) that Amazon uses to ensure thousands of its services are ready to handle peak traffic. In the early days, I needed to quantify its impact to convince leadership to continue investing in it, since it was just a pet project of mine constantly on the verge of getting defunded.
I could easily measure customer adoption: how many teams have onboarded, how often do they run their load tests, how many tests my platform runs daily, how often does it gate a release based on a latency regression. But customer adoption is an indirect metric. Just because more people are using my thing doesn’t necessarily prove that it has more business value to the company. Intuitively, if you think a thing provides value, the more people that use it the more value that it is providing to the company. But you need to prove value first.
I had the belief that the platform reduced the toil involved in load testing, from weeks to days. I figured if I could gather some data to back up that claim, it would augment my customer adoption metrics. The math could be pretty easy. I have X customers, I’m saving Y days of toil per customer, so I’m saving the company X*Y/365 years of developer time.
I conducted a large survey, asking people how long they spent load testing before they used my platform, and how much time it took them to do it using my platform.
This data was still subjective. None of the people responding to the survey had measured with a stopwatch exactly how much time they had wasted, but each had a rough estimate. Even though each individual data point was subjective, aggregating it at large scale added credibility to the metric, because the sentence “XXX developers responded to the survey indicating that the framework saved them YYY days of toil” has a lot more credibility than “Carlos feels the framework saved ZZZ days of toil.”
A survey like that wasn’t guaranteed to be perfect. Maybe the people I sent the survey to weren’t representative of the entire company (selection bias). Or maybe their assessment of toil was way off.
I realized I was never going to gather a flawless metric, but I didn’t want to let perfection be the enemy of good. Between the time saved I had gathered from my survey (an estimate), and the actual number of customers using my product (a fact), I could say with some degree of confidence that the product had saved the company NNN years of toil. Yes, estimate * fact = estimate, but I got a huge amount of value out of deliberately measuring the impact of my work, even if it wasn’t statistically sound.
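The arithmetic is simple enough to sketch. The snippet below uses invented survey answers and an invented customer count (they are placeholders, not my real numbers), takes the median as the "typical" answer, and reports a low-to-high range alongside it so the result reads as the estimate it is. The 365 divisor comes straight from the X*Y/365 formula above; the choice of median and quartiles is just one reasonable way to summarize, not the method I actually used.

```python
from statistics import median, quantiles


def years_of_toil_saved(customers, days_saved_per_customer):
    """X * Y / 365: total customer-days of toil saved, converted to years."""
    return customers * days_saved_per_customer / 365


# Hypothetical survey answers: days of toil each respondent says they saved.
survey_days_saved = [3, 5, 5, 8, 10, 10, 12, 15, 20, 30]

# Hypothetical number of teams using the platform (the "fact" in the product).
customers = 400

# Summarize the subjective answers: a typical value plus a spread.
typical_days = median(survey_days_saved)
q1, _, q3 = quantiles(survey_days_saved, n=4)  # lower and upper quartiles

typical_years = years_of_toil_saved(customers, typical_days)
low_years = years_of_toil_saved(customers, q1)
high_years = years_of_toil_saved(customers, q3)

print(f"Typical estimate: {typical_years:.1f} years of toil saved")
print(f"Plausible range:  {low_years:.1f} to {high_years:.1f} years")
```

Reporting the range next to the headline number is a cheap way to stay honest: the customer count is a fact, but the days saved per customer is only as good as the survey behind it.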
I also started gathering personal stories. Every email that I got from a customer saying “your thing is cool, I was able to do X that I couldn’t do before!” went into a document that eventually was 23 pages of customer testimonials. This wasn’t hard facts, but it captured a feeling of engineering satisfaction with what I was building.
Don’t be afraid to capture subjective information — it’s a good sanity check on the data that you’re gathering. It augments it. If your hard-facts and data tell you you’re successful, but your customers tell you you’re not… you’re not. And vice versa. There should be consistency between the two to have a credible, holistic story.
Some parting thoughts
- Continuously gather data to validate your belief system, and replace fuzzy statements with objective, data-driven facts whenever possible.
- Even if you think the data will not be 100% accurate, even if it’s a rough estimation, don’t let perfection be the enemy of good.
- Augment objective pieces of data with subjective customer opinions; together they should tell a cohesive story.
- Sometimes, even regularly thinking about what data you need is a useful exercise.
- If your thesis is flawed, your entire data gathering process can fail.
- When data debunks your thesis, apply common sense and dive deeper!