Every software team wants to be agile, but how can we measure our agility?
In my early days as an Engineering Manager at Bench, we had a management retrospective for the previous quarter. One of the things we all agreed on was that we tended to shy away from deploying because frankly, it scared us. Geordie Henderson, the Head of Engineering at the time, did what good managers do here: he asked us to take this problem and deal with it head-on. “The best way to make deployments less frightening,” he said, “is to do them more often.”
We discussed the issue further, and eventually decided to hold ourselves accountable by setting a goal for total deploys in the coming quarter. We acknowledged that optimizing for deployment count alone could put our system at risk, so we also introduced a tension metric: uptime. Uptime helps keep deployment count honest; there’s no point deploying 500 times if it makes the system less stable. Conversely, deployment count keeps us from gaming uptime; a stable system that never changes won’t support our growing business.
So we picked some scary numbers and called them goals: 500 builds for the quarter, with an uptime of “three nines”, or 99.9%. Three nines was an intentional choice — we needed our system to be stable, but we also needed the ability to move quickly. Four nines would’ve reduced our speed, and two would’ve risked our stability.
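To make the trade-off between the "nines" concrete, here is a small sketch that converts an uptime target into a downtime budget. The 91-day quarter and the function name are assumptions for illustration, not figures from our tooling.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 91) -> float:
    """Minutes of downtime allowed over `days` at the given uptime target.

    Assumes a 91-day quarter (13 weeks) measured around the clock.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows {budget:.1f} minutes of downtime per quarter")
```

Under these assumptions, two nines leaves roughly 22 hours of downtime per quarter, three nines a little over two hours, and four nines about 13 minutes.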
What made these two goals more complicated was that we didn’t yet have the infrastructure to measure either in a scalable way. In order to discover our baseline, we needed to do a bunch of manual work. Deploy count was relatively easy. At the time, we ran roughly one monolithic deploy per day, though generally not on Fridays. At about four deploys a week over a 13-week quarter, this extrapolated to roughly 50 builds per quarter. Uptime was more complicated. We had a distributed system that was not designed to have its uptime measured. So we dug through logs across our system and eventually came up with a number of data points that bounced around but averaged roughly 99%.
What was interesting was that in determining our baseline we had already been able to identify one of the biggest risks to our uptime. As mentioned above, we ran roughly one monolithic build per day. The issue here is that the build ran at 7am — if anything broke, the people who could fix it generally didn’t arrive until after 9am. This of course feels like a ridiculous situation, but it was one that had flown under the radar until we began focusing on these metrics.
We now had goals and a baseline. It was time to start the hard work of improving both those metrics and our ability to measure them. Here’s how we did it.
We use Jenkins to drive our continuous delivery pipeline. Continuous delivery in a polyglot organization like Bench requires consistency and confidence in automated builds, tests and deployments. Every merge to master goes through unit tests, build, integration tests, deployment to pre-prod environment(s), testing of these environments, SDK generation and finally deploy/test to production.
To help us count deployments, we configured Jenkins to publish an event at each step of the pipeline. We then built an internal tracking tool, Platter, to consume these events. Platter displays a real-time view of the deployment stage for every service, and helps our team to understand the status of a service based on which version of the code is currently deployed or being deployed. In addition, it gathers valuable metrics such as deployment count and duration, which can be grouped per service or aggregated.
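As a rough illustration of how pipeline events can be turned into deployment count and duration per service, here is a minimal sketch. The `PipelineEvent` fields and the stage names `build_started` and `deployed_to_production` are hypothetical; they are not Platter's actual schema or the events Jenkins publishes for us.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineEvent:
    """Hypothetical event shape: one record per pipeline step."""
    service: str
    build_id: str
    stage: str        # assumed names: "build_started", "deployed_to_production"
    timestamp: float  # seconds since epoch


def summarize(events):
    """Return {service: (deploy_count, average_duration_seconds)}.

    A build counts as a deployment only once it reaches production.
    """
    started = {}
    durations = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.service, event.build_id)
        if event.stage == "build_started":
            started[key] = event.timestamp
        elif event.stage == "deployed_to_production" and key in started:
            durations[event.service].append(event.timestamp - started.pop(key))
    return {
        service: (len(values), sum(values) / len(values))
        for service, values in durations.items()
    }


if __name__ == "__main__":
    sample = [
        PipelineEvent("api", "1", "build_started", 0.0),
        PipelineEvent("api", "1", "deployed_to_production", 600.0),
        PipelineEvent("api", "2", "build_started", 1000.0),
        PipelineEvent("api", "2", "deployed_to_production", 1900.0),
    ]
    print(summarize(sample))
```

Summing the counts across services gives the aggregated view; builds that never reach production are simply left out of the totals.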
Our team supports a number of business functions such as a Client-facing application, Bookkeeper-facing application, public website and so on. Usually these functions map to a service (or services) with a health endpoint, which allows us to measure uptime with a combination of continuous monitoring and user feedback. Our automated monitoring will pick up issues 9 out of 10 times, but having a feedback channel with our customers (internal and external) is also key for this.
Each week, every minute of downtime for every service is documented. The downtime of each service is weighted in proportion to its importance for our internal and external customers. We then use these weighted scores to calculate overall uptime.
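One way such a weighted calculation could work is sketched below. The formula, the service names, and the weights are all assumptions made for illustration; the exact weighting we use is not described here.

```python
MINUTES_PER_WEEK = 7 * 24 * 60


def weighted_uptime(downtime_minutes, weights, period_minutes=MINUTES_PER_WEEK):
    """Overall uptime percentage, weighting each service by importance.

    Assumed formula: 100 * (1 - sum(w * downtime) / (sum(w) * period)).
    Services without recorded downtime count as fully available.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0 or period_minutes <= 0:
        raise ValueError("weights and period must be positive")
    weighted_downtime = sum(
        weight * downtime_minutes.get(service, 0.0)
        for service, weight in weights.items()
    )
    return 100.0 * (1 - weighted_downtime / (total_weight * period_minutes))


if __name__ == "__main__":
    # Hypothetical services and weights.
    weights = {"client_app": 3, "website": 1}
    downtime = {"client_app": 10, "website": 30}
    print(f"{weighted_uptime(downtime, weights):.3f}%")
```

With this scheme, ten minutes of downtime on a service weighted 3 costs as much as thirty minutes on a service weighted 1.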
Many small deploys
While we were building this infrastructure, we were pushing our teams to hit the “deploy” button more frequently and with smaller changes. Smaller steps lowered risk and made it easier to identify the cause of any issues and roll the changes back if necessary. The end result was that deployments became less scary, and our system became more stable. However, this new way of working exposed another problem in our system.
As we deployed more and more frequently, we started to see issues with our build pipeline. Deploys were stacking up behind long-running CI tasks. We had solved the scary deployments problem, but we had created a new one: our builds were too slow for us to deploy at the rate we wanted. We were agile in aggregate, but we couldn’t react quickly enough in the short term.
The good news was that we now had experience with building infrastructure that allowed us to monitor key agile metrics, so it wasn’t too difficult to add a third: average build time. Platter was already primed to track this metric — we simply needed to expose it in the UI. The bad news was that we had a lot of work to do. Our average build time was 60 minutes.
Our work towards improving build time would require its own series of blog posts, so I’ll summarize. We invested in moving our infrastructure to Kubernetes, which significantly improved our build time (among many other things). We also invested (and continue to invest) in improving our UI-driven tests — we’ve upgraded infrastructure, added computing power, migrated tests to the unit- and integration-level, and deleted tests that weren’t needed. We’ve learned that you can never stop paying attention to build time.
Identifying metrics and building the infrastructure to use them at scale is essential for uncovering inefficiencies and for prioritizing work to tune those metrics. It took us the better part of a year to be able to accurately report on deployment counts, uptime, and build time. We started with setting goals for each of them, and now we’ve graduated from goals to KPIs.
I’m happy to say that the work has been hugely successful. Last quarter we deployed 677 times. That’s more than 10 times per business day with a team of 23. We maintained an uptime of 99.92%, and an average build time of 12:31. As an organization we can now understand our agility not as a nebulous methodology but as a combination of key metrics that we track on a weekly basis.
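For readers who want to check the arithmetic, here is a quick sketch. The 65 business days per quarter (13 weeks of 5 days, no holidays) is an assumption; the other figures come from this post.

```python
BUSINESS_DAYS_PER_QUARTER = 13 * 5  # assumption: 13 weeks, no holidays


def per_business_day(deploys, business_days=BUSINESS_DAYS_PER_QUARTER):
    """Average number of deploys per business day."""
    return deploys / business_days


def parse_mm_ss(text):
    """Convert a 'minutes:seconds' string such as '12:31' into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


if __name__ == "__main__":
    print(round(per_business_day(677), 1))            # deploys per business day
    print(round(60 * 60 / parse_mm_ss("12:31"), 1))   # speed-up over 60 minutes
```

Under that assumption, 677 deploys works out to about 10.4 per business day, and a 12:31 average build is roughly 4.8 times faster than the original 60 minutes.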