Our Journey to Continuous Delivery for a 300+ Person Engineering Team at Compass

John Gerhardt
Published in Compass True North
Sep 22, 2020

“I’ll put it bluntly — as an organization, we were afraid to deploy to production.”

But we were resilient, we pushed through that fear, and you can too. I'll walk through how our engineering team at Compass went from a vast, orchestrated deploy process in early 2019 to a highly decoupled one in 2020, a shift that has unlocked tremendous engineering velocity and transformed how we build software.

Let’s start with a story.

It was a Monday afternoon in early 2019 when an engineer on one of our product teams realized they had missed the cutoff to have their change included in the upcoming deploy window. The engineering leader on the team was eagerly awaiting the deployment of this feature because it would let them email the feature's beta group and tell them they could start using it. The team was even ready to pop a bottle of champagne and celebrate after work.

Photo by Alexander Naglestad

Blocked by manual QA and a bug in an unrelated service, the change couldn't go out, and the engineer missed the deploy cutoff. The cutoff was 4 pm ET, which meant they had to wait another two days to be part of the next orchestrated production release cycle. The team was demoralized. They put the champagne back in the refrigerator, and the celebration two days later didn't feel as sweet.

To some of you reading this, that may not feel like a long time to be delayed. To others, including myself (the engineering leader in this story) at the time, it felt like an eternity.

What was the problem? Unrealized Speed of Innovation.

In 2019 at Compass, our Product and Engineering organization relied on orchestrated deployments to release new software. We were in the middle of breaking up a monolithic Python application into microservices, but our deploy schedule still matched the old monolithic architecture.

Everything went out at the same time. On Mondays and Wednesdays, we deployed to our Staging environment, and on Tuesdays and Thursdays, we deployed to our Production environment. This schedule left the team 24 hours to perform regression tests, communicate any issues, add bug fixes, redeploy, and promote to production. If we couldn't validate that the environment was stable, we'd abandon the deploy window and nothing would get deployed at all. Those days were a crushing blow.

The obvious problem here is that a single team could block the deploy process of dozens of unrelated teams. There was tight coupling, both technical and procedural. It was clear that this was affecting engineering velocity, our product teams' ability to iterate and learn, confidence across the organization, and, most importantly, our customers.

If It Hurts, Do It More Often.

One path forward was to lean into the existing deployment model and make it more reliable. When there were issues, we held post-mortems and tried to identify ways we could address those issues faster so we could still hit the deploy windows. The more we discussed it, the more we were sure this approach wasn't viable. We were a Product and Engineering team of 150 with aspirations to be over 400 by the end of 2019, and we knew we had to break out of this highly coupled model.

To use an analogy, if you only go on a run every three weeks, you’re going to be pretty sore after every run. The more frequently you run, the more your muscles strengthen and expect this workload. Deploys are no different. Your team needs to exercise these sore muscles more often to strip away the parts of the process that create drag and inefficiency.

Photo by Jeremy Lapak

It’s about continuous, daily improvement — the constant discipline of pursuing higher performance by following the heuristic “if it hurts, do it more often, and bring the pain forward.”

— Jez Humble

Create a Vision for the Future of Deployments.

We needed to have a clear picture of where we were going with these changes, and we needed to be able to communicate that. You’ll need to do the same to create buy-in from leadership and your peers.

With more frequent deployments, our Product and Engineering organization could iterate on the product more quickly, which would let our engineers deliver value to our customers consistently. We envisioned a world where our more than 60 teams could all deploy their microservices to production without the extensive orchestrated process we had historically relied on.

We made it very clear that we were going to create a world where deploys were seen as reliable, non-suspenseful events. Eventually, we wanted them to disappear into the background entirely unless something went wrong.

How Do You Get There?

There was pushback. “It’s the only way we know all the currently deployed services work together,” or “We can’t promote that to production because we haven’t seen how it behaves with the previous version of the service it depends on,” were commonly heard on deploy days or as reasons not to move forward.

“Fear of deploys is the ultimate technical debt.”

— Charity Majors

Describing the current state and the ideal destination is often fairly trivial. We can read about Continuous Delivery and Continuous Deployment and say, “We should do that!” If it were that easy, you’d probably already be there and not reading this article.

The difficulty is in describing the N stepping stones that will take you from your current state to your ideal destination. This transition ambiguity is present in any considerable organizational, cultural, and technical change. Some of those steps will be incredibly uncomfortable, a leap of faith, or even a step backward in order to move forward. For some of you, it may feel impossible right now. For others, you might be able to start tomorrow.

Use These 6 Steps to Bring Your Organization to Continuous Delivery.

The first thing you should do if you find yourself in a similar situation is come up with a game plan. This plan may take you a month, six months, or six years to execute, depending on your situation and your company. At Compass, we've been hard at work on it for about a year and a half, and while we're proud of the progress we've made, we're nowhere near done.

Here are the major milestones, each of which I’ll describe in a bit more detail.

  1. Understand the Historical Context. How and why did we get here?
  2. Demonstrate ROI to Leadership and Your Peers.
  3. Enumerate the Cultural and Behavioral Changes Required.
  4. Enumerate the Technical Changes Required.
  5. Create Incentives within the Organization.
  6. Take One Step Forward and Let Newton's First Law Kick In.

Let’s jump into each of these in a bit more detail. If you’ve had a similar experience bringing an organization to Continuous Delivery, please post in the comments. We’d love to hear from you.

Understand the Historical Context.

Photo by Thomas Kelley

If you’re new to the company where you’re feeling pain similar to what I’ve described, take a breath. Avoid the knee-jerk reaction of, “what the heck were these people thinking?! Who thought this was a good idea?!” It is very often the case that the current situation is the result of a series of “least bad decisions” that occur over months or years.

It is paramount to understand the decisions that brought your team to its current state and to appreciate the information that was available when those decisions were made. If you don't, you risk alienating the people whose help you will need the most.

Demonstrate ROI to Leadership and Your Peers.

Photo by StellrWeb

As much as I cringe when I say it, everyone speaks dollars. Turn the problem into dollars and demonstrate that the cost of doing nothing far exceeds the cost of making the investments needed to move to Continuous Delivery and eventually Continuous Deployment.

An example of a sub-workstream might be further parallelizing your CI process to shorten the deploy feedback loop. Let's look at a quick example with a few assumptions:

  • Your current CI process takes 30 minutes.
  • Your CI process runs 250 times per day.
  • With an additional investment of $10k per month, you can shorten it to 10 minutes through increased parallelization.
  • Your team has 100 engineers, and for easy math to demonstrate the scenario, they make $50/hour.

By investing $10k per month, you will free up 5,000 engineering minutes per day worth approximately $130,000 per month. That’s an ROI of 1,200%! I assure you, your CFO will be happy.
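If it helps to sanity-check those numbers, here's the arithmetic as a quick back-of-the-envelope script. The 30-day month is my own assumption (the example above doesn't specify one), which is why the script lands slightly under the rounded figures.

    # Back-of-the-envelope ROI for parallelizing CI, using the bullets above.
    minutes_saved_per_run = 30 - 10          # CI drops from 30 minutes to 10
    runs_per_day = 250
    hourly_rate = 50                         # illustrative engineer cost, $/hour
    days_per_month = 30                      # assumption; not specified above
    extra_spend_per_month = 10_000           # added parallelization cost

    minutes_saved_per_day = minutes_saved_per_run * runs_per_day   # 5,000 minutes
    value_per_month = minutes_saved_per_day / 60 * hourly_rate * days_per_month

    roi = (value_per_month - extra_spend_per_month) / extra_spend_per_month
    print(f"~${value_per_month:,.0f}/month freed up, ROI of roughly {roi:.0%}")
    # -> ~$125,000/month freed up, ROI of roughly 1150%
    #    (rounded above to approximately $130,000 and 1,200%)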

Is this math perfect? Are there holes in the assumptions? Absolutely. Will all those freed-up minutes be used with 100% efficiency? No. You'll never be able to demonstrate the ROI perfectly; what you're looking for is how many orders of magnitude of value you're leaving on the table by doing nothing.

Not only have we found an ROI-positive project for the company, but we’ve also shortened the iteration cycle at the same time. The path to Continuous Delivery will be riddled with small and medium wins like this, even if you stop short of Continuous Deployment. Any time you run into friction in getting the work prioritized, use an ROI framework to demonstrate why the investment is justified.

Enumerate the Cultural and Behavioral Changes Required.

This one is probably the hardest. Culture and process have a way of reinforcing each other over time. If you've historically relied on a monolithic deploy process, it's going to affect how your team thinks about and approaches problems. It's likely even implicitly ingrained in business logic throughout your stack.

One way we were able to demonstrate to both leadership and our peers that this was the path forward was to work with a few early-adopter teams who were conscious of the risk they'd be taking but understood the long-term value of this shift in thinking for the company.

Identifying early adopters and celebrating them throughout the transition is critical to effectively changing the culture around deploys.

Photo by Ross Findon

As far as day-to-day change goes, the devil is going to be in the details. For example, you'll see Slack messages like "We should probably hold off until the next deploy window," or "I think it'd be safer to hold off until Monday." Respectfully press them on "Why?" Ask them what would need to change for that not to be true. What would they need in order to have the confidence to move ahead now? Maybe that thing doesn't change immediately, but it gets people thinking with the right mindset.

Here are a few of the norms we assembled:

  • Align on small, discrete changes. When an engineer has produced the smallest possible change that can stand on its own, it should be deployed to production.
  • The smaller the surface area of any given deployment, the easier it is to reason about and debug issues derived from it.
  • When in doubt, roll back. Regain control through stabilization, then fix the problem and deploy again.
  • Avoid blame when things go wrong. Hold a post-mortem to analyze the root cause, ask what automation needs to be in place to prevent a miss like this from happening again, and create JIRA tickets so the issue doesn't recur.
  • Embrace failure. It's not a matter of "if" something will fail; it's "when." Great engineers plan for that failure in both their code and their product designs.

Enumerate the Technical Changes Required.

I imagine some of you may have scrolled down straight to this paragraph, but the truth is that every team and company will look different. However, there are certainly common components, validations, and processes that ensure a smoother transition to Continuous Delivery.

Not all paths are the same. This was ours at Compass.

One of the main blockers for us was the speed at which we could deploy or roll back one of our services. Some of our build processes took as long as 30 minutes, and so did our rollbacks. If you have to wait 30 full minutes to undo a bad deploy, the instinct to be afraid of deploys is completely valid. We spent the better part of a year working with teams across the organization to containerize their applications and migrate our execution substrate from EC2-based instances to Kubernetes on EKS.

If I know I can undo a deploy in 60 seconds, I’m willing to take on a lot more risk and move faster than if it takes 30 minutes. You’ll need to identify these types of bottlenecks in your own company and address them as a prerequisite. In our case, it was actually a carrot we were able to dangle to encourage teams to migrate to our new and improved deploy tooling.
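To make that concrete, here's a minimal sketch of what a sub-minute rollback can look like on Kubernetes, using the official kubernetes Python client (the equivalent of a kubectl rollout undo). The Deployment name, namespace, and image tag are hypothetical, and this is an illustration rather than our actual deploy tooling; the point is that the previous image is already built and pushed, so undoing a deploy is just another rolling update.

    # Minimal rollback sketch using the official `kubernetes` Python client.
    # The Deployment name, namespace, and image tag below are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()   # or config.load_incluster_config() inside a cluster
    apps = client.AppsV1Api()

    # Point the Deployment back at the last known-good image. Because that image
    # already exists in the registry, the rollback is an ordinary rolling update,
    # which is why it can complete in well under a minute.
    apps.patch_namespaced_deployment(
        name="listing-service",
        namespace="production",
        body={"spec": {"template": {"spec": {"containers": [
            {"name": "listing-service",
             "image": "registry.example.com/listing-service:1.4.2"},
        ]}}}},
    )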

Here are some other key criteria for a successful Continuous Delivery model:

  • Ensure there are unit and integration tests for your service, enforced through automation.
  • Enable the use of feature flags and quiet/dark launches (see the sketch after this list).
  • Shift security checks left in the process so issues are caught before your staging environment, ideally before code is even merged.
  • Ensure container image builds are fast, automated, and reliable.
  • Work to shorten service boot times so that container restarts are fast and stable.
  • Ensure there is thorough monitoring for latency, throughput, errors, and saturation that integrates with automated on-call and incident management tooling.
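To illustrate the feature-flag and dark-launch item above, here's a minimal sketch of a flag-guarded code path. The in-memory flag store, flag name, and group names are hypothetical stand-ins for whatever flag service you use; the point is that the new code ships to production dark and is switched on independently of any deploy.

    # Minimal feature-flag / dark-launch sketch. The in-memory FLAGS dict stands
    # in for a real flag service; flag and group names are hypothetical.
    FLAGS = {"new-search-ranking": {"enabled_for": {"beta-group"}}}

    def is_enabled(flag_name: str, user_group: str) -> bool:
        flag = FLAGS.get(flag_name, {})
        return user_group in flag.get("enabled_for", set())

    def search_listings(query: str, user_group: str) -> str:
        if is_enabled("new-search-ranking", user_group):
            # New code path: already deployed to production, but dark for
            # everyone outside the beta group until the flag is widened.
            return f"new ranking for {query!r}"
        # Old code path stays the default; turning the flag off is instant
        # and requires no redeploy.
        return f"legacy ranking for {query!r}"

    print(search_listings("3 bed in Brooklyn", "beta-group"))     # new ranking
    print(search_listings("3 bed in Brooklyn", "everyone-else"))  # legacy ranking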

Create Incentives within the Organization.

We had a Jenkins job that would kick off, in parallel, all of the service deploys that were part of the legacy orchestrated process. This was the natural place for us to start. As teams demonstrated they had the above checks and balances in place, they were awarded "Anytime Deploy" status.

It was a badge of honor that showed they were ready to embrace Continuous Delivery. When a team achieved it, we'd celebrate the milestone in our #win-room channel in Slack so leadership and peer teams could see the progress being made. Celebrating each small win helped us build momentum and reinforced the shift to Continuous Delivery as positive behavior.

Eventually, we reached the critical mass where every team had earned that status, and suddenly the thought of an orchestrated deployment felt silly. There was no need for the badge of honor because it had become part of our culture.

So How Do You Take The Smallest Step Forward?

Well, you’re here, so that’s a start! My team jokes that I rely on this chart far too much, but it’s a great place to start.

Is It Worth the Time? by xkcd

Regardless of the size of your company — first, quantify the problem. At Compass, we’re on track to deploy about 55,000 times per year as of September 2020. That’s enough to warrant a pretty heavy investment in ensuring our deploy pipeline is in great condition.

Maybe your team or organization has more pressing issues to address, and this genuinely should be lower on the list. The first step is to quantify the problem and socialize it with your organization to get buy-in.

Do you have a similar story? Are you interested in solving similar problems at a rapidly growing company? Have a chat with us and see if it’s a mutual fit to help us on our mission to help everyone find their place in the world.

Compass is looking for experienced software engineers who are passionate about solving complex problems with code. We’ve taken a novel approach to building business software — focus on the end-user — and it’s been working. Our users love us. Come help us build a product that makes contact management easy and rescue thousands of people from the jaws of clunky, outdated software.
