From legacy to continuous deployment

Solon Aguiar
Published in Checkr Engineering
19 min read · Aug 14, 2023

Legacy software is part of the complex backbone of many organizations, providing crucial functionality to the business and affecting the day-to-day of a large percentage of its users. As such, the maintenance and evolution of these systems are crucial to business success and a big part of engineering teams’ roadmaps.

There are many ways to achieve the same objective

Working with legacy software requires extra effort: most of the original authors are no longer around and, more often than not, the software’s original purpose has long been forgotten or replaced. Despite the extra difficulty, legacy software ought not to be a mythical creature left untouched. We, as software engineers and software organizations, need tools and processes to deal with it so that it continues to successfully deliver business value. Failure to do so could have catastrophic results for the organizations that depend on it.

In this post we will cover how we at Checkr went from manual, blind deployments to full continuous deployment (CD) for some of our legacy systems.

This article is by no means exhaustive, nor does it claim to be a silver bullet for all the problems software teams face when working with legacy software (there’s extensive literature about that already). This is the story of one of the problems that we had at Checkr and how we chose to deal with it. We believe that some of the framework shared herein is transferable to other organizations, but we are fully aware that every situation is different, so feel free to adapt it as you see fit. We sincerely hope that by sharing the challenges we faced and how we overcame them, we will inspire other teams to take on their own challenges and come out happier and more confident in the end.

What’s legacy?

First, we need to define what legacy software means in this context. For this article, we’re taking a simplified definition of legacy: any code that was handed down to us. It doesn’t necessarily mean outdated software, though it was in our case. We use it to refer to code that was written by other people (or teams) in the past and that we now have to maintain and evolve.

Not the kind of legacy we are referring to

Principles

The following subsections will explore the three principles that guided our journey. We didn’t explicitly define these before we started our work (though it wouldn’t have hurt to do so); they just happened to be at the back of our minds based on our experience and goals. Having them around, even implicitly, helped align us as a team and organization around what we wanted to achieve.

Define why you want/need it

This first principle is often overlooked. A lot of the time one can get trapped into thinking that they need something because they hear about it a lot or because everyone else is doing it. Having a clear understanding of the reasons why you want to get to continuous deployment is important to help you solidify your position and advocate for what you need. In some rare cases, this understanding might show you that this is not actually what you need or where you’ll best spend your time.

As individuals working for a business, forming a clear, strong argument (preferably backed by data) will help you get buy-in from the stakeholders who ultimately decide whether or not to sponsor projects like this.

Even though this principle is a bit fuzzy to discuss, we should not take it for granted. Before we set out to work on this project, we clearly articulated to ourselves and to the business the reasons why we wanted to get to fully automated CD: reduce the time it took to get changes out, reduce the size of releases (and thus the likelihood of errors) and reduce the time spent resolving code conflicts. The Checkr management team clearly understood our arguments and the technical and business benefits of a project like this, and gave us the go-ahead to allocate time and resources for it.

Visualize it all

Working without knowing what you’re doing or how you’re doing it is hard. How can you reason about something or have the confidence to do some work if you don’t know how it fits into the big picture? How can you know the effect of some proposed work or change? How do you actually figure out the current state of a system? These are just some of the questions that came to us.

Being able to distill a complex, abstract state, piece of software or interaction into a visual form that you can quickly reason about will help you cross the chasm from the unknown, and the lack of confidence to act that comes with it, into the realm of achievable, plannable work. The ability to have an overview of the current state will give you a deeper understanding of what you are dealing with. It also facilitates communication with other engineers and stakeholders. With different representations you can start to build a picture of what you’re dealing with, plan how to tackle it and foresee the impact of your actions.

Visualization helps planning

Visibility by itself goes a long way, but it has limitations — it is a lagging indicator. It shows things in retrospect, not in real time, because it can’t predict the future — no one can (yet). Though you can map out new interactions, there will always be some unknown. That doesn’t mean that you cannot prepare for it, however. You can and should, which brings us to our next principle.

Create a plan (or many of them)

Even with the utmost planning, guided by insight into the system (powered by visualizations — see the principle above), things will slip through the cracks. It is naive to expect that you’ll foresee all of the outcomes of a decision or that you’ll be able to map out all of the scenarios and contexts in which the software will operate. For all of these and the many other cases that you don’t know of, you need a set of contingency plans, or guardrails.

Especially with legacy systems, which already have multiple dependencies and/or users, it is important to be able to act quickly and mitigate problems. Contingency plans give you confidence in how to act when things go wrong: instead of scrambling to figure out how to do something, you can focus on deciding what needs to be done, while the step-by-step of how to do it is already documented. They also help you evolve your plans as you learn from experience.

Now we will detail what the previously described principles meant for us in practice.

In action 🎬

We know that our situation was unique in many ways (more on that later); however, by sharing how we used the principles described earlier, we can at least provide you with a framework to build upon.

Understanding why we wanted it

The developer experience team was composed of six software engineers, one quality assurance engineer (QAE), one product manager (PM) and one engineering manager (EM).

We had inherited a webhooks delivery system about six months earlier from a team that was tasked with handling other responsibilities. This other team had already inherited this system from some other team. This (webhooks being passed down from team to team) had been the norm ever since the system was created, around the time our company was founded. A lot of the people that had worked on the core functionality of webhooks were not in the company anymore (apart from our co-founder), so constantly reaching out for questions and/or explanations of parts of the code was not viable.

As the new owners of the service, we were tasked with supporting it (the functionality it provides is central to many of our largest customers), making it more performant and evolving it to support new functionality. At the time of handoff, it was running on an old version of Ruby that was reaching its end of life, most of the dependencies were outdated, the dashboards were populated with graphs whose purpose wasn’t clear (including many that didn’t have any data) and the logs didn’t help much. Production deployments, though done via GitLab, needed manual action from the team to confirm them. There were many times when a developer would simply forget to activate the deployment and the code would sit in limbo for days or weeks before it went live in production. On multiple occasions, very large features were deployed together because no one had been keeping track of the deployment queue.
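
To make the gate concrete, here is a simplified sketch of what a manually confirmed production job can look like in GitLab CI. The job name, stages and deploy script are illustrative; this is not our actual pipeline configuration:

```yaml
# .gitlab-ci.yml — simplified, illustrative sketch
stages:
  - test
  - deploy

deploy_production:
  stage: deploy
  script:
    - ./scripts/deploy.sh production   # hypothetical deploy script
  environment:
    name: production
  only:
    - main
  # This job only runs when someone clicks "play" in the pipeline UI.
  # Removing this gate (or replacing it with automatic rules) is the kind
  # of change that lets a pipeline deploy to production on every merge.
  when: manual
```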

Faced with the goal of reducing our support workload and the prospect of evolving the system, we realized that we were accruing extra risk by not having a good grasp of our deployment process, and that it was impacting our delivery speed. We needed to tackle that to be able to move faster and deliver value in a confident and reliable way. Continuous deployment would enable us to accomplish that, since getting there would require us to mature the system as we worked on it.

Visualizing it all

Because we had inherited the webhooks system, we knew that we didn’t have a full understanding of its inner workings, design decisions, vulnerabilities or skeletons in the closet. More importantly, we didn’t know what we didn’t know.

Working on the occasional support issue had given us a minimal understanding of the responsibilities of some components and how they interacted, but we lacked the confidence to reason about any medium- or large-scale change — those were risky. All we could hold on to were the existing tests, and we didn’t know how much we could trust them.

With all of that, we realized that we needed to take a step back and build an understanding of what we were dealing with, to stop flying blind. With more understanding at all levels, from the architectural components all the way to the input APIs, we would have the confidence that we needed to evolve the system. We needed answers to our questions.

Visualizations helped us achieve that. We built two visualizations that gave us all of the data we needed to understand what the situation actually was and what we needed to do to improve it. Neither of these is a novel or groundbreaking idea — these types of diagrams have been around for a while — but only at this point did we realize the benefit of each (and the unlocks that the two combined can create).

How we build and support software

The first visualization the team worked on was not system specific — it was a visualization of our ways of working. We built it by getting the whole team together and inviting everyone to look back at the last few months of work and think about how we spent our days creating and supporting software.

After some deliberation, we drew out a very large diagram with a step-by-step description of each activity we did from ideation of features through on-call support. In the end, it looked like this:

Yes, it was very complicated

This exercise may seem completely unrelated to the goal of achieving continuous deployment — when we were working on it, it wasn’t clear to us either. This activity wasn’t even planned as part of our “path to CD” work — we did it because we were looking for ways to align everyone’s way of working and to reduce repetition. In hindsight, however, we had stumbled across a major unlock. As we mapped how we spent our time as a team, we saw unnecessary work we could eliminate and what we needed to build into our routine to better support the business and our customers. We refactored our day-to-day to account for the extra risk that deploying this new-to-us system would create.

The ability to see it all on a single (albeit very large) diagram allowed us to understand how we could change the way we spent our time to get to where we wanted to be.

The software map

As we thought about our path to CD, it was very clear that we needed a better understanding of what the webhooks system was. The idea is very simple, but it is not easy. The main difficulty was how to build that understanding. Reading through the whole codebase and/or drawing it out (via UML, for example) was not enough, because it wouldn’t answer all of the questions that we had (e.g. what’s the most common failure scenario for this async job?).

We needed something more holistic that gave us visibility into webhooks as a whole system within another, larger and more complex system. If we were to be successful, it was imperative that we understand more aspects of webhooks than just the code.

This is when the idea of a production readiness checklist came into play. These checklists have been around for a while and are a great tool for taking a step back from your system and deeply analyzing its current state. A checklist will prompt you to think about different criteria, such as how your on-call support rotation works, what the areas of improvement and potential gaps are, what the test coverage is for every type of test, how you manage dependencies, what issues commonly arise, how you handle retries, and so on.

We built our own production readiness template (available here) based on the criteria that were important to us in order to allow automated deployments. With our template ready, we created a JIRA epic to track the whole project and added a task to fill out the checklist, which was later picked up by a team member. Once they had filled out the whole checklist, answering all of the questions and adding links to other relevant documents, the team’s tech lead reviewed the document and created follow-up stories for every item that needed to be addressed.

Examples of the follow-up work included (but were not limited to):

  1. Add code coverage calculation to the build system (so that we could obtain reports and examine the deltas between changes) — see the sketch after this list
  2. Write tests for specific endpoints and components (after we had coverage reports)
  3. Update dashboards with specific metrics that we knew were important
  4. Create missing monitors
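
For the first item in the list above, getting a coverage report in a Ruby codebase can be as simple as wiring the simplecov gem into the test setup. Here is a minimal sketch assuming RSpec; the filter and threshold are illustrative rather than the values we actually used:

```ruby
# spec/spec_helper.rb — SimpleCov must be loaded before any application code
require 'simplecov'

SimpleCov.start do
  add_filter '/spec/'     # don't count the tests themselves toward coverage
  minimum_coverage 80     # illustrative threshold: fail the suite if total coverage drops below 80%
end

# ...the rest of the existing spec_helper configuration continues below...
```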

These items themselves became tasks in our backlog that were prioritized and worked on alongside other work in our team. In the spirit of continuous improvement, as we progressed in our work, we noticed gaps in the checklist and iterated on it, adding new criteria that were later filled out and reviewed.

The checklist was a great tool for giving us visibility into areas where we didn’t have it before. It guided our investigation and provided clear pointers to where we needed to invest our time. It also allowed us to cut corners: for example, we didn’t have to understand the whole codebase from top to bottom to get to CD, we just needed to understand the areas that needed attention.

Have a plan

Despite our very best efforts and iteration on the checklist, the follow-up work from it, and the ways-of-working map, we didn’t know what we didn’t know. We knew we needed a backup plan for if (or, more likely, when) things went wrong. So we came up with contingency plans.

These were roughly broken down into the following categories:

  1. Release: these plans related to the release of the software. Examples of questions we answered: how do we roll out the new version? How do we turn feature X on for a certain group? What regions are affected by a change? What systems are affected at every deployment? What dependencies does this system have? What runtime version are we currently on?
  2. Production recovery (aka runbooks): these detailed how to solve an ongoing issue — they were often the hardest to come up with, as problems could be novel every time. Examples of questions answered: How to turn feature X off for a customer? How to restart the database? How to restart all of the servers? How to clear the processing queue? How to retry a request?
  3. Observability: these showed us how to figure out the state of the system at any point by telling us where to look, how to look and how to spot abnormalities. They often relied heavily on metrics, logs and tracing to help the people supporting the system build a picture of its current state. Examples of questions answered: What dashboards exist? What is typical usage like? What does an error look like? How many items exist in the queue right now? How many database connections are open at this moment?
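
As an illustration of the observability category, here is a minimal sketch of answering “how many items exist in the queue right now?”, assuming a Sidekiq-style background queue. This is purely for illustration — the queue name and the queueing setup itself are made up and not meant to reflect our actual system:

```ruby
# Illustrative only — assumes Sidekiq; queue name is hypothetical.
require 'sidekiq/api'

queue = Sidekiq::Queue.new('webhook_deliveries')
puts "enqueued:   #{queue.size}"
puts "oldest job: #{queue.latency.round(1)}s old"    # latency = age of the oldest job, in seconds

puts "scheduled:  #{Sidekiq::ScheduledSet.new.size}" # jobs waiting for a future run time
puts "retrying:   #{Sidekiq::RetrySet.new.size}"     # jobs that failed and will be retried
puts "dead:       #{Sidekiq::DeadSet.new.size}"      # jobs that exhausted their retries
```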

Having these in place gave us the ability to:

  1. Drive team cohesion and collaboration: multiple minds think better than one. By bringing the whole team into the discussion, we gained insights into possibilities that any one individual might have missed. Since everyone on the team was supporting the system, it was important that everyone was heard and contributed to the success of the project.
  2. Reduce impact and recover: Having mitigation plans in place gave us the ability to very quickly reduce the blast radius of a problem by putting all of the tools needed to detect issues early in front of us, and by knowing how to act on the information provided.

Our contingency plans relied on the usual suspects: runbooks and post-mortems. The first step was to assess the process for these (how and when they were created). We had already done that, however, when we mapped and refined our ways of working, so we were confident that we already had the best process in place.

Planning makes a big difference

The next step was to assess the what: what information the runbooks and post-mortems had, or needed to have. For that, we iterated on the production readiness checklist, making sure that there was an item to review each one, another to confirm that they contained all of the information we knew at the time and, lastly, a check to verify that the runbooks and post-mortems would be promptly updated whenever we discovered new information.

Lastly, the main focus of this piece of the work was simply to align the team on the discipline to adopt the practices that we had decided would make our releases safer, regardless of whether they were done manually or automatically.

This may seem anticlimactic and overly simple, but the reality was that when we got to this phase, we were already set up with everything that we needed. We had done a thorough refactoring of our ways of working. We had already made sure that we had people trained in on-call support. We had ways to ramp up new engineers. We could act quickly and effectively to minimize the blast radius of any incident, as well as drive its resolution. Because we had spent so much time on the production readiness checklist and the follow-up work from it, we had already made all of the improvements that we knew we could make.

The unlock here was not that we found a way to know what we didn’t know, or that we achieved 100% knowledge of the system (whatever that may mean). The unlock was that we baked dealing with the unexpected into our day-to-day, so that when it happened, we were as equipped as possible to deal with it.

In the next few sections, we will cover a bit of the background of the organization and the team that created the right environment for this work and share some of the lessons and results.

Not everything can be taken for granted

It is important for us to recognize the unique circumstances that allowed us to be successful in this work. Unfortunately, not all of these exist in all organizations, so we hope that mentioning them will help create awareness that getting to CD takes a lot more than understanding and working on the software being deployed. None of these is strictly mandatory, but missing any of them will definitely make the job harder.

We have also not listed every single thing that contributed to the work (e.g. we left out the fact that we’re using Ruby — perhaps the story would have been very different had we been using COBOL), choosing to focus on some aspects which add more value to this discussion.

Organizational culture

Checkr has a strong culture of ownership and transparency — we trust each other to do our work and hold ourselves accountable. As such, each team has plenty of flexibility to define how it delivers the necessary business value, and can shape the way it delivers its software as long as that fits with the business plans.

More important than all of this, however, is the understanding (at all company levels and roles) that unreleased software is wasteful and creates technical debt. There’s a clear understanding that we only accrue business value as people use our software, which can only happen when the software is live. The sponsorship from managers and executives makes the conversation about when to work on CD a full two-way street, rather than just a bottom-up push.

Automation software

Checkr has been using a continuous integration system for a really long time. This system already provides pipeline capabilities with different environments, gates between environments and execution of steps for promotion. All of that was already in place at the outset of this work; we, the developer experience team, just weren’t using the automation pieces for production deployments.

The same was true on the observability front. There was a lot of tooling available for metrics, traces, logs and error monitoring that we could leverage to get more insight into our services.

The existence of all of these tools is important to emphasize because, in our case, the effort to get to full CD didn’t involve building tooling — the tooling was there, we just weren’t taking full advantage of it. As explained before, there was still a lot of work to be done to get to a place where we could simply activate the automation.

Feature flagging

As we worked on features or modifications that weren’t ready to go live yet, we never held back from merging our code to our main branch. We did so to be able to keep working in small batches, which are easier to review and less risky to deploy. Small batches also reduce the likelihood of code conflicts, saving the team a lot of time.

Feature flags were a major unlock throughout

Our constant merges to the main branch were only possible because we made extensive use of feature flags via flagr, the open-source platform developed at Checkr. We won’t go into too much detail about this, because this in itself is a very deep topic and there’s already plenty of information about it online.
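
For reference, checking a flag against flagr is just an HTTP call to its evaluation API. Below is a minimal Ruby sketch; the host, flag key and entity attributes are made up for illustration and do not reflect our actual configuration:

```ruby
# Minimal sketch of evaluating a flagr flag; host, flag key and entity
# attributes are illustrative, not our real setup.
require 'net/http'
require 'json'
require 'uri'

uri = URI('http://flagr.internal:18000/api/v1/evaluation')   # hypothetical internal host
payload = {
  flagKey:       'webhooks_new_delivery_path',               # hypothetical flag
  entityID:      'account-1234',
  entityType:    'account',
  entityContext: { plan: 'enterprise' }
}

response = Net::HTTP.post(uri, payload.to_json, 'Content-Type' => 'application/json')
result   = JSON.parse(response.body)

# The returned variant key tells us which code path to take for this entity.
use_new_delivery_path = (result['variantKey'] == 'on')
```

Gating unfinished code paths behind checks like this is what let us keep merging small batches to the main branch without exposing incomplete features.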

Lessons Learned

Most of the learnings that we had along the way and after this work have already been described. In true agile fashion, we iterated continuously on what we were doing, so any slip was corrected and any lesson applied as we recognized it, rather than at some fixed point in time. In practice, this meant editing our production readiness checklist, adding new steps to our on-call procedures, reviewing our dashboards and so on.

In the subsections that follow, we will cover a few of the learnings that are more nuanced.

We don’t know what we don’t know

Recognizing what we don’t know, which can be easier when dealing with an inherited system, was a major unlock. With this mindset we were able to approach the project with a fresh and open mind, without built-in preconceptions or judgements. Facing a large challenge this way can be stressful, but the acknowledgement helped us broaden our perspective and stay open to understanding.

Automation is worth the investment

As mentioned earlier, before we started this work there was already a lot of tooling in place throughout the company for deployment automation and observability that we were not using. When we were actually ready to turn on fully automated deployments for each repo, achieving that was a matter of a few configuration changes rather than a whole project in itself.

Having these tools in place already saved us a lot of time. The fact that the company had previously invested in this kind of tooling was a major unlock for us, and for other teams in similar situations.

Dependency updates set us up for success

Early on in the project we decided to activate renovatebot in our repo to get an idea of how out of date we were with regard to the dependencies we carried with us. Renovatebot not only creates automatic update requests, it also builds a very easy-to-parse table (visibility, anyone?) listing all of the available updates. With it, we were able to quickly grasp what we could easily tackle and what we wanted to punt until later (e.g. a build dependency update was safe to take early, since at worst it would break the CI pipeline without any customer impact, while an Object Relational Mapping (ORM) update was a lot riskier, so we waited until we had more confidence in our tests and our automated monitors).

In the end, renovatebot freed us from worrying about dependencies and let us focus on managing the software that we wrote. An unanticipated side effect of this process was that every time we had to patch a new security vulnerability in our library dependencies or move away from an end-of-life version, the action required from the team was minimal, because it usually meant only a minor or patch upgrade. In the rare cases where we needed to do a major dependency upgrade, we were in a much better place to do it.

Culture is crucial

We couldn’t have gotten to CD without a culture supporting it. Engineers and management understood its importance and were keen to take on extra work and responsibilities to get there (e.g. more on-call responsibilities to update dependencies). Without that understanding, people might have seen it as more meaningless work and therefore shunned it.

Results

With team alignment, visualization maps and contingency plans in place, we moved all of the projects (legacy and not) in the developer experience team from manual deployments to full CD. For webhooks, we saw a 15% increase in the number of production deployments quarter-over-quarter and a 10% reduction in lead time — measured as the time between a pull request being opened and it being merged to the main branch. For another service, the changes were even starker: a 280% increase in the number of deployments and a 47% decrease in change lead time.

By themselves, these numbers could be read in two ways:

  1. Engineers are more confident and have a better grasp of the systems on which they’re working
  2. Engineers are more reliant on the guardrails and less cautious when releasing code

To get more insight into this, we compared the number of production issues in our systems — the theory being that #1 would not cause a significant increase in errors, whilst #2 would most likely mean more errors going live. The analysis of those metrics shows that, despite the large increase in activity in each repository, we did not see an increase in the number of errors causing customer impact. The number of production errors remained flat in most of our systems.

We are very happy with the results that we achieved. They are the outcome of a lot of work that is already setting us up for success in the new work coming up.

Our goal with this long post was to share this story as a way to give other people tools to achieve similar outcomes and tackle similar problems. We hope it inspires confidence if you’re facing similar challenges, and that you can take some value from this material. Thank you for checking out our story!
