How We Deploy Features For Columinity

Christiaan Verwijs
Published in The Liberators
8 min read · May 6, 2024

Many years ago, I wrote a provocative post that argued that separate testing and acceptance environments (DTAP or OTAP) are wasteful. Many people loved it; others hated it. My view on the post has mellowed a bit over the years. However, I still greatly prefer simplicity over complexity, and DTAP pipelines can be home to gargantuan complexity.

In this post, I want to take a more personal perspective. I will explore how we deploy features for Columinity (formerly “Scrum Team Survey”). I've been developing this software-as-a-service (SaaS) platform for the past three years with Barry Overeem. It's grown quite a bit over time and matured into a professional platform with many customers. Since I believe you have to “eat your own dog food”, the question is: what does our deployment pipeline look like?

Recap: Why I Dislike DTAP/OTAP

My post was based on my observations of teams and clients that I worked with at the time. I noticed that their extensive DTAP environments created (long) queues, led to production releases that batched many smaller releases, and reduced actual responsiveness. They also confused teams, who had to work out which features had been deployed to which environment. Was the feature from their previous Sprint already running on “Acceptance,” or was it still in “Testing”? Here is how I visualized the issue with DTAP flows:

So, I described DTAP as an anti-pattern to Agility. Instead, I argued for a much simpler setup where, ideally, each feature is deployed directly to production if it passes all the necessary tests. This removes much of the need for separate environments for testing and acceptance, which all need to be maintained, kept in sync, and coordinated. Such a deployment strategy also reduces the risk inherent to batched releases, which include many features and changes and thus present more vectors where things can break unexpectedly. I visualized this like so:

I painted the picture of what such a team looked like:

Imagine a team that can reliably deploy a single feature to a live environment without disrupting it. They develop new features in a development environment. The team integrates work through version control (e.g. Git). Features are developed on a feature branch in their version-control system.

When a feature is done, it is merged or pulled into a ‘main’-branch in their version-control system. A build server picks up the commit, builds the entire codebase from scratch, and runs all the automated tests it can find. When there are no issues, the build is packaged for deployment. A deployment server picks up the package, connects to the webservers, creates a snapshot for rapid rollback, and runs the deployment.

The web servers are updated one server at a time. If a problem occurs, the entire deployment is aborted and rolled back. The deployment takes place in such a manner that active users don’t experience disruptions. After deployment, a number of automated smoke tests are run to verify that critical components are still functioning and performing. Various telemetry sensors monitor the application throughout the deployment to notify the team as soon as something breaks down.
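The rolling update described above can be illustrated with a minimal sketch. It is not taken from any real deployment tool: the `Server` class, the `rolling_deploy` function, and the smoke-test callback are hypothetical names, and "deploying" is reduced to changing a version string. It only shows the control flow: snapshot first, update one server at a time, and abort and roll back everything on the first failure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    """Hypothetical stand-in for a web server running one version of the app."""
    name: str
    version: str


def rolling_deploy(
    servers: List[Server],
    new_version: str,
    smoke_test: Callable[[Server], bool],
) -> bool:
    """Update servers one at a time; roll back all of them on the first failure."""
    # Snapshot the current versions so a rollback can restore them.
    snapshot = {server.name: server.version for server in servers}

    for server in servers:
        server.version = new_version
        if not smoke_test(server):
            # Abort: restore every server to its snapshotted version.
            for s in servers:
                s.version = snapshot[s.name]
            return False
    return True


if __name__ == "__main__":
    fleet = [Server("web-1", "1.0"), Server("web-2", "1.0"), Server("web-3", "1.0")]

    # Hypothetical smoke test that fails on web-2 to demonstrate the rollback.
    ok = rolling_deploy(fleet, "1.1", smoke_test=lambda s: s.name != "web-2")
    print(ok, [s.version for s in fleet])  # False ['1.0', '1.0', '1.0']
```

In a real pipeline the smoke test would call health endpoints and the snapshot would be an actual server or container image snapshot; the sketch only captures the ordering of the steps.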

If it's easy to release to production, there are also more moments to celebrate

I still support this message. However, I realize it's probably not feasible for larger-scale products in most settings. My experience with the simpler approach was based on a few product development teams I’d been part of. I shouldn’t have made generalized statements based on a small and biased sample. I lacked proper evidence to support them.

But do we practice what we preach if our money depends on it? Let's now dive into Columinity.

A bit of background on Columinity

Over the past three years, we have built Columinity to help teams drive continuous improvement based on data. We created the tool because we know from experience how difficult this can be. Where is improvement most important? What kind of improvements make sense? How do you track improvement over time and know when to celebrate? Here is a quick tour of the platform:

To date, over 28,000 people have used the tool, across 11,329 teams. We have over 50 paying customers across all time zones, many participating with 10 teams or more. Our largest customers use our platform with 50 or more teams.

Our ecosystem comprises 13 micro- and macro-services and 7 support services (MariaDB, Redis, RabbitMQ, Exim, etc.). Whenever a new version of a service is released, it is built into a new Docker container, pushed to our Docker repository, and deployed from there. Our codebase has over 8,000 unit tests, 300 integration tests, and 100 UI tests.

At the time of writing, our platform has had a 100% uptime for the past 90 days. We haven’t experienced major disruptions in the three years we’ve been online (fortunately!).

Our Deployment Pipeline

So what does the deployment flow look like for Columinity? Funnily enough, I did start with a separate staging environment to review features with Barry Overeem and other stakeholders before deploying them to production.

As I should’ve known, this staging environment started forming queues. I kept expanding the number of features in staging before deploying the whole package to production. This was risk-avoidance behavior on my part. I wanted to stabilize a little more, add one nice tweak, or add one adjacent feature now that I’m at it anyway. Before I knew it, I had two or more weeks of work running on staging but not production. It gave me an excuse to mark a feature as “Done” when deployed on staging. I also started to postpone releases to production because I worried about how those larger batched releases could break something that slipped through the tests, which made things worse. This wasn’t responsive. It wasn’t Agile.

I removed the staging environment, leaving me with only a local development environment and a production environment. This is what the deployment flow looks like now:

  1. On my development desktop, I run a Docker Host with a core selection of services that comprise our ecosystem. The containers I run are identical to those running on production but with a “debug” flag to increase logging and prevent outbound interactions (external API calls, emails, etc.). For the new feature I’m developing, I open the associated Visual Studio Project and let it connect to local services it needs (like RabbitMQ, secure token API, etc.).
  2. For each new feature, I usually start a new branch in GitHub named after that feature. I do my work there and pull the results back to the main branch when all automated tests pass, and both Barry Overeem and I are satisfied with it. Ideally, I try to implement a feature in 2 days or less, but I accept up to 5 days. Beyond that, I refine the feature to make it fit and push the rest of the work into the next iteration. If I really can’t do it within 5 days, I put the feature behind a feature flag.
  3. I release low-impact features to production as soon as all tests pass after I merge the code back into the main branch. This is as simple as pulling the latest version of the Docker Container and restarting the associated Docker Service. Features with a larger risk profile (like database migrations and multiple impacted services) are deployed during low-traffic hours on the weekend, so I don’t have to worry that a deployment issue disrupts active users too much. Deployment issues are fortunately very rare and quickly resolved.
  4. We run our Sprint Reviews on production. The upside of this is that participants of our Sprint Reviews can immediately use a new feature after they participate. It also allows us to see what it looks like with production data.
  5. The new week usually starts with a blank slate. I select a new feature I want to add and repeat the process.
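The decision rules in steps 2 and 3 can be written down as a small sketch. This is an illustration, not Columinity's actual tooling: the `Feature` fields and function names are hypothetical, and "low-traffic hours on the weekend" is assumed here to mean Saturday at 06:00, which the text does not specify. The flag rule is also simplified, since the text first tries to refine a feature to fit before resorting to a feature flag.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_DAYS_PER_FEATURE = 5   # upper limit mentioned in step 2
WEEKEND_DEPLOY_HOUR = 6    # assumption: Saturday 06:00 counts as low traffic


@dataclass
class Feature:
    """Hypothetical description of a feature that is ready to plan or ship."""
    name: str
    estimated_days: float
    has_db_migration: bool = False
    impacted_services: int = 1


def needs_feature_flag(feature: Feature) -> bool:
    """Work that cannot fit in the 5-day budget goes behind a feature flag."""
    return feature.estimated_days > MAX_DAYS_PER_FEATURE


def is_high_risk(feature: Feature) -> bool:
    """Database migrations or multiple impacted services raise the risk profile."""
    return feature.has_db_migration or feature.impacted_services > 1


def next_deploy_time(feature: Feature, now: datetime) -> datetime:
    """Low-risk features ship immediately; high-risk ones wait for the weekend."""
    if not is_high_risk(feature):
        return now
    days_until_saturday = (5 - now.weekday()) % 7  # Monday is 0, Saturday is 5
    slot = now.replace(
        hour=WEEKEND_DEPLOY_HOUR, minute=0, second=0, microsecond=0
    ) + timedelta(days=days_until_saturday)
    if slot < now:
        # This Saturday's slot has already passed, so take next week's.
        slot += timedelta(days=7)
    return slot


if __name__ == "__main__":
    monday = datetime(2024, 5, 6, 9, 0)
    tweak = Feature("copy tweak", estimated_days=1)
    migration = Feature("new report schema", estimated_days=4, has_db_migration=True)
    print(next_deploy_time(tweak, monday))      # 2024-05-06 09:00:00
    print(next_deploy_time(migration, monday))  # 2024-05-11 06:00:00
```

Making the rules explicit like this is mostly useful as a checklist; with a single developer, the same judgment is easily made by hand.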

Typically, it takes between 15 and 30 minutes to build a new release from our “main” branch and publish it to production.

Benefits And Caveats

What I like about this approach is that it has very little overhead. The main branch in GitHub always has a running version identical to production, except for a brief period when I merge my work back in. I can easily switch between features by switching branches (although I believe that multiple active feature branches are an anti-pattern). Furthermore, I’m always running a selection of production services on my local machine (just without production data and with a “debug” flag), making it easy to replicate issues.
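One way a “debug” flag can prevent outbound interactions is to route every external call through a single guard. The sketch below is an assumption about how such a flag could work, not a description of how Columinity implements it: the `APP_DEBUG` environment variable and the `is_debug` and `guarded` helpers are hypothetical names.

```python
import logging
import os
from typing import Callable, Mapping

logger = logging.getLogger("outbound")


def is_debug(env: Mapping[str, str] = os.environ) -> bool:
    """Read the hypothetical APP_DEBUG variable from the container environment."""
    return env.get("APP_DEBUG", "").lower() in {"1", "true", "yes"}


def guarded(
    action: Callable[[], str],
    description: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Run an outbound action, or only log it when the debug flag is set."""
    if is_debug(env):
        logger.debug("suppressed outbound call: %s", description)
        return "suppressed"
    return action()


if __name__ == "__main__":
    # Hypothetical email sender; a real one would talk to a mail server.
    def send_welcome_email() -> str:
        return "sent"

    print(guarded(send_welcome_email, "welcome email", env={"APP_DEBUG": "true"}))
    print(guarded(send_welcome_email, "welcome email", env={}))
```

Because the same container image runs locally and in production, only the environment differs, which is what keeps local behavior close to production while still avoiding real emails and API calls.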

Builds are pretty fast. It usually takes between 15 and 30 minutes to build, test, and deploy a new version.

This flow also allows me to respond quickly to emerging bugs and issues. I only have to create a branch for the bug from “main,” fix the issue, and merge it back into “main.” Then, I pull the changes into the feature branch and continue work. It makes us very responsive to user needs, and I love that. It also saves me a lot of time because I don’t have to keep separate environments in the air and coordinate releases.

A downside of this approach is that I occasionally experience queuing when I complete more than one feature with a higher risk profile in a week. In that case, the deployment makes me more nervous because I have to keep track of multiple changes deployed simultaneously. Fortunately, I’ve never had significant issues with deployments, so I probably should be more confident than I feel. I greatly prefer many small incremental releases (like multiple a week) over a larger one during the weekend. For now, this is manageable, though.

So overall, it’s pretty close to the ideal scenario. But with these caveats:

  • I essentially develop Columinity alone. To use this approach with more developers, you need strong Git and branching discipline. You also need a high degree of cross-functionality so that all members contribute to (preferably) one feature. I’ve used a similar approach with some teams I’ve been part of, and it worked well there, but it requires a lot of practice.
  • The local development environments require some set-up. I have one running on my desktop and one on my laptop (for on-the-road work or emergency support). I’ve scripted much of the setup in PowerShell. But you must install Docker locally, run the script, and seed the necessary data to run the services.
  • If non-developers want to review or test a feature, they also need a local development environment. This will require further scripting to simplify the process. Another option is a shared development environment that such users can access. However, if teams don’t work on a single feature together, it will be hard to keep that environment stable.

Closing Words

Several years ago I wrote a provocative post that argued that teams are better off without DTAP and OTAP. In this post, I shared how we deploy features for Columinity, a software-as-a-service platform I’ve been developing with Barry Overeem for the past three years. In short, I think we stick quite closely to the ideal scenario. But I also realize it isn’t reasonable to expect this at a much larger scale.

See how you can support us, at https://patreon.com/liberators.

Christiaan Verwijs
The Liberators

I liberate teams & organizations from de-humanizing, ineffective ways of organizing work. Developer, organizational psychologist, scientist, and Scrum Master.