Deployment Pain: How to eliminate the suffering caused by infrequent deployments

Ivan Kozhirug
Tide Engineering Team
13 min read · Sep 13, 2021

Even if you haven’t heard the term Deployment Pain, chances are that you have experienced this phenomenon. Unless, of course, you’ve been lucky enough to always work at companies that follow the best practices for getting their code to production. Unfortunately, the reality is that most of us are not so lucky, because most companies out there are still lagging behind in adopting these practices.

So what is Deployment Pain?

Deployment Pain is the stress and anxiety that engineers feel whenever they have to deploy their code to production. Have you ever worked at a company where the days before a production release and the hours after it were the most stressful time for you? If the answer is “yes”, then “congratulations” — you have already experienced Deployment Pain. But why do engineers at some companies feel so much stress around production releases? Let’s look at the main reasons.

What causes Deployment Pain?

Martin Fowler: Frequency Reduces Complexity

Bigger release payloads result in increased risks

The short answer here would be that infrequent production deployments are the main cause of Deployment Pain. The longer the period between your production deployments, the bigger your release payload becomes. Even a small team of 4–5 engineers can make significant changes in 2 weeks or so. And the bigger your release payload, the higher the risk associated with your changes going to production.

Deploying even a single change to production carries some risk with it — you can minimize the risk by following the best testing practices, but you can never fully eliminate it. Now imagine deploying 2 weeks’ worth of changes, instead of just a single change — the risk of deploying that big payload gets tens, if not hundreds, of times higher. Higher-risk production releases usually get scheduled during the hours when your system gets the least traffic. This often means that those deployments have to be performed outside working hours. Having to wake up at 6 AM or stay up late in order to do a production release is not most people’s favorite thing.

Risks materialize

Risk is not just a theoretical probability. Sometimes things do indeed go wrong. And if you have just released 2 weeks’ worth of changes and your system starts misbehaving, it might be hard to quickly figure out which change is the culprit. If you are able to identify and fix the issue in a reasonable time, then you’ll most probably do a “hotfix”. But what if you can’t identify the reason, or the fix can’t happen in a short time? Then you’ll need to roll back your release, which means that you’ll need to postpone deploying probably 2 weeks’ worth of changes just because of a single offending change. And what if the deadline for some of the things you wanted to deploy was really close? What if this was your last chance to deploy before a “deployment freeze”? Deployment freezes are quite common in companies with high-traffic periods that are critical in terms of revenue, because those companies won’t take any risks during such periods.

Failed releases result in eroded trust

Failed (rolled back) production releases are one of the major factors that erode the trust between engineers and other departments in a company. Even if the specific failure didn’t have a customer impact, it still means that deploying certain features to production got delayed. You would often need to write an official report explaining why the release failed. Higher echelons of the company would anticipate your next release with even more anxiety. Some companies make things even worse by appointing a non-technical Release Manager or by getting senior leadership figures involved in approving or “signing off” each release. This means that the engineering team needs to convince that non-technical person that the changes are safe to go. Releases would often get postponed because someone feels that the risk is high, but postponing them only makes the risk higher when you attempt the release a few days or weeks later.

You might even witness the absurd situation where a deployment that’s otherwise completely fine gets rolled back. How is this even possible, you might ask? Well, I’ve seen this firsthand. It usually happens when, soon after a release, your system gets affected by a completely unrelated issue — the most common example would be a third party misbehaving. Then, due to the eroded trust and previous failures, the senior leadership might request that the release be rolled back unless you are able to quickly prove and convince them that the issue is totally unrelated.

A recipe for Deployment Pain

As a result of the points mentioned above, releases become a recurring “ritual” often involving multiple people from different teams. Instead of continuing with their day-to-day work, those people usually spend a day or two before the release preparing the deployment and testing the changes.

But scheduled releases put pressure on the teams even before the actual “ritual” starts. When a release has a fixed date and time, and the next one isn’t due for another week or two, a feature that misses the release deadline, a.k.a. the “release cut”, by just a day or two won’t go to production for another week or even two. So even slight delays in the development of a feature can lead to long delays in getting it to your users. This obviously leads to increased pressure on the team, coming from various stakeholders who will always be concerned about missing the next release deadline.

Companies that have infrequent scheduled production releases usually use Git Flow. This means that before a release someone has to spend time merging the development branch into master. In most cases, all the testing is done against the development branch, simply because the companies don’t have enough environments to test each feature branch separately. So if an issue is discovered while testing something during the last days before the release, that feature has to be excluded from the release, i.e. not merged from development into master. This makes the merge process more complicated and time-consuming.

If the release preparation involves manual testing, the entire “ritual” obviously takes more time. For some legacy systems, building the production deployment artifacts can take hours. We won’t discuss extreme cases where companies still have manual steps in their release procedures, e.g. engineers running scripts from their terminals. We hope that those things are just a distant memory, as these days CI/CD pipelines are the norm and you should be able to deploy by simply clicking a button. If you have a formal change approval process, you’ll also need to get a “sign-off”, e.g. from your release manager.

Then the day of the release comes and you have to execute the release, hopefully by just clicking a few buttons in the UI of your CI/CD tool. You hope that the deployment goes fine. Then the team spends some time running smoke tests. If everything is fine, you can declare the release a success. If you notice an issue at this point, or worse, if your customers start reporting issues a few hours later, then you’ll have to do a “hotfix” or a rollback. And the vicious cycle starts all over again…

Solid data proves that infrequent releases are bad for your company

Relying on infrequent scheduled releases means that you are not able to deploy to production On-Demand i.e. whenever you want to, even if that means deploying multiple times per day. This means that delivering value to your customers will be slower and your competitors might be able to react faster to market opportunities. This would also mean that bugs will stay in production for longer because you’ll have to wait for the next scheduled release in order to deploy your fix.

These are not just some logical conclusions based on common sense — there is solid data and scientific research that supports this. You’ve probably heard about the DORA metrics — Deployment Frequency is one of the four key metrics that indicate the performance of a software development team. An excellent source of data, collected using solid scientific methods, is the Accelerate book. The authors’ research, conducted over a period of several years, clearly shows that organizations that are able to deploy to production On-Demand perform much better than those still relying on scheduled production releases. There is also data showing that companies with a formal change approval process (e.g. a release manager signing off releases) perform worse than companies that don’t have such formal processes and/or rely simply on code reviews. Moreover, the research shows that Deployment Pain is one of the main sources of Burnout in a company.

Why are so many companies still stuck with scheduled releases?

The most common reasons include the following:

Monolithic architecture

If you are still actively adding new code to your monolith, then most probably you won’t be able to deploy your changes On-Demand. Depending on the size of your organization, at least several teams are usually involved in developing the monolith. Coordinating their efforts and making sure that the monolith is always in a releasable state is a really challenging task. Multiple teams working on a monolith means that you’ll have dependencies between those teams, and most of the time you’ll have code that is not ready to go to production, blocking an eventual release.

Even if you have fully automated CI/CD pipelines, most probably they’ll take a significant amount of time to complete. Building a monolith is usually slow, and running all the Unit, Integration, etc. tests is even slower. I’ve seen monoliths taking more than 2 hours to build, and there are probably more extreme cases out there. This is definitely not ideal for On-Demand releases.

Tightly coupled services

Using an architecture consisting of multiple services doesn’t always guarantee you On-Demand releases. You might even end up with a distributed monolith, which is the worst kind of monolithic hell. The most common example would be introducing breaking (non-backward compatible) changes to a service’s APIs, which means that one or more other services have to be changed as well and deployed together with the service that introduced the change. Avoid breaking changes to your APIs at all costs!
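To make this concrete, here is a deliberately tiny, hypothetical sketch (the “accounts” service and its field names are made up) of how a breaking rename ripples into every consumer:

```python
# Hypothetical consumer of an "accounts" service, parsing one field by name.
def parse_account(payload: dict) -> float:
    return payload["balance"]  # breaks with a KeyError if the field is renamed


old_response = {"id": "acc-1", "balance": 125.50}
new_response = {"id": "acc-1", "available_balance": 125.50}  # breaking rename

print(parse_account(old_response))  # works: 125.5

try:
    parse_account(new_response)
except KeyError as missing:
    # Every consumer written like this must be changed and deployed in
    # lockstep with the accounts service -- a distributed monolith in action.
    print(f"consumer broken by the rename: missing field {missing}")
```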

Dependencies between teams

Both Monolithic architecture and tightly coupled services will result in dependencies between teams. Waiting for another team to complete something before you are able to deploy to production will guarantee that you won’t be able to release On-Demand. Another possible reason for dependencies between teams is the lack of clear ownership of services. Even if you don’t have a monolith, you might end up with several teams frequently changing the same service. This is usually a signal that your architecture or service ownership structure has some flaws.

Not using a proper branching strategy

Even if you have a decent architecture and teams working independently on separate services, you might still end up with dependencies between the individual engineers in a team. If your engineers often work on features in long-lived branches for days or even weeks before merging, that is a recipe for an ever-growing release payload and, eventually, for blocking On-Demand releases.

Not having proper CI/CD pipelines

Having automated CI/CD pipelines is essential for On-Demand releases. If you still have a significant number of manual, error-prone steps in your release procedure, you’ll be stuck with painful scheduled releases. Examples of manual steps include engineers running scripts from their terminal (including database scripts), applying configuration manually, etc.

Releases requiring a downtime

Needing downtime in order to release a new version of your service(s) is bad, even for scheduled releases. It most definitely means that the releases will be scheduled for the hours when you have the least traffic, which in turn usually means very inconvenient hours for the people performing the release. You’ll need to inform your customers that your system, or at least some part of it, won’t be available for a certain period. Having downtime also puts more pressure on the people performing the release. Each minute of downtime costs your company money, and people usually have to rush in order to complete the release within the announced time frame.

The reasons for needing downtime for production releases can vary. It might be the case that you have just a single instance of a service, or you might have multiple instances but no proper mechanism for rolling upgrades. Non-backward-compatible DB changes, or DB changes that cause high CPU usage or lock tables/records for a long period of time, can also mean that you need downtime. Having downtime precludes On-Demand releases.

Manually configuring your deployment environments

Manually configuring your deployment environments increases the risk of failed releases and of ending up with inconsistencies between the different environments. In general, you should aim to keep your testing, staging, production, etc. environments as similar as possible. Applying changes manually makes this harder and slows down your releases.

Relying heavily on manual testing

Manual testing is slow and error-prone. If you are stuck with testing each of your changes manually, chances are that you will end up with infrequent scheduled releases.

How to eliminate the suffering?

The short answer here would be to just stop using the practices described above. Instead, try adhering to their exact opposites.

Loosely coupled services

Services communicate with each other by using their APIs. This includes both synchronous (e.g. REST) and asynchronous (e.g. commands/events) APIs. Those APIs and the services themselves should be designed in such a way that changing one service wouldn’t require changing other services i.e. the services should be loosely coupled.

If you need to change a service’s API, you should do that in a backward compatible (non-breaking) way. If you really need to make a change that is not backward compatible, then it shouldn’t be done straight away, expecting all other services that depend on this API to change in a matter of days or weeks. Instead, you should have a deprecation period during which you’ll need to support both the old and the new versions of the API.
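As a rough illustration (continuing the made-up “balance” rename from the earlier sketch, so nothing here is prescribed by any particular framework), the service can serve both the old and the new field during the deprecation period:

```python
# Sketch of a backward compatible field rename with a deprecation period.
def account_response(account: dict) -> dict:
    return {
        "id": account["id"],
        "available_balance": account["balance"],  # new, preferred field
        "balance": account["balance"],            # deprecated; removed only after
                                                  # all consumers have migrated
    }


print(account_response({"id": "acc-1", "balance": 125.50}))
# {'id': 'acc-1', 'available_balance': 125.5, 'balance': 125.5}
```

Old consumers keep reading the deprecated field, new ones switch at their own pace, and no two services ever have to be deployed in lockstep.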

Needless to say, you should also avoid having deployment dependencies between your services — the deployment order shouldn’t matter, and a service shouldn’t depend on other services being up and running in order to be able to start. In general, avoid coupling services in any other way, such as sharing a database.

Independent teams

If you already have loosely coupled services then you are one step away from having independent teams. You just need to have clear service ownership — designate a team that is the service owner and avoid having multiple teams working on the same service at the same time.

Trunk-based development

Your main branch, usually called “master”, should always be in a releasable state. In order to have On-Demand releases, you’ll need to make sure that at any point in time the number of not yet deployed commits in “master” is kept to a minimum. Preferably 1 or 0. If you allow that number to grow you’ll end up with a big release payload, carrying most of the risks that scheduled releases have. So after something is merged/committed into your main branch it should go to production as soon as possible.

Use short-lived feature branches. A feature branch should usually contain at most a day’s worth of work. This makes PR reviews easier and results in a smaller release payload going to production after you merge to your main branch. Don’t create any long-lived branches. Stop using Git Flow.

Use Feature Toggles

There will be cases when you need to develop bigger features that require multiple PRs over an extended period. Instead of blocking deployments from your main branch until the feature is completed, you can simply “hide” it behind a feature toggle. Doing this prevents code related to the incomplete feature from being executed. After the feature is completed, you can enable the toggle. Depending on the sophistication of your feature toggle solution, you might even initially enable the feature for a limited set of users only.
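A minimal, hand-rolled sketch of the idea (real projects would more likely use a dedicated solution such as LaunchDarkly, Unleash or an internal config service, and the toggle name and rollout logic below are made up) could look like this:

```python
import hashlib

TOGGLES = {
    # Incomplete feature: already merged and deployed, but limited to 10% of users.
    "new-onboarding-flow": {"enabled": True, "rollout_percent": 10},
}


def is_enabled(toggle_name: str, user_id: str) -> bool:
    toggle = TOGGLES.get(toggle_name)
    if not toggle or not toggle["enabled"]:
        return False
    # Hash the user id so each user gets a stable yes/no answer, which lets you
    # expose the feature to a limited percentage of users first.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < toggle["rollout_percent"]


def onboard(user_id: str) -> str:
    if is_enabled("new-onboarding-flow", user_id):
        return "new flow"  # code path of the still-in-progress feature
    return "old flow"      # existing behaviour stays the default


print(onboard("user-42"))
```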

Proper CI/CD pipelines

Merging a PR into your main branch should trigger your CI/CD pipeline, which should automatically build your service, execute your tests, perform static code analysis and deploy the service at least to your testing environment without any interaction needed from your end. Not even clicking a single button. CD stands for Continuous Delivery, which usually implies that deployments are automated to the point where you simply need to click a button for your changes to go to production. Ideally, you should be able to go a step further, eliminate the manual “click a button” step and have Continuous Deployment. In order to achieve this, your pipeline will need to run a test suite on your pre-production environment(s) before continuing with the automatic production deployment. Only a failed test should prevent the pipeline from automatically deploying your changes to production.
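Real pipelines are defined in your CI/CD tool’s own configuration (GitHub Actions, GitLab CI, Jenkins and so on), so the following Python sketch only models the flow described above; the stage names are illustrative, and the single flag marks the difference between Continuous Delivery and Continuous Deployment:

```python
# Conceptual model only -- not an actual pipeline definition.
def build() -> bool: return True
def run_tests() -> bool: return True
def static_analysis() -> bool: return True
def deploy_to_testing() -> bool: return True
def smoke_tests_on_testing() -> bool: return True


def wait_for_button_click() -> None:
    print("waiting for someone to click the release button...")


def deploy_to_production() -> None:
    print("deployed to production")


def run_pipeline(require_manual_approval: bool) -> None:
    # Every stage runs automatically; any failure stops the pipeline.
    for stage in (build, run_tests, static_analysis,
                  deploy_to_testing, smoke_tests_on_testing):
        if not stage():
            raise SystemExit(f"pipeline stopped: {stage.__name__} failed")

    if require_manual_approval:
        wait_for_button_click()   # Continuous Delivery: production is one click away
    deploy_to_production()        # Continuous Deployment: only a failed test stops this


run_pipeline(require_manual_approval=False)
```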

Get rid of any formal change approval processes

As mentioned above, there is solid data showing that formal change approval processes lead to lower organizational performance. Replace them with PR (code) reviews. Run static code analysis tools in your CI/CD pipeline; they should be able to spot issues even before the PR review. Having proper monitoring and alerting on your production environment also helps a lot. It should notify you about any issues with your newly deployed changes soon enough, limiting the time your customers are impacted.

Zero downtime deployments

Whether you are deploying in the Cloud or on-premises, modern-day technologies make zero downtime deployments really easy. The most common approach is to use Container Orchestration Frameworks like Kubernetes, AWS ECS, Nomad, etc. You can even separate production deployments from releases, i.e. you should be able to deploy your changes to production but initially make them available only to a limited set of test users. You can use methods like Canary or Blue-Green Deployments to achieve this.
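For illustration only (in practice the traffic split is handled by your load balancer, service mesh or orchestrator rather than in application code), the idea behind a canary release can be sketched like this:

```python
import random

DEPLOYED_VERSIONS = {"stable": "v41", "canary": "v42"}  # both deployed and healthy
CANARY_TRAFFIC_PERCENT = 5  # start small, increase as confidence grows


def route_request() -> str:
    # A small share of requests is routed to the new version; everyone else
    # keeps hitting the stable one, even though both are in production.
    if random.randrange(100) < CANARY_TRAFFIC_PERCENT:
        return DEPLOYED_VERSIONS["canary"]
    return DEPLOYED_VERSIONS["stable"]


print([route_request() for _ in range(10)])
```

Ramping the canary percentage up to 100 is the actual release; the deployment itself happened earlier and carried no user-facing risk.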

Containerization & Infrastructure as Code (IaC)

By using Containers you can ensure that your services will always run in a predictable and consistent environment. You also need to be able to provide the infrastructure that your containers run on in a repeatable and consistent way. Using Infrastructure as Code (IaC) is your best choice. This will allow you to deliver new environments rapidly and could save you countless hours debugging issues caused by differences in your environment setup.
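As a small example of what “environments described in code” can look like, here is a hedged sketch using Pulumi’s Python SDK (Terraform, CloudFormation or CDK would serve the same purpose, and the resource names below are made up):

```python
import pulumi
import pulumi_aws as aws

# The environment is described in code, reviewed like any other PR and applied
# repeatably, instead of being clicked together by hand in a console.
artifacts_bucket = aws.s3.Bucket(
    "deployment-artifacts",
    tags={"environment": pulumi.get_stack()},  # e.g. "staging" vs "production"
)

pulumi.export("artifacts_bucket_name", artifacts_bucket.id)
```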

Minimize manual testing

Manual testing should only be used as a last resort. Use it only for high-risk changes for which you don’t have time to automate the testing before the upcoming release.
