The definitive checklist for production-aware deployments
Deploy to production but keep systems stable. Two goals at odds with each other, yet this is what engineers have to contend with, especially when real customers are using your product!
Even Facebook, famous for its motto of “Move fast and break things”, had to pivot to “Move fast with stable infra” (source).
This becomes important when the cost of an outage (reputational, lost revenue) outweighs the benefit of agility, that incremental speed your team gets by cutting a few corners.
So, how many errors are acceptable?
Nobody can tell you that; it’s a question only you and your team can answer. In fact, it’s a good thought experiment to define this number: it’s your Error Budget.
Fleshing out Service Level Objectives and error budgets is a way to make a conscious decision about when it’s time to act on the factors causing the errors. Chapter 3, ‘Embracing Risk’, of the Google Site Reliability Engineering (SRE) book dives deep into the concepts of SLOs and error budgets.
This checklist is borne out of a mixture of experience and lessons from Google’s Site Reliability Engineering (SRE) book. There are a few references to its chapters throughout, and it’s certainly recommended reading.
Sidenote — I’ve added some Midjourney-generated images (prompts included), hopefully not too distracting from the core theme!
Imaginary Scenario
Ask yourself the following questions. Depending on organisational maturity, the scenario below may seem like a relic from the distant past or a recent experience.
1. Who is impacted when things break?
Example
Our most important clients could not log in to the platform during peak usage.
2. How long does production break for? Why?
It took us 2 hours to fix the issue.
Why?
a. The error wasn’t noticed; clients started contacting us and that’s how we found out
b. We had to log onto machines to look at raw text logs to identify the root cause
c. We had to deploy a new version of the software to production and were obstructed by another pending release that needed a manual roll-back
3. Why did production break?
a. We deployed a breaking change (modified a DB column, modified a REST method, etc.)
b. It wasn’t tested sufficiently
c. The change was rolled out to everyone
If this is your reality, then before diving into other practices, implementing Blameless Incident Post-Mortems will be a great step forward. Why?
A lot of the answers will become evident to your team simply through introspection. Reflecting on incidents by going through a structured checklist of questions surfaces insights. It serves as a place to capture critical details that can highlight trending issues. At the very least, it will generate a shared awareness and ownership of production in the team. It’s a practice discussed in more detail further down.
Let’s revisit the above questions. What can be done to avoid a repeat?
- Reduce the number of customers impacted by errors.
  - Feature toggles
  - Client-aware feature toggles
  - Blue Green and Canary deployments
  - Roll out features to your less business-critical customers first
  - Deploy windows: roll out features when your product is less used
- Reduce the chance of an outage, or the duration that it affects your customers. This includes time to detection + time to resolution.
  - Implement Observability and Alerting principles
  - Make small & frequent changes
  - Ensure you have an automated deployment pipeline
  - Have an automated rollback process
  - Design with failure in mind
  - Automated testing
  - Make backwards compatible changes
- When outages inevitably happen, learn from them so that there is a feedback loop to improved outcomes.
  - Blameless post-mortems
  - Know your downtime. Measure outages, identify trends and check if you’re within your error budget.
  - Communicate your outages. Be transparent, and have a status page that doesn’t rely on your own infrastructure.
Production Deployments — Definitive checklist
1. Feature toggles
   1.b Client-aware feature toggles
2. Blue Green deployments
3. Canary deployments
4. Gradual roll out — roll out features to your less business-critical customers first, or by region
5. Deploy windows: roll out features when your product is less used
6. Implement Observability and Alerting principles (Golden Signals, Distributed Tracing, Log shipping)
7. Make small & frequent changes
8. Automated deployment pipeline
9. Automated rollback process
10. Design with failure in mind
11. Automated testing
12. Make backwards compatible changes
13. Cross-functional collaboration
14. Blameless post-mortems
15. Communicate your outages. Be transparent.
16. Decide on your SLOs and error budgets
1. Feature toggles
Feature toggles, as the name implies, allow specific functionality in software to be turned on and off easily.
if (FeatureToggle.IsFeatureEnabled("LightningPayments"))
{
    RenderLightningPayments();
}
Why do this?
- Reduce blast radius by controlling whether new functionality, yet to be proven in a production environment, is available.
- Feature toggles relieve deployment bottlenecks by allowing teams to ship unfinished functionality while keeping it disabled in production. This is particularly helpful when there are dependency chains.
- They also help de-risk a complex deployment by ensuring, at the very least, that new configuration or secrets are in place all the way through to the production environment.
- Revert functionality to a prior state, without carrying out a rollback. Simply toggle the feature back to disabled.
1.b Client Aware Feature Toggles
A variation of the above is to make a feature toggle “client-aware”. A set of client identifiers is mapped against a feature, giving finer-grained control over which clients a feature is rolled out to.
if (FeatureToggle.IsFeatureEnabled("LightningPayments", clientId))
{
    RenderLightningPayments();
}
This extends the binary feature toggle benefits further:
- Protect your most important customers from failures by enabling features for them only after these have been verified on a smaller subset of customers. This is particularly useful when client segmentation is available: one could roll out a feature gradually, from the least valuable client segment to the most valuable one.
- A/B Testing — Evaluate the impact of a particular feature on any number of metrics and decide whether to fully enable or maintain the existing functionality.
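For illustration, here is a minimal sketch of what a client-aware toggle service could look like. The FeatureToggle class, its in-memory storage and the feature/client names are hypothetical; in practice this would be backed by a configuration store or a feature-flag product.

```csharp
using System.Collections.Generic;

// Minimal sketch of a client-aware feature toggle store (hypothetical names).
// In practice this would be backed by configuration, a database or a
// feature-flag service rather than an in-memory dictionary.
public static class FeatureToggle
{
    // Feature name -> set of client ids the feature is enabled for.
    private static readonly Dictionary<string, HashSet<string>> EnabledClients =
        new()
        {
            ["LightningPayments"] = new HashSet<string> { "internal-test", "client-42" }
        };

    // Plain toggle: is the feature enabled for anyone at all?
    public static bool IsFeatureEnabled(string feature) =>
        EnabledClients.TryGetValue(feature, out var clients) && clients.Count > 0;

    // Client-aware toggle: is the feature enabled for this specific client?
    public static bool IsFeatureEnabled(string feature, string clientId) =>
        EnabledClients.TryGetValue(feature, out var clients) && clients.Contains(clientId);
}
```

Rolling the feature out to a new client segment then becomes a configuration change rather than a deployment.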
2. Blue Green Deployments
It’s a deployment strategy that involves two production environments: one is referred to as the blue environment, the other as the green environment. The idea is that only one of them, either blue or green, is live at a time, with the other sitting idle.
In principle:
- Deploy the new version to the idle environment (say, blue).
- Ensure everything works.
- Make ‘blue’ the production environment by pointing your load balancer to it.
It’s not quite as easy in practice for stateful services; databases and anything holding session state add complexity, since both environments need a consistent view of the data.
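As a rough sketch of the traffic switch described above (the ILoadBalancer abstraction and /health endpoint are hypothetical stand-ins; a real setup would use your load balancer’s or cloud provider’s own API):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch of a blue/green cutover: verify the idle environment, then flip traffic.
public interface ILoadBalancer
{
    Task PointProductionTo(string environmentUrl);
}

public class BlueGreenSwitcher
{
    private static readonly HttpClient Http = new();

    public async Task CutOverAsync(ILoadBalancer loadBalancer, string idleEnvironmentUrl)
    {
        // 1. Ensure everything works on the idle environment before it takes traffic.
        var response = await Http.GetAsync($"{idleEnvironmentUrl}/health");
        if (!response.IsSuccessStatusCode)
            throw new InvalidOperationException("Idle environment failed its health check; aborting cutover.");

        // 2. Make the idle environment live by pointing the load balancer to it.
        await loadBalancer.PointProductionTo(idleEnvironmentUrl);
    }
}
```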
3. Canary deployments
Canary deployments are a method for exposing a new release to an early sub-segment of customers (and, ideally, only a small subset of servers). Feature toggles can help with this. Even when a feature is launched into production only to internal test users, it can validate configuration changes, for example, before anyone else is impacted.
A big benefit of canary deployments is that they limit the blast radius to a smaller set of users, and if a sustained spike in errors is detected, the release can be rolled back.
This is logic that can be automated in a deployment pipeline, as sketched below.
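Here is a minimal sketch of what that automation could look like. The IMetrics and IDeployer interfaces, the threshold and the bake period are hypothetical; your monitoring system and deployment tooling would provide the real equivalents.

```csharp
using System;
using System.Threading.Tasks;

// Sketch of automated canary evaluation: watch the error rate of the canary
// for a bake period and roll back if it stays above a threshold.
public interface IMetrics { Task<double> GetCanaryErrorRate(); }
public interface IDeployer { Task RollBackCanary(); Task PromoteCanary(); }

public class CanaryEvaluator
{
    public async Task EvaluateAsync(IMetrics metrics, IDeployer deployer)
    {
        const double errorRateThreshold = 0.01;  // 1% of requests failing
        const int checks = 10;                   // number of samples during the bake period
        var interval = TimeSpan.FromMinutes(1);

        int breaches = 0;
        for (int i = 0; i < checks; i++)
        {
            if (await metrics.GetCanaryErrorRate() > errorRateThreshold)
                breaches++;
            await Task.Delay(interval);
        }

        // A sustained spike (most samples over the threshold) triggers a rollback;
        // otherwise the canary is promoted to the rest of the fleet.
        if (breaches > checks / 2)
            await deployer.RollBackCanary();
        else
            await deployer.PromoteCanary();
    }
}
```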
4. Gradual roll out
Rather than going big bang after the early sub-segment, continue a gradual roll-out towards additional segments of users. The roll-out order can be chosen to minimise the chance of impact to, say, the three highest-revenue customers of your product.
The randomisation can also ensure that it’s not always the same users experiencing the early effects of a release. Another option is to let users opt in themselves (a beta version or early-release channel).
Taking steps to ring-fence a known, tried and tested configuration from one yet to be proven in production reduces the overall blast radius of any issues.
5. Deploy windows: Roll out features when your product is less used
Deploy times anti-patterns:
- Deploy just before you finish the day of work and head home.
  Bad idea. It may feel natural to get things out by the end of the day, but pushing to production before you sign off is the start of something new (monitoring the release), not the natural conclusion of a day of work.
- Deploy before the weekend. Like the above, but worse!
- Deploy during peak usage. This is like a roll of the dice, almost asking for trouble: it puts extra strain on systems that are already under load.
An automated release pipeline can help enforce some simple rules, such as:
- Don’t deploy unless there is a team at hand that can monitor the release.
- Don’t deploy during peak usage for the geography that is being targeted. A typical application has fewer users overnight in its local geography. A ‘follow-the-moon’ deployment and roll-out strategy lends itself well to global deployments.
Allow for overrides as the exception, not the rule, and save your team from firefighting.
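A deploy-window guard in a pipeline can be very simple. In the sketch below, the peak-hour range, the weekend rule and the override flag are made-up illustrative values; the real ones would come from your own traffic patterns and pipeline configuration.

```csharp
using System;

// Sketch of a deploy-window guard a pipeline could run before releasing.
public static class DeployWindow
{
    public static bool CanDeploy(DateTimeOffset nowUtc, bool overrideApproved = false)
    {
        if (overrideApproved)
            return true; // the exception, not the rule

        // Assume peak usage for the target geography is 08:00-20:00 (treating UTC as local here).
        bool isPeak = nowUtc.Hour >= 8 && nowUtc.Hour < 20;

        // Don't deploy on a Friday or over the weekend, when fewer people are around to monitor.
        bool isNearWeekend = nowUtc.DayOfWeek is DayOfWeek.Friday or DayOfWeek.Saturday or DayOfWeek.Sunday;

        return !isPeak && !isNearWeekend;
    }
}
```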
6. Implement Observability and Alerting principles
Don’t fly blind. Implement instrumentation from early on. It helps to know how your services are being used and how the system handles user growth.
A good start is having a unified view of the Golden Signals:
- Latency: how long it takes to service a request, for successful requests vs errors
- Traffic: the demand on your website (e.g. HTTP requests/sec)
- Errors: the rate at which requests fail
- Saturation: a measure of how ‘full’ your system is
Log shipping to a system such as OpenSearch helps you have information at your fingertips when you most need it.
Distributed tracing helps you understand cause and effect by providing a complete view of a request as it travels through the various services in your system. OpenTelemetry is the standard of choice here, with support for many languages and exporters to backends such as Zipkin and Jaeger. Cloud-native options also exist.
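As a small illustration of the instrumentation involved, here is a sketch using .NET’s built-in ActivitySource, which the OpenTelemetry .NET SDK can listen to and export. The source name, class and operation are made-up examples.

```csharp
using System.Diagnostics;

// Sketch of creating a trace span around a unit of work using ActivitySource.
public class PaymentService
{
    // The source name is a made-up example; it's what you'd register with your exporter.
    private static readonly ActivitySource Tracing = new("Checkout.Payments");

    public void ProcessPayment(string orderId)
    {
        // StartActivity returns null if nothing is listening, hence the '?'.
        using Activity? activity = Tracing.StartActivity("ProcessPayment");
        activity?.SetTag("order.id", orderId);

        // ... actual payment logic would go here ...
    }
}
```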
Using an orchestrator such as Kubernetes helps teams fall into the ‘pit of success’, with opinionated takes on logging architecture, metrics and traces.
7. Make small & Frequent changes
The easiest way to reduce your blast radius is to carry out smaller and more frequent changes.
- Small changes carry reduced risk: smaller, more manageable increments don’t allow risk to build up. They also make code reviews easier to understand and manage.
- It’s easier to roll back smaller changes, than larger ones.
- Small and frequent releases allow teams to identify and fix errors earlier on. The whole process becomes more manageable in terms of system complexity and the breadth of domain knowledge needed.
- Shipping to production becomes a non-event, rather than something grand that needs meticulous planning.
- Team morale increases as does the business’ trust in the team, knowing they can deliver frequently.
8. Automated deployment pipeline
Minimal friction on the path to production is the seed for small and frequent releases. Automated deployments are the key here.
- They encode best practices and reduce the chance of error
- Guarantee consistency and repeatability of the process
- Automation means that meaningful metrics can be collected; for example, identifying higher-churn services that need breaking down, or areas of contention across teams.
- Automation is a prerequisite for measuring and evaluating using DORA metrics
Metrics worth capturing
a. Deployment Frequency — how often can your team push a release to production?
b. Lead Time for Changes — the amount of time it takes a commit to get into production.
c. Change Failure Rate — the percentage of deployments causing a failure in production.
d. Time to Restore a Service — how long it takes to recover from a failure.
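As a rough sketch of how such metrics could be derived from deployment records an automated pipeline already produces (the DeploymentRecord shape is hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a record your automated pipeline could emit per deployment.
public record DeploymentRecord(
    DateTimeOffset CommitTime,
    DateTimeOffset DeployTime,
    bool CausedFailure,
    TimeSpan? TimeToRestore);

public static class DoraMetrics
{
    // Deployment frequency: deployments per week over the observed period.
    public static double DeploymentsPerWeek(IReadOnlyList<DeploymentRecord> deployments)
    {
        var span = deployments.Max(d => d.DeployTime) - deployments.Min(d => d.DeployTime);
        return deployments.Count / Math.Max(span.TotalDays / 7.0, 1.0);
    }

    // Lead time for changes: average time from commit to production.
    public static TimeSpan AverageLeadTime(IReadOnlyList<DeploymentRecord> deployments) =>
        TimeSpan.FromHours(deployments.Average(d => (d.DeployTime - d.CommitTime).TotalHours));

    // Change failure rate: fraction of deployments that caused a failure in production.
    public static double ChangeFailureRate(IReadOnlyList<DeploymentRecord> deployments) =>
        deployments.Count(d => d.CausedFailure) / (double)deployments.Count;

    // Time to restore service: average recovery time for the failed deployments.
    public static TimeSpan AverageTimeToRestore(IReadOnlyList<DeploymentRecord> deployments) =>
        TimeSpan.FromMinutes(deployments
            .Where(d => d.TimeToRestore.HasValue)
            .Average(d => d.TimeToRestore!.Value.TotalMinutes));
}
```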
9. Have an automated rollback process
Given that going back in time like some kind of time-traveller to undo a release isn’t quite feasible just yet, it’s worth having a process that makes releases reversible.
The impact of reversible decisions was something Jeff Bezos himself remarked on in his 1997 letter to shareholders (source):
Some decisions are consequential and irreversible or nearly irreversible — one-way doors — and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before. We can call these Type 1 decisions.
But most decisions aren’t like that — they are changeable, reversible — they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can and should be made quickly by high judgment individuals or small groups.
By making deployments easily reversible, they shift away from being slow, careful and deliberate Type 1 decisions towards being more akin to quick Type 2 decisions.
This in turn means that deployment decisions can be delegated to the team. That increases trust, empowerment and ownership, all of which fuel one’s sense of purpose, a key ingredient of intrinsic motivation according to Daniel Pink’s book Drive (Link to Ted Talk).
10. Design with failure in mind
Failures are inevitably going to happen. Connections drop, servers can fail, buggy code brings a service down.
Designing for failure at the very least means being aware of the failure modes of the system and documenting them.
Taking it one step further, it means that the system can continue to operate even when these errors take place.
Common patterns include:
- Introduction of redundancy — deployment to additional servers, cloud availability zones or regions/geographies. Scaling out is a special case of redundancy, which means adding multiple services that do the same thing with the dual goal being to service increased load and to continue to operate when a service goes down.
- Introduction of isolation — Breaking down the monolith into Microservices or Cloud functions to reduce the blast radius of a buggy service and increase the isolation of a faulty component.
Circuit breakers double down on this concept by automatically detecting and isolating the point of failure. This reduces the chance of a misbehaving service causing a failure cascade.
Retry logic & Retry Storm Safeguards — You can’t assume that your dependency will always be available. Mechanisms such as retry logic are a path towards graceful recovery, allowing resumption when the dependency becomes available again. An architectural shift from request/response to async message queue based processing, reduces
Ben Maurer of Facebook has a pretty good keynote video and publication on “Fail at Scale” and mitigations (links: keynote, publication).
11. Automated testing
If you want to move fast without breaking things, you’re gonna have to test for functional (and non-functional) regressions.
So more testing is always right? Not so fast!
It depends. As with most decisions, there are second order effects.
The more tests added to a testing suite, the slower it becomes. This makes the team think twice before pushing out a small incremental change and iterating quickly; the cost of waiting for a long test suite to complete needs to be justified by a non-trivial change to the code base. Left unchecked, this leads to big bang releases, which is exactly what we’re trying to avoid in the first place! It gets worse when there are flaky tests in the suite, which can cause it to fail when in fact there are no errors. These are all important topics to address when settling on a testing approach.
This is where the popular testing pyramid comes into play.
The testing pyramid refers to splitting test runs by duration and level of integration.
More isolated unit tests tend to run quite fast, without needing to integrate with dependencies (which can themselves be another source of errors). Naturally, unit tests form the base of the pyramid: there are more of them, and they run faster and in isolation.
The slower UI-level tests form the tip of the pyramid. They aren’t meant for exhaustively testing every possible permutation of business logic, which unit tests are meant to cover.
A more detailed writeup on how to approach the testing pyramid is beyond the scope of this already long write-up. Head over to read up about the testing pyramid at martinfowler.com
Beyond the testing pyramid, one can also consider when to test. A further suite of automated tests can run in production to verify a configuration change against a test user. This helps as an automated validation step before a change is rolled out as a canary to 1% of traffic, where real users are impacted.
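As an illustration, a production smoke test could be as simple as exercising a critical endpoint with a dedicated test account before promoting the release. The URL, path and header below are made-up placeholders.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch of a post-deployment smoke test run against production with a test user,
// before the change is exposed to real traffic. URL and header names are placeholders.
public static class SmokeTest
{
    public static async Task<bool> CanTestUserLogInAsync()
    {
        using var http = new HttpClient { BaseAddress = new Uri("https://example.com") };
        using var request = new HttpRequestMessage(HttpMethod.Get, "/api/account/me");
        request.Headers.Add("X-Test-User", "smoke-test-account"); // routed to a test account, not real data

        var response = await http.SendAsync(request);
        return response.IsSuccessStatusCode;
    }
}
```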
12. Make backwards compatible changes
Reduce the chance of surprise to downstream services by avoiding breaking changes to existing contracts.
Non-exhaustive examples of breaking changes:
- Modifying the required parameter set for a REST method
- Using a serialiser/deserialiser that requires all properties to be present; when a data transfer object is extended, deserialisation breaks.
- Modifying the name of a property or a data column
- and the list goes on...
These are all design choices. Teams can make design choices that reduce the chance of production errors, by ensuring that changes are incremental and backwards compatible.
Even if a breaking change does need to be pushed out, it’s best for it to be the exception rather than the rule. When the exception does happen, the clients should opt-in to it (by upgrading, pointing to a new location etc), rather than have a dependency suddenly break.
Even when you decide to bite the bullet and “change all dependencies at once”, it won’t necessarily be without downtime. If your service is scaled out, the reality is that until all instances are upgraded, some instances will work and others won’t for the duration of the deployment.
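As a small example of the additive approach, a new field can be introduced as optional with a sensible default, so payloads from older clients still deserialise. This is a sketch using System.Text.Json; the DTO and its fields are hypothetical.

```csharp
using System;
using System.Text.Json;

// Hypothetical DTO: the new 'Currency' property is additive and optional, so
// older payloads that don't contain it still deserialise instead of breaking.
public class PaymentRequest
{
    public string OrderId { get; set; } = "";
    public decimal Amount { get; set; }
    public string Currency { get; set; } = "GBP"; // new field with a backwards-compatible default
}

public static class Example
{
    public static void Main()
    {
        // A payload from an older client, written before 'Currency' existed.
        const string oldPayload = "{\"OrderId\":\"o-123\",\"Amount\":42.5}";

        // System.Text.Json leaves missing properties at their defaults rather than failing,
        // so the old payload still deserialises cleanly.
        var request = JsonSerializer.Deserialize<PaymentRequest>(oldPayload);
        Console.WriteLine(request!.Currency); // prints "GBP"
    }
}
```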
13. Cross-functional collaboration
A lot of this comes down to having a defined (and rehearsed) incident response process, so that coordination isn’t overlooked in the heat of the moment.
When there’s an outage, it’s not enough for the right people to jump into action; effective internal coordination and communication is key.
- Have an incident owner.
- Get the right people in the ‘virtual’ room. A slack channel (or similar) such as #war-room can help with the trigger.
- Log any actions being taken in a public channel, this will come in handy in the review.
- Think broadly — Don’t ignore the people outside the tech team.
For example
- Should marketing halt their Google Ads campaign, rather than pay for the clicks of people who will end up on a site that doesn’t work?
- Should the customer service team reach out to important customers?
14. Blameless post-mortems
The cost of failure is Education
The goal of the post-mortem isn’t to apportion blame. It’s to create a safe space where the team can understand the ground truth of what happened, how the team reacted and what can be learned.
A good starting point is to establish the following
- What happened
- The effectiveness of the response
- What we would do differently next time
- What actions will be taken to make sure a particular incident doesn’t happen again
Facebook has developed a methodology called DERP (an acronym for Detection, Escalation, Remediation and Prevention — source):
- Detection. How was the issue detected — alarms, dashboards, user reports?
- Escalation. Did the right people get involved quickly? Could these people have been brought in via alarms rather than manually?
- Remediation. What steps were taken to fix the issue? Can these steps be automated?
- Prevention. What improvements could remove the risk of this type of failure happening again? How could you have failed gracefully, or failed faster to reduce the impact of this failure?
To generate conversation it’s worth sharing the timeline of events, root cause and notes on what happened in advance. The main outcome is to gain a shared understanding of the facts and improve as a team. These can be circulated in the form of:
- Follow-up action items.
- Lessons learned documented.
What went well, what went wrong, where were we lucky?
If a post-mortem is a one-off event, there’s a risk that the agreed-upon action items are forgotten. There are ways around this (e.g. regular post-mortem reviews, or an agreed priority for post-mortem action items with product owners and teams).
Ultimately, it helps to foster a culture of learning from failure. Again, the Google SRE book has a chapter on embracing such a culture, here.
15. Communicate your outages. Be transparent.
Consider the audience.
- Internal stakeholders
- External customers
Channels of communication:
- A banner over your product highlighting areas of degraded service
- A redirect page to a status page
- A status twitter feed
- A dedicated status page service
- A Teams or Slack channel for internal stakeholders
- Email updates
It’s good to have communication channels that aren’t intertwined with internal corporate infrastructure. For example, if you’re using enterprise security with Slack and your ADFS server is down, then you’ve suddenly lost a critical communication channel.
16. Set your SLOs and Error Budgets
Your product has a service level, whether you decide to measure it or not.
Rather than turning a blind eye to what that is, it’s worth measuring it:
ErrorBudget = 1 - Service Level Objective
There are many ways to reach this number. One approach is to take a time-based view.
How much downtime is acceptable over the course of a whole year? If your answer is about 9 hours a year, or 45 minutes a month, then you’re looking at downtime of approximately 0.1% per year. The error budget is the inverse of your Service Level Objective, which gives an uptime objective of 99.9%, or “three nines”.
If you operate in a single local region and no HTTP requests hit your systems at night, then this metric may not make sense. Instead, aim for a percentage of failed HTTP requests.
Keep in mind that the math of failed requests can throw assumptions off.
As an extreme example, let’s assume that you’ve defined a service with an SLO of 99.9% for a successful HTTP response and your product just about meets it. If your product makes 100 HTTP requests on the home screen, then your visitor has a 0.999¹⁰⁰ chance of receiving a fully successful response, or approximately 90.5%.
If the visitor attempts to visit your site 10 times (assuming no cached responses), the chance of every visit being fully successful drops down to just over 1 out of 3 (0.905¹⁰ ≈ 36.85%)!
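A quick way to check these compounding numbers yourself; a throwaway sketch using the request and visit counts from the example above.

```csharp
using System;

// Worked example of how per-request availability compounds across a page load
// and across repeat visits, using the numbers from the example above.
public static class ErrorBudgetMath
{
    public static void Main()
    {
        const double perRequestSuccess = 0.999; // 99.9% SLO per HTTP request
        const int requestsPerPage = 100;
        const int visits = 10;

        double pageSuccess = Math.Pow(perRequestSuccess, requestsPerPage);
        double allVisitsSuccess = Math.Pow(pageSuccess, visits);

        Console.WriteLine($"Fully successful page load: {pageSuccess:P1}");                 // ~90.5%
        Console.WriteLine($"All {visits} visits fully successful: {allVisitsSuccess:P1}");  // just over 1 in 3
    }
}
```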
In any case, defining an objective and knowing how close to or far from it you are as an organisation is a good way to quantify and prioritise work.
Key takeaway.
A key theme throughout this article is a focus on the following principles:
- Automation (tests, releases, rollbacks, alerts, even remediation steps can be automated)
- Measure what matters
- Have a shared view of what a good level of service means to you, your team and the organisation. Reflect together, learn from mistakes and iterate to make them less frequent.
For further deep dives on the topics mentioned, consider reading the freely available Google SRE books.