Zero Downtime Deployments and the Robustness Principle
A deployment is only as strong as its weakest dependency
Suppose you want to make a seemingly simple change to your domain model, such as renaming a field. You have a SPA front-end, a BFF service, a lot of rows in your database, and downstream services that consume your events. How do you perform a zero downtime deployment? You can’t deploy the front-end and back-end at exactly the same time, and even if you could, the SPA is cached at the edge and in the browser, so you have to bust the cache. Plus, there is so much data that the conversion has to be performed incrementally. Even if you could convert it all quickly, you would still have to time that with the deployment of the back-end. And what’s more, you can’t expect all downstream services to upgrade on your timetable.
We need two things: the robustness principle and a deployment roadmap.
The robustness principle says: “Be conservative in what you send, be liberal in what you accept”.
How do we translate the principle to this scenario?
Be conservative in what you send. The back-end cannot return a field that the front-end does not understand. The front-end cannot send a field that the back-end does not understand. The back-end cannot save a field to the database that the event publisher does not understand. The publisher cannot remove a field that downstream services expect or send one they don’t expect. And the back-end cannot remove a field that the front-end expects.
We can’t let any of these cases happen. Otherwise a user might experience an error, which is not zero downtime. This seems insurmountable unless we practice the second half of the principle.
Be liberal in what you accept. The back-end can handle receiving either field from the database or the front-end. The front-end can handle receiving either field from the back-end. The publisher can send both fields until all downstream services have upgraded, and the downstream services can ignore fields they do not understand or handle either field. I like to think of all these cases as natural feature flags. The field name is the feature flag.
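Here is a minimal sketch of what a natural feature flag looks like in code. It assumes a hypothetical rename of `description` to `summary` on an `Item` record; the field names and types are illustrative only and do not come from the example above.

```typescript
// Hypothetical rename: description -> summary on an Item record.
interface ItemRow {
  id: string;
  summary?: string;     // new field name
  description?: string; // old field name, still present on older rows
}

// Be liberal in what you accept: read whichever field is present.
const readSummary = (row: ItemRow): string | undefined =>
  row.summary ?? row.description;

// While the flag is active, persist both names so readers that only know
// the old name keep working and readers of the new name are satisfied too.
const toRow = (id: string, summary: string): ItemRow => ({
  id,
  summary,
  description: summary,
});
```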
From here we have choices. We can start from the front-end and work backward and downstream. We can upgrade downstream first and then work upstream. Or we can start with the back-end and work outward in both directions. We can do some combination. And then finally we need to double back and remove the technical debt of the natural feature flags.
So we need a deployment roadmap. We need to craft the preferred roadmap and review it with the team and any downstream teams. The riskier the change, the more we need to double- and triple-check the roadmap. And we don’t want a protracted roadmap; otherwise it is likely that we won’t double back and remove the technical debt.
I refer to this in my books as a task roadmap and I use a task-branch workflow to execute the roadmap. The need to change the domain model is a story. Then each change to the various components is a task that gets its own branch in the respective git repositories. The roadmap might look something like this:
- Submit pull requests for any downstream services that do not already ignore unrecognized fields.
- Update the publisher to accept either field from the database and send both field names downstream (sketched after this list).
- Add the new field name to the GraphQL schema and add natural feature flags to the mutations and queries: accept either field name from the front-end and write it to the database under the new name, and accept either field from the database and return it in whichever field the front-end requested (also sketched after this list).
- Update the front-end to send and receive the new field name.
- Wait for the data to convert naturally, or write and execute an incremental conversion (also sketched after this list).
- Remove the old field name from the GraphQL schema and remove all feature flags from the queries and mutations.
- Submit pull requests to all downstream services to use the new field name.
- Remove the old field name feature flags from the publisher.
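As a rough illustration of the publisher task, here is a sketch that accepts either field name from the database row and emits both names downstream until the downstream services have upgraded. The `ItemChanged` event shape and the `summary`/`description` field names are assumptions for illustration.

```typescript
// Hypothetical event shape; field names are illustrative only.
interface ItemChanged {
  type: 'item-changed';
  item: { id: string; summary: string; description: string };
}

// Accept either field from the database row, send both names downstream.
export const toEvent = (row: {
  id: string;
  summary?: string;
  description?: string;
}): ItemChanged => {
  const summary = row.summary ?? row.description ?? '';
  return {
    type: 'item-changed',
    item: { id: row.id, summary, description: summary },
  };
};
```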
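For the GraphQL task, a sketch might look like the following. The `Item` type, the `description`-to-`summary` rename, and the `db` data-access helper are assumptions, not the schema from the example.

```typescript
// Hypothetical schema: both field names coexist while the flag is active.
export const typeDefs = /* GraphQL */ `
  type Item {
    id: ID!
    summary: String     # new field name
    description: String # old field name, kept until the front-end upgrades
  }
  input SaveItemInput {
    id: ID!
    summary: String
    description: String # accepted, but stored under the new name
  }
  type Mutation {
    saveItem(input: SaveItemInput!): Item
  }
`;

export const resolvers = {
  Item: {
    // Return the value in whichever field the client requested.
    summary: (row: any) => row.summary ?? row.description,
    description: (row: any) => row.summary ?? row.description,
  },
  Mutation: {
    // Accept either field name from the front-end, write it under the new name.
    saveItem: async (_: unknown, { input }: any, { db }: any) =>
      db.put({ id: input.id, summary: input.summary ?? input.description }),
  },
};
```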
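And for the conversion task, a sketch of an incremental, paged conversion might look like this; the `scan` and `update` data-access functions are assumed for illustration.

```typescript
// Convert old rows one small page at a time so the change trickles through
// without locking the table or requiring a big-bang cutover.
interface Db {
  scan: (cursor?: string) => Promise<{ rows: any[]; cursor?: string }>;
  update: (id: string, patch: object) => Promise<void>;
}

export const convertPage = async (db: Db, cursor?: string) => {
  const page = await db.scan(cursor);
  for (const row of page.rows) {
    if (row.summary === undefined && row.description !== undefined) {
      await db.update(row.id, { summary: row.description });
    }
  }
  return page.cursor; // call again with this cursor until it is undefined
};
```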
Many of these tasks can be developed in parallel, but each task is approved and deployed to production sequentially. With the cooperation of downstream teams, this whole scenario could be rolled out in a matter of hours by an experienced team, with a dozen or so extremely small-batch production deployments. All with zero downtime, right under the users’ noses, with no disruption.
For brevity I excluded regional canary deployments. I will leave that topic for another post.
This was a relatively simple example. But it is representative of the challenges of continuous deployment and delivery that render old approaches obsolete. Today’s distributed systems require small, focused, and well-ordered deployments unless you are willing to disrupt the end users.
Note that the BFF service in this example is an autonomous service. These services have well-controlled dependencies, which leads to much more predictable deployment roadmaps.
For more thoughts on serverless and cloud-native, check out the other posts in this series and my books: Software Architecture Patterns for Serverless Systems, Cloud Native Development Patterns and Best Practices, and JavaScript Cloud Native Development Cookbook.