Ops Mitigation Triangle

— Consider these options

Let’s ignore the elephant in the room, the diagram above, and consider a scenario.

It’s crazy o’clock in the morning and you are the on-call engineer. An alert has been triggered, you have checked the logs and you are aware of the rough area affected. Time to take action! Your immediate focus is to mitigate the issue. What tools do you have available?

This is where we can address the diagram and go through what (I think) are the best options to have in this situation. The ordering of these tools is based upon the order of your train of thought as the on-call engineer.

Feature flags

If you are new to feature flagging you can read about it in Martin Fowler’s post. This powerful mechanism opens many doors but comes at a heavy price. The basic idea is that execution paths of your software are wrapped in conditional statements whose outcome can be changed quickly. Here, “quickly” means almost immediately: for example, a websocket event pushed to a web page that changes the content on the fly. Another approach is to read the flag value from a configuration DB table or file.
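To make the idea concrete, here is a minimal sketch of a flag read from a configuration file. The file name, the `new_checkout` flag and the helper functions are all made up for illustration; a real setup would more likely use one of the services mentioned further down.

```python
import json
import tempfile
from pathlib import Path


def load_flags(path: Path) -> dict:
    """Read flag values from a JSON file; a missing file means no flags are on."""
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        return {}


def is_enabled(flags: dict, name: str) -> bool:
    # Unknown flags default to off, so the old code path is the safe fallback.
    return bool(flags.get(name, False))


def checkout(flags: dict) -> str:
    # The conditional wraps the new execution path; the old one stays in place.
    if is_enabled(flags, "new_checkout"):
        return "new checkout flow"
    return "old checkout flow"


# Hypothetical flag file; this could equally be a row in a DB table.
flag_file = Path(tempfile.mkdtemp()) / "flags.json"
flag_file.write_text(json.dumps({"new_checkout": True}))
result_on = checkout(load_flags(flag_file))

# Changing the stored value changes behaviour without a deploy.
flag_file.write_text(json.dumps({"new_checkout": False}))
result_off = checkout(load_flags(flag_file))
print(result_on, "->", result_off)
```

Defaulting unknown flags to "off" is a deliberate choice in this sketch: if the flag store is unreachable, the application falls back to the code path that existed before the feature.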

As you can imagine, the conditional logic has to be implemented manually, usually together with the feature itself. This incurs slight, but ever-increasing technical debt. Without careful management, the codebase will end up a convoluted spaghetti mess of feature flag logic. It would be wise to feature flag shrewdly. The “Where to place your toggle” section in Martin Fowler’s post should give you some ideas.

Back to our on-call scenario.

If the feature causing the issue is wrapped in a flag, you’re in luck!

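A simple switch of the value should mitigate the problem almost immediately: within milliseconds if the change is pushed to the application, or on the next read if the flag is polled from a table or file. This makes feature flagging a key player and your first pick in your tool belt of mitigations.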

As a bonus, a lot of feature flagging solutions let you not only turn a flag on or off but also turn it on for a percentage of your users or requests. This allows for a canary release of your features.
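One common way to implement a percentage rollout is to hash each user into a stable bucket. The sketch below assumes that approach; it is not how any particular product does it, and the flag name and user ids are invented.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled_for(flag: str, user_id: str, percentage: int) -> bool:
    # A user is in the rollout if their bucket falls below the percentage.
    return bucket(flag, user_id) < percentage


users = [f"user-{i}" for i in range(1000)]
exposed = sum(is_enabled_for("new_checkout", u, 10) for u in users)
print(f"{exposed} of {len(users)} users see the feature")
```

Because the bucket depends only on the flag and the user id, a given user gets the same answer on every request, and raising the percentage only ever adds users to the rollout. Setting it to 0 is the mitigation.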

Some excellent resources on feature flagging beyond the mentioned blog post:

  • LaunchDarkly: a SaaS-based feature flagging solution.
  • Flagr: an open source Go service.

It is highly recommended not to roll your own feature flagging solution, as what seems like a simple service is very likely to spiral into a rabbit hole of edge cases and performance issues. At that point feature flags become more of a burden, and the costs start to outweigh the benefits.

Blue/Green deployment

Blue/Green deployment is yet another idea emerging from Martin Fowler’s blog. The fundamental concept is based on these prerequisites: you are able to run multiple concurrent instances of your application in isolation and route traffic to either instance or group of instances.

If you separate the instances into a “blue” and a “green” group and make sure that only one group can serve traffic at a time, you can then deploy changes to the non-routed-to group, test the changes, and flip the routing. This gives you a pre-live buffer for your changes.

In terms of mitigation, flipping the traffic back to the previous colour is as close as you get to an “undo” action.

The switch should be as fast as a feature flag toggle, because both blue and green instances are already running and the router change should only be a configuration reload.
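The deploy, flip and undo cycle can be modelled in a few lines. This is a toy in-memory router, purely to show the mechanics; the class and method names are invented and no real load balancer is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy router: exactly one colour receives live traffic at a time."""
    versions: dict = field(default_factory=lambda: {"blue": "v1", "green": "v1"})
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # New changes only ever go to the group that is not serving traffic.
        self.versions[self.idle] = version

    def flip(self) -> None:
        # Both groups keep running, so this is a config change, not a rebuild.
        self.live = self.idle

    def serve(self) -> str:
        return self.versions[self.live]


router = Router()
router.deploy("v2")              # lands on green while blue serves traffic
router.flip()                    # release: green goes live
after_release = router.serve()
router.flip()                    # mitigation: back to the previous colour
after_rollback = router.serve()
print(after_release, "->", after_rollback)
```

Note that the rollback is the same `flip()` call as the release. The previous version is still running on the other colour, which is why the undo is cheap.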

A pitfall with Blue/Green deployment is not fully automating the process. Whether it’s the creation of environments, running the tests or changing the routing, having a manual step in any of these areas can cripple the entire release process and render the Blue/Green deploy harmful to your value delivery stream. Needless to say, when it comes to problem mitigation, a manual Blue/Green deploy is not an option.

Once this automated mechanism is in place, most likely as part of your deployment pipeline, it doesn’t require changes for each new feature, unlike feature flags. This makes it a lot cheaper in terms of dev effort and a must-have for your mitigation toolset.

The implementation of a Blue/Green mechanism depends heavily on your deployment and infrastructure. Here is an example using Kubernetes labels.
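As a rough idea of what a label-based flip can look like (this is my own sketch, not the linked example): assume both sets of pods carry an `app` label and a `colour` label, and a Service selects one of them through `spec.selector`. The label keys and the `my-app` name are assumptions. The flip is then a patch to that selector.

```python
import json


def selector_patch(app: str, colour: str) -> dict:
    """Build a patch body that points a Service at one colour's pods."""
    return {"spec": {"selector": {"app": app, "colour": colour}}}


# Hypothetical names: a Service called "my-app" currently selecting "blue".
patch = json.dumps(selector_patch("my-app", "green"))
print(patch)
```

The resulting JSON could be applied with something like `kubectl patch service my-app -p '<patch>'`. Rolling back is the same call with `"blue"`.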

Re-deploy

To achieve the desired effect, your deployment strategy must include a degree of idempotency. This means that every time your application is deployed, a fresh environment is created, whether this is achieved via containerisation or server instance manipulation.

The desired effect we are aiming for here is the power of turning it off and on again.

In context, this can be seen as a “roll forward”, while a Blue/Green deploy would be a rollback. The major difference is that the aim is to provide a rebuilt idempotent environment. This can help alleviate problems caused by built-up transient state in your application, for example, memory leaks.

The key to this mitigation strategy is to make it as fast as possible. Rebuilding the environment (and potentially the application instance) is never going to be as speedy as a flag toggle or routing config flip. The aim here is to decouple the deploy step from the main code delivery process. Whether you have a full continuous delivery pipeline or not, an effective re-deploy mitigation requires the ability to trigger it manually and in isolation, without needing external changes such as pushing code.
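A toy model of what “re-deploy” means here, with invented names throughout: the version stays the same, and only the accumulated transient state is thrown away.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    version: str
    # Transient state that builds up while running, e.g. a leaky cache.
    leaked: list = field(default_factory=list)

    def handle_request(self) -> None:
        self.leaked.append(b"x" * 1024)


def redeploy(current: Environment) -> Environment:
    """Rebuild a fresh environment from the version that is already live."""
    return Environment(version=current.version)


env = Environment("v7")
for _ in range(100):
    env.handle_request()

fresh = redeploy(env)
print(len(env.leaked), "->", len(fresh.leaked))
```

The important property is that `redeploy` takes nothing but the currently deployed version as input, so it can be triggered on its own without a new commit.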

Investigate, fix, deploy

When all else fails, there’s nothing for it but to actually debug the problem and apply a code fix. In the context of mitigating issues this seems like a defeatist statement. However, all of the above mitigations address the same type of problem…

The above mechanisms only address issues which are stateless in nature.

If you have applied a schema change to a database and it causes an outage, feature flags, Blue/Green deploys and re-deploys of your application will not help. There are strategies you can employ, like taking snapshots of your database as you deploy, but ultimately mutation of state will find a way to screw you over. The best option here is to change the way you make changes to your state. The most effective example is making non-breaking, backwards compatible changes, such as adding columns instead of changing existing ones. The same applies to managing API changes in a service-oriented architecture.
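Here is a small sketch of an additive change using SQLite from the Python standard library. The `users` table and its columns are invented for the example. The point is that code written before the migration keeps working after it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def old_insert(name: str) -> None:
    # Written before the migration; knows nothing about the new column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def old_read() -> list:
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


old_insert("ada")

# Additive, backwards compatible change: a new nullable column.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# The previous application version still works against the new schema,
# so rolling the code back does not require rolling the data back.
old_insert("grace")
rows = old_read()
print(rows)
```

This only holds because the old code names its columns explicitly. A `SELECT *` in the old version would start returning an extra field, which is the sort of thing worth checking before relying on a rollback.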

Conclusion

Now, the subtitle of this post does say “options”, but in my opinion these four mechanisms (or actions) are close to essential for the on-call engineer’s peace of mind.

It is also interesting to consider all of these together in sequence, rather than in isolation. Each has a specific set of advantages and costs. For feature flags, it is the instant ability to toggle features versus the ever-growing dev effort to manage the flags and the price of the flagging service. For Blue/Green, it is the chance to run tests against the to-be-live instance versus the cost of implementing the automated deployment, coupled with simultaneously running two versions of the application. And so on…

A common theme is that to reap the benefits of these techniques you have to implement a high degree of automation.

Another interesting exercise is to find other mechanisms which might fit on the triangle diagram: for example, something less costly than a Blue/Green deployment implementation but faster than a full re-deploy.

Consider adding these mechanisms to your services and making them clearly visible and usable in the README. The poor soul having to solve problems late at night will thank you.