Share the Responsibility: Against Automagical Solutions

AbdulRahman AlHamali
Mar 8, 2022


Let’s imagine a scenario together:

You have been assigned the task of implementing world-class distributed tracing for your microservices. You research the market and find quite a few options (Dynatrace, Datadog, etc.), but whichever one you choose, a library will need to be installed in every microservice on the cluster in order to collect those traces.

So, how do you do that? The hard part is far from technical; the technical side is already covered by the library documentation. The real question is how to achieve it:

  • With minimum interruption to the micro-teams' workflows
  • While respecting the change management requirements of those teams
  • Without having to manually submit 300 pull requests, one per service, every time the library requires an update
  • Without having to remember to instrument new services that are created later

These are some of the concerns to weigh when making such a decision. No silver bullet satisfies all of them perfectly, but going with a solution that favours some of them while fully disregarding the others is a big no-no…

For example, say that you decide to inject this library using a mutating webhook in Kubernetes. That is, whenever a pod is deployed into the cluster, your webhook mutates it and injects the library into the pod.

This is a great solution at first sight; it causes zero interruptions to the teams, and whenever the library needs to be updated, you can take care of that in a centralized way without needing company-wide communication.
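To make the mechanism concrete, here is a minimal sketch of the mutation step of such a webhook. It assumes the library is delivered through an extra init container, which is one common pattern but not necessarily how any given vendor does it. The image name and the function names are hypothetical; only the shape of the `AdmissionReview` response (a base64-encoded JSONPatch) follows the Kubernetes admission API.

```python
import base64
import json

# Hypothetical agent image, used only for illustration.
AGENT_IMAGE = "registry.example.com/tracing-agent:1.4.0"


def build_injection_patch(pod: dict) -> list:
    """Return JSONPatch operations that add a tracing init container to the pod."""
    init_container = {"name": "tracing-agent", "image": AGENT_IMAGE}
    if pod.get("spec", {}).get("initContainers"):
        # The list already has entries: append to it.
        return [{"op": "add", "path": "/spec/initContainers/-", "value": init_container}]
    # The list is missing or empty: create it with our container.
    return [{"op": "add", "path": "/spec/initContainers", "value": [init_container]}]


def mutate(admission_review: dict) -> dict:
    """Build the AdmissionReview response for an incoming pod, patching every pod."""
    request = admission_review["request"]
    patch = build_injection_patch(request["object"])
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # must echo the uid of the request
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Note that nothing in this sketch asks the owning team anything: every pod that reaches the webhook gets patched, and changing `AGENT_IMAGE` changes what runs inside every service on its next deployment.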

But…

This solution does not respect the change management of the teams. What if:

  • The service sits in a special area that requires special scans for all of its libraries, and here you come, injecting a library that has only been through the standard scans.
  • The service has used up its error budget and is not allowed to deploy new updates, and here you come, updating a library in it to a version that has an unexpected memory leak.
  • The service communicates with an external endpoint that only accepts specific HTTP headers, and here comes your library, adding extra headers (pretty standard in the distributed tracing world) and breaking communication with that endpoint.

The mutating webhook solution is nice. It is neat, hands-free, and magical. But that is a drawback as much as it is a benefit, because it does not take the service teams' autonomy into account.

And the solution to the problems above is simple: Share the responsibility…

Share the responsibility

One of the biggest problems in the DevOps Enablement world is that we try to make everything magical for the micro-teams. But this usually ends with them incrementally losing autonomy, or more precisely transferring it to the DevOps Enablement teams, who end up with more autonomy-per-capita than their capitas can handle.

The whole mutating webhook solution could be implemented in exactly the same way, but with opt-in flags. Do you want to opt in to this library injection? Or would you rather add it yourself? Do you want to opt in to automatic updates? Or would you rather set a static version and update it when you see fit?
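One way these opt-in flags could look is as pod annotations that the webhook reads before doing anything. The sketch below is only an illustration: the annotation keys, the registry path, the default version, and the function name are all hypothetical, and a real setup might use labels or namespace-level settings instead.

```python
from typing import Optional

# Hypothetical annotation keys that a service team sets on its own pods.
INJECT_ANNOTATION = "tracing.example.com/inject"
VERSION_ANNOTATION = "tracing.example.com/agent-version"

# Hypothetical defaults managed centrally by the enablement team.
AGENT_REPOSITORY = "registry.example.com/tracing-agent"
DEFAULT_AGENT_VERSION = "1.4.0"


def agent_image_for(pod: dict) -> Optional[str]:
    """Return the agent image to inject, or None if the team did not opt in."""
    annotations = pod.get("metadata", {}).get("annotations") or {}

    # No opt-in: leave the pod untouched; the team adds the library itself.
    if annotations.get(INJECT_ANNOTATION) != "true":
        return None

    # Opted in with a pinned version: the team decides when to update.
    # Opted in without one: the team accepts centrally managed updates.
    version = annotations.get(VERSION_ANNOTATION, DEFAULT_AGENT_VERSION)
    return f"{AGENT_REPOSITORY}:{version}"
```

With this in place, a pod annotated only with `tracing.example.com/inject: "true"` follows the central default, a pod that also sets `tracing.example.com/agent-version` stays on the version its team chose, and a pod with neither is never modified.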

This is slower, requires some interruption to the teams, and might need more communication when urgent updates are needed, but it balances the different concerns above instead of maximizing some of them while neglecting the others.
