Micro operations — A new operations model for the micro services age

Alois Reitbauer
keptn
Apr 17, 2020 · 10 min read


tl;dr Just as we now build microservices applications, which consist of many individual services that are shipped frequently, modern operations teams need to move away from complex manual operations that affect entire application environments. “Micro operations” enable you to ship dedicated sets of operations instructions along with all delivery artifacts. Ideally, these instructions are machine-readable and can be executed automatically.

Microservices and cloud-native architectures have changed the way we build and deliver software. The next step in this evolution will be a change in how we operate software. The biggest focus over the last two years has been on moving delivery best practices toward continuous and progressive delivery. While we’re still in the early days, our industry has started to develop a common understanding of delivery best practices.

On the operations side, however, many companies still rely on best practices that were developed for running multi-tier applications. These practices depend on a number of assumptions that aren’t applicable in modern architectures.

Outdated operations assumptions

  • Operations processes don’t change frequently.
  • There’s only one version of an application in production.
  • The root cause of any problem likely lies in a small number of system components.
  • Remediation and operations workflows are difficult and time-consuming to test.

The impact of frequent changes

Cloud-native applications are usually released at a higher frequency than traditional applications. Additionally, individual components are released independently. This leads to a rate of change that is far too high for traditional operations workflows.

Let’s look at a simple example (see diagram below), a traditional three-tier application where all components are released as a single batch once a quarter. With a cloud-native approach, these components are broken down into about 20 microservices (in reality, the number might be much higher). Each component is now released independently each week (which is still a fairly slow release cadence in the cloud-native world). Now, instead of one release, there are 80 releases (20 services × 4 releases).

Most traditional operations workflows aren’t built to support such rapid releases. This leads either to outdated release cycles or to massively overwhelmed operations teams. What makes this problem even worse is that many developers don’t have access to the tooling they need to model their operational workflows. This makes aligning releases of remediation actions and delivery artifacts even more difficult.

Increase in release frequency and number of artifacts

Operations with multiple versions in production

In the cloud-native world, running multiple versions of the same software is a common practice. Your organization might run two versions deployed in a blue/green mode or possibly run a new canary release in addition to a stable release version. In more advanced cases you might also use feature flags to enable specific functionality for specific user groups.

Traditional operations processes assume that there is only one version of an application running in production and that there is only one version of remediation and operations procedures. In cloud-native environments, remediation is fundamentally different. The first step is to identify which version (or feature flag combination) is having issues. Then the proper resolution has to be identified:

  • If this is an ongoing release, the release will most likely need to be rolled back.
  • If this relates to a specific feature, the feature flag will need to be disabled.
  • If the issue is related to a spike in load, the service may need to be scaled up or a circuit breaker may need to be activated.

However, to make the situation even more complex, there are often permutations of the above situations. For example, a problem might be with a release for which you only have the ability to roll back one specific feature.
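To make this more concrete, here is a minimal sketch, in Go, of the kind of decision logic an operations component would have to encode: it maps an identified problem to one of the remediation actions listed above. The types, field names, and action strings are hypothetical and only illustrate the branching, not the API of any specific tool.

package remediation

// Problem describes an identified issue, including which release or
// feature flag it relates to. All fields are hypothetical.
type Problem struct {
    Kind        string // e.g. "ongoing-release", "feature", "load-spike"
    Service     string
    FeatureFlag string
}

// Action is a remediation to execute for a given problem.
type Action struct {
    Type   string // e.g. "rollback", "toggle-off", "scale-up"
    Target string
}

// ChooseRemediation maps a problem to a remediation action. Real incidents
// are often permutations of these branches, which is why encoding them
// imperatively for a whole application quickly becomes unmanageable.
func ChooseRemediation(p Problem) Action {
    switch p.Kind {
    case "ongoing-release":
        return Action{Type: "rollback", Target: p.Service}
    case "feature":
        return Action{Type: "toggle-off", Target: p.FeatureFlag}
    case "load-spike":
        return Action{Type: "scale-up", Target: p.Service}
    default:
        return Action{Type: "escalate", Target: p.Service}
    }
}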

Cloud-native execution models are more dynamic than traditional execution models

Encoding these complex situations in traditional workflow-based operations automation tools is almost impossible. Not to mention, the resulting workflows are complicated and difficult to maintain. The result is similar to that of delivery pipeline automation challenges in microservices environments. For more details, see my earlier blog post, How your delivery pipelines become your next legacy code.

Complex root cause detection and dynamism

In the previous section, I gave you a glimpse of the complexity of operations tasks in cloud-native environments. In multi-tier environments you have only a small number of components that can be configured, and the effects of changes are largely predictable.

In modern microservices environments, however, you’re dealing with systems that can expose unpredictable behavior due to the high number of interdependencies. Changing the configuration of one component might have an impact on a totally different part of the system.

You also need to understand that problems aren’t static. When people talk about problems in software systems, they often talk only about one specific point in time. The reality is different: problems evolve, and their nature and impact can change drastically over time. Remediation actions can also affect how a problem manifests. Assume, for example, that you’re scaling up a service to cope with additional load. While this may solve your immediate problem, it may also result in an increase in traffic to other services, which in turn become the root cause of a new problem.

Encoding these complex workflows with a traditional script-based operations tool is impossible. Even if you tried, you would likely end up with a script that’s more complex than the actual application logic, and you’d have a maintenance nightmare on your hands.

Dynamics of problem evolution in large scale environments

Testing operations workflows

Test-driven development has become a best practice in modern software development. This, however, is usually not the case for operations automation code. One obvious reason is that operations workflows are still frequently written down only as runbooks on wiki pages. Testing these instructions is time-consuming because every step is manual. There’s also often no dedicated environment available for testing operational procedures.

If runbooks are already available as scripts, they can be tested more easily. In most cases, however, testing only verifies that the runbook executes properly (for example, that servers are scaled up) but not whether the remediation actually solves the problem it was designed to solve. One reason for this might be that the problematic scenarios to be resolved are usually harder to recreate than the environments in which they’re tested.

While modern concepts like chaos engineering exist for this very purpose, they aren’t yet a standard part of the software delivery toolset for verifying operations automation.

Micro operations — The core principles

Now let’s look at the concept of micro operations. The term micro operations sums up an approach for addressing the problems described above and for up-leveling operations automation for the modern cloud-native world.

Declarative operations as code

The first and foremost principle is that in a modern cloud-native environment everything must be machine-readable. This means that operations automation must be written in such a way that it can be interpreted by an automation component.

As always, you have two approaches to this challenge. You can follow an imperative approach and write scripts for operations logic execution. This comes with the drawbacks of poor reusability and high effort, because a script must be created and maintained for every problem that a service might encounter.

The alternative is to use a declarative approach, which only defines what needs to be done and leaves all the details to other components. Eventually you would abstract the functionality into functions anyway, so why not take this approach from the beginning? This approach also follows the operator pattern that’s used prominently in Kubernetes.

Below is an example of a declarative remediation file as used in Keptn. The file defines two situations and the respective remediation actions. In the case of a response time degradation, new server instances are scaled up; in the case of an increase in failure rate, a feature is disabled.

remediations:
- name: "Response time degradation"
  actions:
  - action: scaling
    value: +1
- name: "Failure rate increase"
  actions:
  - action: featuretoggle
    value: EnablePromotion:off

These remediation actions are defined by the developer for all artifacts (i.e., container images) that are created. The operations instructions become additional metadata for each artifact.

With a declarative approach, there’s no need to worry about the actual execution details. Developers can leave those details to the platform engineering teams while leveraging the functionality they provide.

Principle — Atomic building blocks

As mentioned above, the processes that need to be executed for remediation in cloud-native environments are hard to capture in static automation scripts, and such scripts are too complex to manage.

Micro operations are called “micro” because they define operational procedures not for an entire application but rather for individual microservices. Declarative operations procedures are written on a per-microservice basis. This provides you with atomic actions, which you can select and combine as needed. With this approach, we’re creating a de facto microservice architecture that follows the law:

Your operations model and procedures should follow your application architecture and deployment strategy.

Now that we have these atomic building blocks, we need to figure out two things:

  • How do we identify which component needs to be fixed?
  • How do we orchestrate multiple actions that need to be executed?

The next principles will help us answer these questions.

Monitoring, tracing, and AIOps

Understanding which components are broken requires good monitoring and anomaly detection for each individual service, as well as a deep understanding of the different releases.

To understand the core telemetry values of a service, all we need are the underlying infrastructure metrics. Anomaly detection on top of these infrastructure metrics allows us to understand whether a component is working properly or not. Open source tools like Prometheus can provide these metrics.
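As a rough illustration of the anomaly detection that sits on top of such metrics, the sketch below (in Go, not tied to any particular monitoring tool) flags a metric as anomalous when the latest sample deviates from the recent mean by more than three standard deviations. The window size and the three-sigma threshold are arbitrary assumptions.

package anomaly

import "math"

// IsAnomalous reports whether the latest sample deviates from the mean of
// the preceding window (e.g. response times scraped from Prometheus) by
// more than three standard deviations.
func IsAnomalous(window []float64, latest float64) bool {
    if len(window) < 2 {
        return false
    }
    var sum float64
    for _, v := range window {
        sum += v
    }
    mean := sum / float64(len(window))

    var variance float64
    for _, v := range window {
        variance += (v - mean) * (v - mean)
    }
    stddev := math.Sqrt(variance / float64(len(window)))

    if stddev == 0 {
        return latest != mean
    }
    return math.Abs(latest-mean) > 3*stddev
}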

However, just having these metrics won’t do the trick. While we can find out which components aren’t working, we lack insight into which component is the actual root cause and which component is simply impacted.

Let’s illustrate this with a simple example. Say we have three components: A, B, and C. These components form a call chain where A calls B and B calls C. When C is having issues, we see that A and B have issues as well. If we looked only at monitoring data, however, it wouldn’t be clear that we need to fix only service C and that the other two services will recover on their own. Because we receive alerts for all three services, we need more information.

Service dependencies resulting in “ripple effect” failures

This is where tracing comes in. With distributed traces we can create the full dependency chain of services. Tools like Jaeger or Kiali can provide this information.

Service flow based on distributed traces

In the next step we need to combine the information and analyze the causality chain we discovered via traces. The result will show that service C is the one that needs to be fixed.
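Here is a minimal sketch of that combination step, assuming we have already extracted the call dependencies from traces and the set of alerting services from monitoring: the root-cause candidates are the alerting services whose downstream dependencies are all healthy. For the A, B, C example above, only C is returned.

package rootcause

// RootCauses combines monitoring and tracing data: calls maps each service
// to the services it calls (derived from distributed traces), and alerting
// is the set of services currently raising alerts. A service is a root-cause
// candidate if it alerts while none of its direct dependencies do.
func RootCauses(calls map[string][]string, alerting map[string]bool) []string {
    var causes []string
    for svc := range alerting {
        if !alerting[svc] {
            continue
        }
        dependencyAlerting := false
        for _, dep := range calls[svc] {
            if alerting[dep] {
                dependencyAlerting = true
                break
            }
        }
        if !dependencyAlerting {
            causes = append(causes, svc)
        }
    }
    return causes
}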

Event-driven choreography

Now we need to execute the proper action based on the identified root cause. We have two choices for implementing this:

  • Using orchestration, with a central process that is executed and that calls services as needed.
  • Using choreography without a central process where individual services subscribe to and take action on messages.

Orchestration has the same limitations as defining a centralized operations process. Therefore, a choreography approach is much better suited to this problem. We now need to agree on a set of well-defined messages and a mechanism for subscribing to events.

Luckily we’ve already defined our subscription mechanism. Our remediation-as-code files have all the information available for subscription. The service name is implicitly available as it can be derived from the shipped service artifact.

As part of the Keptn specification, we’ve also defined a message that represents the problem, including the analyzed root-cause information:

{
  "type": "sh.keptn.event.problem.open",
  "specversion": "0.2",
  "source": "https://github.com/keptn/keptn/prometheus-service",
  "id": "f2b878d3-03c0-4e8f-bc3f-454bc1b3d79d",
  "time": "2019-06-07T07:02:15.64489Z",
  "contenttype": "application/json",
  "shkeptncontext": "08735340-6f9e-4b32-97ff-3b6c292bc509",
  "data": {
    "ImpactedEntity": "carts-primary",
    "PID": "93a5-3fas-a09d-8ckf",
    "ProblemDetails": "Pod name",
    "ProblemID": "762",
    "ProblemTitle": "cpu_usage_sockshop_carts",
    "State": "OPEN",
    "project": "sockshop",
    "stage": "production",
    "service": "service"
  }
}

Usually, there’s a one-to-one match between messages and subscriptions. Individual services should not have multiple subscriptions for the same problem as this would make it difficult to decide which actions to execute. This problem can be solved using priorities. However, in most cases, this makes the process more complex without adding sufficient value.
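To illustrate the choreography side, here is a minimal sketch of a service that subscribes to problem events like the one above, assuming the event arrives as an HTTP POST carrying the JSON payload shown. In a real Keptn setup the control plane handles routing and you would use a CloudEvents SDK; the endpoint, the struct fields, and the hard-coded action mapping are simplifications for illustration (the mapping would normally come from the remediation file shipped with the artifact).

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// problemEvent mirrors the parts of the sh.keptn.event.problem.open payload
// shown above that are needed for dispatching; other fields are omitted.
type problemEvent struct {
    Type string `json:"type"`
    Data struct {
        ProblemTitle string `json:"ProblemTitle"`
        State        string `json:"State"`
        Service      string `json:"service"`
        Stage        string `json:"stage"`
    } `json:"data"`
}

func handleEvent(w http.ResponseWriter, r *http.Request) {
    var event problemEvent
    if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
        http.Error(w, "invalid event", http.StatusBadRequest)
        return
    }
    if event.Type != "sh.keptn.event.problem.open" || event.Data.State != "OPEN" {
        w.WriteHeader(http.StatusOK) // not an event this service acts on
        return
    }

    // Dispatch the remediation action defined for the detected problem.
    switch event.Data.ProblemTitle {
    case "Response time degradation":
        log.Printf("scaling up %s in stage %s", event.Data.Service, event.Data.Stage)
    case "Failure rate increase":
        log.Printf("disabling feature EnablePromotion for %s", event.Data.Service)
    default:
        log.Printf("no remediation defined for %q", event.Data.ProblemTitle)
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/events", handleEvent)
    log.Fatal(http.ListenAndServe(":8080", nil))
}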

Test-driven automation

The last principle of micro operations is test-driven deployment of operations automation. As mentioned above, testing operations automation is typically a manual process that’s often performed only to validate that an operations process runs, rather than to check whether it actually solves the problem.

In a micro operations world we can easily automate the testing process for operations automation. Chaos engineering is an ideal approach. We can look at the remediation definition and then create the proper chaos testing experiment to trigger remediation actions and validate whether they work properly.

If the validation of remediation actions doesn’t work, we can automatically stop releases from being pushed into production.
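Here is a rough sketch of what such an automated gate could look like, with hypothetical hooks for injecting the failure and evaluating the service-level objectives afterwards; the point is the shape of the loop (derive an experiment from each remediation definition, run it, gate the release on the outcome), not any specific chaos tooling.

package validation

import "fmt"

// Remediation mirrors one entry of a remediation file: a problem name and
// the action that is supposed to resolve it.
type Remediation struct {
    Problem string
    Action  string
}

// injectFailure and sloHealthy are hypothetical hooks: the first runs a
// chaos experiment matching the problem (e.g. adding latency or errors),
// the second evaluates the service's SLOs after remediation has run.
type injectFailure func(problem string) error
type sloHealthy func(service string) (bool, error)

// ValidateRemediations runs one chaos experiment per remediation definition
// and returns an error if a remediation did not bring the service back to a
// healthy state. A non-nil result would stop the release from being promoted.
func ValidateRemediations(service string, remediations []Remediation, inject injectFailure, healthy sloHealthy) error {
    for _, r := range remediations {
        if err := inject(r.Problem); err != nil {
            return fmt.Errorf("injecting %q: %w", r.Problem, err)
        }
        ok, err := healthy(service)
        if err != nil {
            return fmt.Errorf("evaluating SLOs for %s: %w", service, err)
        }
        if !ok {
            return fmt.Errorf("remediation %q did not resolve %q", r.Action, r.Problem)
        }
    }
    return nil
}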

Keptn as a micro operations environment

All of the concepts described above are more than just an idea. With Keptn, we’ve built a working version of a control plane for Kubernetes that supports the concept of micro operations. Below is a recorded video example of an operations workflow based on the concepts described in this article.
