AIOps: Simple Anomaly Detection in Kubernetes with Active-Monitor and TensorFlow

Published in keikoproj · Jan 4, 2021

Written by Ravi Hari and Shrinand Javadekar

Active-Monitor provides the HealthCheck Kubernetes CRD for deep cluster monitoring and self-healing using Argo workflows. It offers ways to perform health checks on the various components involved in an infrastructure setup: periodically testing DNS resolution, verifying IAM capabilities, checking access to external cloud resources such as databases, validating SSL certificates, and so on.

These health checks are very good at catching failures in features and components that are *not* commonly exercised. However, there is a whole other kind of monitoring that Active-Monitor enables: measuring the time a HealthCheck takes to complete. In these cases the component being tested may still be working, but in a degraded state that will eventually lead to a breakdown.

This blog shows one way of identifying such problems using Active-Monitor and anomaly detection with TensorFlow.

The basic premise of the idea is as follows:

  • Create an Active-Monitor HealthCheck that exercises some functionality in the Kubernetes cluster.
  • The HealthCheck should measure the time it takes to complete.
  • Every time the test completes, the measured time should be run through a machine learning model to check whether it is anomalous.
  • If the time taken is anomalous, the HealthCheck should fail.

First, let’s create a simple HealthCheck:
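The original post embedded the manifest as a gist. Below is a minimal, illustrative sketch of what such a HealthCheck could look like; the field names follow the examples in the Active-Monitor repo, while the namespace, service account, images, and timing script are assumptions.

```yaml
# Illustrative sketch only -- field names may differ between CRD versions.
apiVersion: activemonitor.keikoproj.io/v1alpha1
kind: HealthCheck
metadata:
  generateName: control-plane-check-
  namespace: health
spec:
  repeatAfterSec: 300                  # run the check every 5 minutes
  description: "Create a pod, exec a curl inside it, delete the pod"
  workflow:
    generateName: control-plane-check-
    resource:
      namespace: health
      serviceAccount: activemonitor-controller-sa
      source:
        inline: |
          apiVersion: argoproj.io/v1alpha1
          kind: Workflow
          spec:
            entrypoint: control-plane-check
            templates:
            - name: control-plane-check
              container:
                image: bitnami/kubectl:latest
                command: [sh, -c]
                args:
                - |
                  set -e
                  start=$(date +%s)
                  kubectl run am-test --image=curlimages/curl --restart=Never \
                    --command -- sleep 300 >/dev/null
                  kubectl wait --for=condition=Ready pod/am-test --timeout=120s >/dev/null
                  kubectl exec am-test -- curl -sk -o /dev/null https://kubernetes.default.svc
                  kubectl delete pod am-test --wait=false >/dev/null
                  # print the elapsed time so a later step can pick it up
                  echo $(( $(date +%s) - start ))
```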

This HealthCheck does a basic test of the Kubernetes control plane by:

  • Creating a pod
  • Exec'ing into the pod and running a curl command
  • Deleting the pod

Now, let’s create a super simple machine learning model with TensorFlow and Keras. The example was inspired by a tweet from the prolific Pratham Prasoon:
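The original post embedded the training code as a gist. A minimal sketch in the spirit of the classic single-neuron Keras example is shown below; the ten runtime values and the shared-storage paths are purely illustrative.

```python
# A minimal sketch of the training step. The runtime values and the
# shared-storage paths are illustrative, not real measurements.
import numpy as np
from tensorflow import keras

# Run indices (1..10) and the seconds each of the last 10 HealthCheck runs took.
xs = np.arange(1, 11, dtype=float).reshape(-1, 1)
ys = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.0])

# A single-neuron linear model that predicts the runtime from the run index.
model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer="sgd", loss="mean_squared_error")
model.fit(xs, ys, epochs=2000, verbose=0)

# Save the model and the standard deviation of the observed runtimes to shared
# storage so the anomaly-detection step can read them back.
model.save("/shared/healthcheck_model.h5")
np.save("/shared/healthcheck_stddev.npy", ys.std())
```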

The code snippet above trains a model with a fixed set of 10 values. These values were the times taken by 10 previous runs of the HealthCheck. The model and the standard deviation of those values are saved to some shared storage. These files get accessed in the next step for anomaly detection.

This last step of the HealthCheck basically does the following:

  • It uses the trained model to get the next “predicted” value in the sequence. So, as per the machine learning model, the actual time taken for the HealthCheck to run should be close to this value.
  • Therefore, the actual value of the time taken is compared with this predicted value.
  • To provide some tolerance, the actual value only needs to be within two standard deviations of the predicted value.
  • If the actual value is not within those two standard deviations, it is declared anomalous.
  • In that case the HealthCheck can be failed, as in the sketch after this list.
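The original post embedded this step as a gist too. A sketch of what it could do is below; the file paths are assumptions, and for simplicity the "next" run index is hard-coded to 11 (one past the ten training samples), where a real setup would track the index in shared storage.

```python
# detect_anomaly.py -- a sketch of the anomaly-detection step.
# File paths and the hard-coded run index are assumptions.
import sys
import numpy as np
from tensorflow import keras

actual = float(sys.argv[1])                 # seconds the HealthCheck just took

model = keras.models.load_model("/shared/healthcheck_model.h5")
stddev = float(np.load("/shared/healthcheck_stddev.npy"))

# Predicted runtime for the next run in the sequence, according to the model.
predicted = float(model.predict(np.array([[11.0]]), verbose=0)[0][0])

# Tolerate anything within two standard deviations of the prediction.
if abs(actual - predicted) > 2 * stddev:
    print(f"anomalous: took {actual:.1f}s, expected ~{predicted:.1f}s +/- {2 * stddev:.1f}s")
    sys.exit(1)                             # a non-zero exit fails the workflow step
print(f"ok: took {actual:.1f}s, expected ~{predicted:.1f}s")
```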

The complete HealthCheck can look like this:
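The gist with the full manifest is not reproduced here; a condensed, illustrative sketch of its shape is below. It chains the timed check from the first sketch with the anomaly-detection script as a second Argo workflow step, passing the elapsed time between them. The images, paths, and the way the script is mounted are assumptions.

```yaml
# Illustrative sketch: the timed check followed by the anomaly-detection step.
apiVersion: activemonitor.keikoproj.io/v1alpha1
kind: HealthCheck
metadata:
  generateName: control-plane-anomaly-check-
  namespace: health
spec:
  repeatAfterSec: 300
  description: "Control-plane check whose runtime is screened for anomalies"
  workflow:
    generateName: control-plane-anomaly-check-
    resource:
      namespace: health
      serviceAccount: activemonitor-controller-sa
      source:
        inline: |
          apiVersion: argoproj.io/v1alpha1
          kind: Workflow
          spec:
            entrypoint: check-and-detect
            templates:
            - name: check-and-detect
              steps:
              - - name: timed-check
                  template: timed-check
              - - name: detect-anomaly
                  template: detect-anomaly
                  arguments:
                    parameters:
                    - name: elapsed
                      # stdout of the previous step, i.e. the elapsed seconds
                      value: "{{steps.timed-check.outputs.result}}"
            - name: timed-check
              container:
                image: bitnami/kubectl:latest
                command: [sh, -c]
                # same create/exec/delete script as the first sketch,
                # ending with: echo $(( $(date +%s) - start ))
                args: ["..."]
            - name: detect-anomaly
              inputs:
                parameters:
                - name: elapsed
              container:
                image: tensorflow/tensorflow:latest
                # detect_anomaly.py is the script sketched above, made available
                # in the image (e.g., via a ConfigMap mount) -- an assumption
                command: [python3, /scripts/detect_anomaly.py]
                args: ["{{inputs.parameters.elapsed}}"]
```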

Possibilities and extensions:

The above example is a great starting point and lends itself to lots of interesting use cases and extensions.

  • The model above was trained with a fixed set of values. Instead, query Prometheus every hour for fresh values and retrain the model (see the sketch after this list). This will help keep the model up to date with changes in cluster configurations. Active-Monitor already exports the healthcheck_runtime_seconds metric for existing HealthChecks, which could be used for exactly this purpose.
  • Modify other parameters of the model, such as the number of values used for training, the number of epochs, etc., for even finer and more accurate results.
  • Use a similar setup for other use cases such as time required for DNS resolution, HTTP request latency, CPU/memory usage, etc.
  • Another interesting use case could be to simply use the anomaly-detection step for non-Kubernetes metrics such as cloud costs, error counts reported by services, etc.
  • On the machine learning side, there are more sophisticated options for this kind of analysis, for example models built for metrics that follow a sinusoidal pattern.
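For the first extension, the hourly retraining loop could be as small as the sketch below. The Prometheus URL, the label used to select the HealthCheck, and the storage paths are assumptions; only the healthcheck_runtime_seconds metric name comes from Active-Monitor.

```python
# A sketch of periodic retraining from Prometheus. The Prometheus URL, label
# selector, and storage paths are assumptions.
import numpy as np
import requests
from tensorflow import keras

PROM_URL = "http://prometheus.monitoring:9090"   # assumption

def fetch_recent_runtimes(healthcheck_name: str, window: str = "1h") -> np.ndarray:
    """Return the runtimes recorded for a HealthCheck over the given window."""
    query = f'healthcheck_runtime_seconds{{healthcheck_name="{healthcheck_name}"}}[{window}]'
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Each matching series carries a list of [timestamp, value] pairs.
    return np.array([float(v) for series in result for _, v in series["values"]])

def retrain(runtimes: np.ndarray) -> None:
    recent = runtimes[-10:]                      # keep the 10-sample window used above
    xs = np.arange(1, len(recent) + 1, dtype=float).reshape(-1, 1)
    model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
    model.compile(optimizer="sgd", loss="mean_squared_error")
    model.fit(xs, recent, epochs=2000, verbose=0)
    model.save("/shared/healthcheck_model.h5")
    np.save("/shared/healthcheck_stddev.npy", recent.std())

if __name__ == "__main__":
    runtimes = fetch_recent_runtimes("control-plane-check")
    if len(runtimes) >= 10:                      # only retrain with enough samples
        retrain(runtimes)
```

This could run as an hourly Kubernetes CronJob so the model keeps tracking the cluster's current behavior.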

Conclusion:

Using static checks for monitoring, where the time taken for an operation is compared against a fixed value, can be error-prone and inflexible. Simple machine learning models can be much more useful for identifying unexpected changes in the behavior of systems while eliminating false positives. Combined with Active-Monitor, this becomes an extremely powerful way of identifying degraded components quickly. That, in turn, can reduce the mean time to detect (MTTD) failures and, by fixing problems proactively, the mean time to recover (MTTR).
