AIOps as a journey — beginning

Sorin Tudor
METRO SYSTEMS Romania
Feb 13, 2020 · 6 min read

I had an idea about two years ago. We were offering Kafka as a service and saw how hard it is to debug performance issues in real time when the logs are distributed. A fresh approach would have given us the opportunity to expand our knowledge.

Unfortunately, at that time, the idea of implementing AI in our processes sounded pretty exotic, but today we are beginning this journey.

Let’s take a look at the Cloud Native Interactive Landscape that supports the digitization of businesses.

Cloud Native Interactive Landscape

What do all of these tools have in common? As we gain a deeper overview, we can observe that they are all distributed, or built to support a distributed approach.

As an example, distributed approaches are present at every level:

  • Cloud (AWS, GCP, Azure)
  • Software stacks (Kafka, Redis, Cassandra and, with the most impact in the years to come, Kubernetes)
  • Microservices

We are surrounded by data. Everything is interconnected and it will be even more so.

Let’s take a look at the current situation, and keep the future in mind:

  • 4 PB of data generated daily on Facebook
  • 28 PB estimated by 2020 from wearable devices like smart watches and bracelets — who has such a device?
  • by 2025 we will generate 463 EB of data each day, and the total amount of digital data will increase from 4.4 ZB to 44 ZB. Ten times!

Most of this data is stored and processed by systems from this landscape. And those systems add even more data to these statistics: performance metrics, lots of logs and probably even traces.

All the tooling that you saw being developed in recent years was aimed at building better, faster, more reliable systems for our needs, and we need more, since complexity keeps increasing.

Let’s take an example: Redis. “Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker”.

Like any other component, Redis can be instrumented through metrics, such as the ones below (a short sketch of how they can be collected follows the list):

  • CPU usage
  • Memory usage
  • Number of connected clients
  • Number of keys/records served by the server
  • Type of server (master or slave)
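
As a rough sketch (not necessarily how you would wire it into a monitoring pipeline), pulling these values with the redis-py client and the INFO command could look like this; the host and port are placeholders:

```python
# Minimal sketch: collect a few Redis metrics via the INFO command (redis-py).
# Assumes a Redis server is reachable at localhost:6379.
import redis

client = redis.Redis(host="localhost", port=6379)
info = client.info()  # parsed output of the INFO command, as a dict

metrics = {
    "used_cpu_sys": info["used_cpu_sys"],            # CPU time consumed by the server (system)
    "used_cpu_user": info["used_cpu_user"],          # CPU time consumed by the server (user)
    "used_memory": info["used_memory"],              # memory usage in bytes
    "connected_clients": info["connected_clients"],  # number of connected clients
    "role": info["role"],                            # master or slave (replica)
    "keys_db0": info.get("db0", {}).get("keys", 0),  # number of keys in database 0
}
print(metrics)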

We can check each metric one at a time, or even correlate a couple of them in dashboards, but there is no real aggregation at a fundamental level. This is where ML and the more modern concept of AIOps come in handy.

Making sense of the data means analyzing trends across multiple coupled metrics, or even transforming them into features with algorithms like PCA.
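
To make that a bit more concrete, here is a minimal, hedged sketch of turning a handful of coupled metrics into PCA features with scikit-learn; the data frame below is purely illustrative, not our production data:

```python
# Sketch: reduce coupled metrics to a few PCA features (illustrative data only).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data; in practice this would come from the monitoring stack (e.g. ELK).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "used_memory": rng.normal(500_000_000, 50_000_000, 200),
    "connected_clients": rng.integers(10, 200, 200),
    "used_cpu_sys": rng.normal(30.0, 5.0, 200),
    "keyspace_hits": rng.integers(1_000, 100_000, 200),
})

# Standardize first so no single metric dominates the components.
scaled = StandardScaler().fit_transform(df)

# Keep the two features that capture most of the variance in the coupled metrics.
pca = PCA(n_components=2)
features = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)
```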

DevOps experience is already vast and involves such components, but it still needs to grow a lot more when it comes to ML and AIOps. From that field, it makes sense to consider the following areas and their associated algorithms:

Forecasting: Linear regression, change detection, seasonality decomposition, and Box-Jenkins.

Clustering: Levenshtein distance, Latent Dirichlet Allocation for topic modeling, and k-means.
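
As a quick illustration of the forecasting side, a seasonality decomposition with statsmodels could look like the sketch below; the hourly series is synthetic, standing in for a real metric:

```python
# Sketch: split a metric into trend, seasonal and residual components (synthetic data).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic hourly series with a daily (24-hour) cycle plus noise.
rng = np.random.default_rng(0)
hours = pd.date_range("2020-01-01", periods=24 * 14, freq="H")
values = 50 + 10 * np.sin(2 * np.pi * np.arange(len(hours)) / 24) + rng.normal(0, 2, len(hours))
series = pd.Series(values, index=hours)

# Additive decomposition with a daily period.
result = seasonal_decompose(series, model="additive", period=24)
print(result.seasonal.head(24))  # the recovered daily pattern
```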

From all these options, we managed to play a little bit with linear regression and k-means clustering.

There is more than enough data for these two techniques, but not all of it is usable out of the box.

If we start with a simple regression, everything depends on the correlation coefficient, which tells us how well the dependent and independent variables are correlated.

Here is an example that should give us a useful case: we want to see whether the number of clients connected to a Redis broker increases the amount of consumed memory.

If we calculate the correlation coefficient, we can see that it’s pretty bad, really bad in fact, as you can also see from the plot. Normally it should be between 0.7 and 1.
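
For reference, computing such a coefficient is a one-liner once the two metrics sit in the same data frame; the numbers below are illustrative, not the actual Redis data:

```python
# Sketch: Pearson correlation between connected clients and used memory (illustrative data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
clients = rng.integers(10, 200, 500)                  # connected clients
memory = rng.normal(500_000_000, 50_000_000, 500)     # used memory in bytes, mostly independent here

df = pd.DataFrame({"connected_clients": clients, "used_memory": memory})
corr = df["connected_clients"].corr(df["used_memory"])  # Pearson by default
print(f"correlation coefficient: {corr:.3f}")  # weak: far from the 0.7-1.0 band
```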

Preprocessing techniques like standardization or normalization didn’t seem to make any difference to the final value.

A weak coefficient means it’s unlikely to solve our problem with a regression.

If you try to fit it with a polynomial regression, more precisely of degree three, this happens.

Veeery interesting but veeery over-fitted, if you can call it that.
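
For completeness, a degree-three polynomial fit along these lines can be reproduced with scikit-learn; again, the data is synthetic and only meant to show the mechanics:

```python
# Sketch: degree-3 polynomial regression on weakly correlated data (synthetic).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.integers(10, 200, 500).reshape(-1, 1).astype(float)  # connected clients
y = rng.normal(500_000_000, 50_000_000, 500)                 # used memory in bytes

# With noisy, weakly correlated data the curve chases the noise instead of a
# real trend, which is exactly what over-fitting looks like.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # R^2 stays low even though the curve bends to the data
```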

This teaches us a thing or two about the input data. First, model it for a specific case, but not in too much detail. What you see above happened because we didn’t take into consideration datacenters, node types and environments. Those are important.

Ok, we know what to search for (more or less), and we know what to expect (a regression). If you calculate the correlation coefficient on each data frame that you manage to construct from ELK, eventually you will find a very good one, but you still need to filter the data.

The first graphic shows us the correlation between the CPU time consumed by Redis and the CPU time of the system, across multiple environments.

Without filtered data

Filter it and things will “improve”, but it does not bring us any value.

With filtered data
Correlation coefficient for filtered data

Finding a dependent and an independent variable that correlate is hard, so what about using two variables as input and seeing how they relate to the output one?

Let’s add keyspace hits to the equation.

As you can probably see, it doesn’t look good. In light blue you have the actual values and in orange the predicted values. I kept the plot simple for a better overview.

How can I tell that mathematically? Take a look at the coefficient of determination.
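
Here is a hedged sketch of that two-input setup and its R² score with scikit-learn; the connected clients, keyspace hits and memory values below are made up for illustration:

```python
# Sketch: two inputs predicting one output, scored with R^2 (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(10, 200, 500),         # connected clients
    rng.integers(1_000, 100_000, 500),  # keyspace hits
]).astype(float)
y = rng.normal(500_000_000, 50_000_000, 500)  # used memory in bytes

model = LinearRegression().fit(X, y)
predicted = model.predict(X)

# R^2 close to 1 means the inputs explain the output well; close to 0 means they do not.
print(f"R^2 = {r2_score(y, predicted):.3f}")
```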

We learned the hard way that one essential concept needs to be taken into consideration in order to have better chances of successful AIOps use cases, and that is observability.

Observability is a property of a system that was designed with the following facts in mind:

• No complex system is ever fully healthy

• Distributed systems are pathologically unpredictable

• It’s impossible to predict all the partial failures into which the system may fall

• Embrace failure at every stage

• Ease of debugging is a priority

And if it doesn’t always work with a regression, and it shouldn’t, what can we use? Maybe we need to take a look at other algorithms as well. K-means?

My dataset contains three columns, but let’s keep it simple and try to cluster the data using just CPU usage and memory usage.

This is another case, not related to Redis services, that uses only OS-level metrics.

You can easily see that it created two different clusters from the given data.
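
For the curious, a minimal k-means sketch over CPU and memory usage with scikit-learn looks like this; the two synthetic “regimes” below only stand in for real OS-level samples:

```python
# Sketch: cluster OS-level CPU and memory usage into two groups (synthetic data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic regimes of behaviour: idle and busy nodes.
idle = np.column_stack([rng.normal(10, 3, 200), rng.normal(30, 5, 200)])
busy = np.column_stack([rng.normal(70, 8, 200), rng.normal(80, 6, 200)])
X = np.vstack([idle, busy])  # columns: cpu_usage_percent, memory_usage_percent

scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

print(kmeans.cluster_centers_)       # one centre per behaviour regime
print(np.bincount(kmeans.labels_))   # how many points fell in each cluster
```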

In time, by gaining a deeper knowledge of what these clusters really mean, and by coupling them with other performance metrics from the application, we can use them to predict anomalies. That is the direction we are working in at the moment.

Unfortunately, the number of cases to be analyzed is overwhelming, and none of us can truly comprehend the amount of work that needs to be invested in this topic.

That being said, I invite you to join us on this daring road that we decided to take. It will be hard, but it should also be fun and rewarding, and there is no better way than learning from each other.

Sorin Tudor
METRO SYSTEMS Romania

DevOps Engineer, Technical blogger (log-it.tech) and amateur photographer.