Never fail twice

Jul 11, 2018

To paraphrase H.S. Thompson’s Fear and Loathing in Las Vegas: ‘We had two virtual machines, seventy-five sites, thousands of metrics and machines to monitor, a bunch of Python scripts, one database and one message queue, InfluxDB, and a whole galaxy of multi-colored libraries… and also pandas, NumPy, Dash, Flask, SQLAlchemy. Not that we needed all that for creating a monitoring system, but once you get locked into serious component collection, the tendency is to push it as far as you can.’

Failing twice is bad enough, but failing at failure detection is even worse.

Monitoring distributed systems is not a trivial task, and there are many non-obvious obstacles in the way. There are also many existing solutions for the various monitoring tasks. In this article, I am going to explain why an open source solution did not suit Playtech, what our analysis of successful and not-so-successful monitoring projects showed us, and, last but not least, why we decided to build Yet Another Alert System.

What is system monitoring?

In general, we can divide monitoring tasks into two categories:

Low-level monitoring: the monitoring of infrastructure, CPU, disk and memory usage, networking, garbage collection cycles, etc. In this field, we can find many good working solutions. In most cases, applying some simple thresholds is enough. In the case of Java virtual machines, we can count garbage collection runs by performing peak analysis and then make predictions about memory usage and possible leaks. However, low-level monitoring cannot adequately reflect business logic processes because of the great number of interdependencies between components and services.
High-level monitoring: the monitored metrics are business indicators or key performance indicators (for example, user sessions, payments, transaction amounts, logins, etc.). This makes it possible to indirectly monitor complex system behaviour, especially in the case of distributed systems.

Why is it difficult?

1. Various problems may lead to non-obvious system behaviour.

2. Various metrics may have different correlations in time and space.

3. Monitoring a complex application is a significant engineering endeavor in and of itself.

4. There is a mix of different measurements and metrics.

System monitoring in Playtech

My name is Alexandr Tavgen and I work as a Software Architect at Playtech, where we have been facing a lot of issues with the early detection of outages and other problems. We handle financial transactions in strictly regulated markets, so the architecture of our system is service-oriented and its components have very complex business logic and states. There are also different variations of our back end.

In our case, low-level monitoring and various tests do not address a large set of specific issues:

Due to the high complexity of our system and the huge number of adjustable settings it has, activating the wrong settings could in some situations lead to the degradation of the main KPIs or to hidden bugs affecting the overall functioning of the system.
There are a lot of 3rd-party integrations in different countries. If they encounter problems on their side, those problems will inevitably be reflected in our functionality. Such problems cannot be caught by low-level monitoring. Any workable solution should somehow involve the monitoring of key performance indicators.

At Playtech, we have been using a commercial solution from Hewlett-Packard called Service Health Analyzer (SHA), which has been far from ideal.

According to Hewlett-Packard, ‘SHA can take data streams from multiple sources and apply advanced, predictive algorithms in order to alert to and diagnose potential problems before they occur.’ In reality, it is a black box. It is impossible to change its settings or tune it. If problems arise, you need to contact Hewlett-Packard and wait months for a solution, and often the solution does not work as required. When we used SHA, it yielded a huge number of false positives and, what’s worse, false negatives. This meant that some problems were not detected or were caught much too late, which led to losses for everybody.

The diagram below shows that SHA’s granularity is 15 minutes, which means that even if SHA detects a problem, the minimal reaction time is 15 minutes.

So we often ended up standing around and waiting like this:

Standing on the shoulders of giants

Many companies have built their own solutions for monitoring their systems. Not all of them have been success stories.

Etsy — the Kale System

Etsy is a large online marketplace of handmade goods, with headquarters in New York. Their engineering team collected more than 250,000 different metrics from their servers and tried to find anomalies using complex math approaches.

Meet Kale.

One of the problems with Kale was that the system was built using different stacks and frameworks. In the picture, you can see 4 different stacks plus 2 frameworks. As a result, highly qualified engineers with experience in working with different languages and frameworks were required for maintaining and developing this system as well as for fixing any bugs that were found.

Their approach to monitoring was also problematic. Kale searches for anomalies or outliers. However, if you have 250,000 metrics, every heartbeat of the system will inevitably result in a lot of outliers just because of statistical deviations.

Etsy’s engineers tried in vain to combat the false positives, and the project was finally closed. Andrew Clegg has discussed the Kale project in a very good video that I strongly recommend watching in its entirety. In this article, I want to emphasize one slide from his video.

Firstly, anomaly detection is more than just outlier detection. There will always be outliers in any real production data in the case of normal system behavior. Therefore, not all outliers should be flagged as anomalous incidents at any level.

Secondly, a one-size-fits-all type of approach will probably not fit anything at all. There are no free lunches (https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization). The reason why Hewlett-Packard’s SHA does not work as expected is that it is trying to be a universal instrument.

Finally, the most interesting part of the slide addresses the possible approaches to solving these problems — alerts should only be sent out when anomalies are detected in business and user metrics and anomalies in other metrics should be used for root cause analysis.

Google SRE team’s BorgMon

The wonderful book Site Reliability Engineering contains a chapter that discusses the problems of monitoring production systems at Google (https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html).

According to the authors:

Google has trended toward simpler and faster monitoring systems, with better tools for post hoc analysis.
[They] avoid “magic” systems that try to learn thresholds or automatically detect causality.
Rules that generate alerts for humans should be simple to understand and represent a clear failure.

Based on these principles, Google’s engineers built the BorgMon system that runs in production clusters.

Time series

A time series is a series of data points that are indexed (or listed or graphed) chronologically. Economic processes have a regular structure (for example, the number of sales in a store, the production of champagne, online transactions, etc.). Usually, they exhibit seasonal dynamics and trend lines. Using this information simplifies analysis.

For example, let’s take the weekly data of a company — let’s say, the number of transactions in an online store. We can see strong daily regularity and trend lines (the two peaks represent a performance test). On Friday and Saturday, there is an increase in activity, and after that, activity decreases.

I discussed time series modelling in an article published on Medium (Time series modelling). In short, every measurement consists of a signal and an error component (noise), because our processes are affected by many factors.

Point_of_measurement = signal + error

Let us assume that we have a model that describes our signal with some precision. We can subtract the model’s values from our measurements, and the more closely our model resembles the real signal, the more closely the residual will resemble the error component, that is, stationary white noise. And we can easily check whether the residual time series is stationary or not.
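The subtraction step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code we run: the helper names `residuals` and `looks_stationary`, the split-in-halves heuristic, and the tolerance are all made up for this example, and a proper check would use a statistical test such as the augmented Dickey-Fuller test.

```python
import statistics

def residuals(measurements, model):
    """Subtract the model's values from the measurements, point by point."""
    return [m - s for m, s in zip(measurements, model)]

def looks_stationary(series, tolerance=0.5):
    """Crude heuristic (assumption): both halves of a stationary series
    should have a similar mean and a similar spread."""
    half = len(series) // 2
    first, second = series[:half], series[half:]
    spread = statistics.pstdev(series) or 1.0
    mean_shift = abs(statistics.mean(first) - statistics.mean(second)) / spread
    std_ratio = (statistics.pstdev(second) + 1e-9) / (statistics.pstdev(first) + 1e-9)
    return mean_shift < tolerance and (1 - tolerance) < std_ratio < 1 / (1 - tolerance)

# Synthetic data: a linear trend (the "signal") plus alternating noise.
model = [0.5 * i for i in range(40)]
noise = [1 if i % 2 == 0 else -1 for i in range(40)]
measurements = [s + e for s, e in zip(model, noise)]

print(looks_stationary(measurements))                    # trend still present
print(looks_stationary(residuals(measurements, model)))  # only noise is left
```

With a model that matches the trend, the raw measurements fail the check while the residual passes it.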

In my article, I provided some examples of modelling time series using linear and segmented regression.

For monitoring purposes, I chose modelling using moving statistics for mean and variance. A moving average is essentially a low-pass filter that passes signals with a frequency lower than a certain cutoff frequency. When used, it makes the resulting time series data smoother, removes noise, and leaves the main trend lines intact.

Let us take a look at the data from another week after applying a moving average filter with a window of 60 minutes (the peaks caused by performance tests have been removed).

Moving variance data has been collected in the same way for defining boundaries for the models. The result looks like something that Salvador Dalí might have painted.

If we embed our actual data into the model, we can easily detect outliers.
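As a rough sketch of this idea, the snippet below builds a band from a trailing moving mean and moving standard deviation and flags the points that fall outside it. It uses only the standard library to stay self-contained (pandas rolling windows would be the natural choice in practice), and the function names, the factor `k`, and the `min_std` floor are assumptions made for the example rather than our production settings.

```python
import statistics

def moving_stats(series, window):
    """Trailing moving mean and standard deviation for every index."""
    means, stds = [], []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        means.append(statistics.mean(chunk))
        stds.append(statistics.pstdev(chunk))
    return means, stds

def find_outliers(actual, means, stds, k=3.0, min_std=1.0):
    """Indices where the actual value leaves the band mean ± k * std.
    min_std keeps the band from collapsing on very flat history."""
    outliers = []
    for i, value in enumerate(actual):
        band = k * max(stds[i], min_std)
        if abs(value - means[i]) > band:
            outliers.append(i)
    return outliers

# One point per minute; a 60-point window corresponds to 60 minutes.
history = [100 + (i % 5) for i in range(120)]
means, stds = moving_stats(history, window=60)

actual = list(history)
actual[90] = 20  # simulated sudden drop in the metric
print(find_outliers(actual, means, stds))
```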

Now we have all the necessary components for creating our own alert system.

An important notice

The successes and pitfalls of the Kale project reveal a very important truth — monitoring and alerting is not the same as looking for anomalies and outliers in the metrics, because there will always be some outliers in any set of metrics.

In fact, there are two logical levels:

The first level involves searching for anomalies in metrics and sending out notifications if outliers are found. This is the information emission level.
The second level involves receiving such information and making decisions as to whether they represent real problems or outages. This is the information consumption level.

It is similar to how we as human beings solve problems. If we notice that something strange is happening in one part of a system, we check the other parts, propose a hypothesis, and try to disprove it. Later, we make a decision based on the gathered information.

System architecture

In the starting phase of our project, we decided to try Kapacitor from the InfluxData stack, because it offers the possibility to implement user-defined functions in Python. However, each UDF requires a separate process, and if there are thousands of metrics, it could result in a huge overhead. In the end, we chose the Python stack, because it offers a great ecosystem for data analysis and manipulation, fast libraries such as pandas, NumPy, excellent support of Web solutions — you name it.

For storing our metrics, we chose InfluxDB. It has been widely adopted by the community, because it is fast and reliable and comes with great support.

I came to Python from the Java world. This was the first large project that I implemented fully in Python. I did not want to create a menagerie of stacks for one system and it turned out to be the right decision.

Our system is built as a set of loosely coupled components or microservices that are executed on their own Python virtual machines. It is a natural way of defining the boundaries and responsibilities of each component, which makes it easier to extend the system and add features independently and on-the-fly without the fear of breaking something. It also makes it possible to perform distributed deployments or implement a scalable solution just with some configuration changes.

Our system also has an event-driven design. All communication goes through the message queue and the system works in an asynchronous way. I chose ActiveMQ for implementing the message queue, but it can easily be replaced with RabbitMQ, because every component communicates using the STOMP protocol (Simple Text Oriented Messaging Protocol, http://stomp.github.io/).
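To give a feel for how simple STOMP is, here is a minimal sketch that builds a SEND frame by hand: a command line, headers, a blank line, the body, and a terminating NUL byte. In practice a client library would do this for you. The destination name and the payload fields are hypothetical and only serve the example.

```python
import json

def stomp_send_frame(destination, payload):
    """Build a STOMP SEND frame: command, headers, blank line, body, NUL."""
    body = json.dumps(payload, sort_keys=True)
    headers = [
        f"destination:{destination}",
        "content-type:application/json",
        f"content-length:{len(body.encode('utf-8'))}",
    ]
    frame = "SEND\n" + "\n".join(headers) + "\n\n" + body + "\x00"
    return frame.encode("utf-8")

# Hypothetical violation message; the field names are assumptions.
frame = stomp_send_frame(
    "/topic/violations",
    {"site": "site-a", "metric": "logins", "value": 12},
)
print(frame.decode("utf-8").rstrip("\x00"))
```

Because the frame is plain text, any broker that speaks STOMP can sit behind the same producer and consumer code.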

Event Streamer is the component that holds Workers, fetches data regularly, and tests this data against the statistical models managed by Workers.

A Worker is the main working unit that holds a set of models together with meta-information. It consists of a Data Connector and a Handler. The Handler receives data from the Data Connector and tests it against a set of models using a specific strategy, and if a violation is found, the Agent fires a message.

Workers are fully independent and every cycle is executed using a threading pool. Most of the time, they perform I/O operations in the InfluxDB (or any other database solution) and the global interpreter lock of a Python virtual machine does not affect concurrent executions. The number of threads can be set in the configuration file. In our case, 8 threads was optimal.
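The pattern looks roughly like the sketch below, where `run_cycle` is a stand-in for one Worker cycle and `time.sleep` simulates waiting on the database. The names and the sleep duration are assumptions for illustration. While a thread waits on I/O it releases the global interpreter lock, so the cycles overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_cycle(worker_name):
    """Stand-in for one Worker cycle: fetch data (I/O), then test it."""
    time.sleep(0.05)  # simulated database round trip; the GIL is released here
    return worker_name, "ok"

def run_all(worker_names, max_threads=8):
    """Run one cycle of every Worker on a shared thread pool."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return dict(pool.map(run_cycle, worker_names))

results = run_all([f"site-{i}" for i in range(16)])
print(len(results), "workers finished")
```

With 16 workers and 8 threads, the cycles run in about two batches instead of one after another.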

The Rule Engine consumes the information provided by the Event Streamer. It is subscribed to a specific topic and every violation message is added to the tree of dictionaries that is associated with every site (a tag is used in the InfluxDB). The system stores all the events received within a specific time period, and after new messages are received, the oldest events are deleted. All keys are stored in PriorityQueue and ordered by time stamps.
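A simplified sketch of such a store is shown below. It keeps a nested dictionary per site and metric, plus a heap of keys ordered by time stamp so that the oldest events can be dropped cheaply when a new message arrives. The class name, the method names, and the use of `heapq` instead of `queue.PriorityQueue` are assumptions made to keep the example short.

```python
import heapq
from collections import defaultdict

class EventWindow:
    """Keeps violations from the last max_age seconds, grouped by site."""

    def __init__(self, max_age):
        self.max_age = max_age
        # site -> metric -> {timestamp: violation}
        self.tree = defaultdict(lambda: defaultdict(dict))
        self.keys = []  # heap of (timestamp, site, metric), oldest first

    def add(self, site, metric, timestamp, violation):
        self.tree[site][metric][timestamp] = violation
        heapq.heappush(self.keys, (timestamp, site, metric))
        self._evict(now=timestamp)

    def _evict(self, now):
        """Drop every event older than the time window."""
        while self.keys and self.keys[0][0] < now - self.max_age:
            ts, site, metric = heapq.heappop(self.keys)
            self.tree[site][metric].pop(ts, None)
            if not self.tree[site][metric]:
                del self.tree[site][metric]
            if not self.tree[site]:
                del self.tree[site]

window = EventWindow(max_age=300)
window.add("site-a", "logins", 0, "drop")
window.add("site-a", "payments", 100, "drop")
window.add("site-b", "logins", 400, "drop")  # pushes the event at t=0 out
print({site: dict(metrics) for site, metrics in window.tree.items()})
```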

Rule Engine Architectural high level overview

Every message is sent to the Rule Engine where the magic happens. In the proof-of-concept stage, I used some hardcoded rules in this component (for example, ‘send an alert if one metric is decreasing and another increasing’). However, this was not a universal solution, because code changes were required whenever I needed to add a new rule. A universal language for defining these rules therefore had to be implemented.

I used YAML for defining the rules, but it is possible to use any language if a specific parser is implemented.

The rules are set using regular expressions or simple prefixes/suffixes for the names of the metrics. The speed setting in a rule is the rate at which the metrics degrade, which I will discuss later.

When the Rule Engine component is started, an abstract syntax tree is built and every violation message is checked against it. If a rule is triggered, an alert is sent out. It is possible for an event to trigger more than one rule in which case the rule hierarchy defines which rule is more important.
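The matching step can be illustrated with a much simpler structure than a real syntax tree. In the sketch below, rules are plain dictionaries rather than YAML, they are compiled once at start-up, and a lower `priority` number means a more important rule. The rule names, fields, and thresholds are all hypothetical and do not reproduce our actual rule syntax.

```python
import re

# Hypothetical rules; in the real system they would be loaded from YAML.
RULES = [
    {"name": "any_metric_down", "pattern": r".*", "min_speed": 5.0, "priority": 2},
    {"name": "payments_down", "pattern": r"^payments\.", "min_speed": 2.0, "priority": 1},
]

def compile_rules(rules):
    """Compile the regular expressions once and order rules by importance."""
    compiled = [{**rule, "regex": re.compile(rule["pattern"])} for rule in rules]
    return sorted(compiled, key=lambda rule: rule["priority"])

def match(violation, compiled):
    """Return the name of the most important triggered rule, or None."""
    for rule in compiled:
        name_matches = rule["regex"].search(violation["metric"])
        fast_enough = abs(violation["speed"]) >= rule["min_speed"]
        if name_matches and fast_enough:
            return rule["name"]
    return None

compiled = compile_rules(RULES)
print(match({"metric": "payments.visa", "speed": -6.0}, compiled))
print(match({"metric": "logins.total", "speed": -6.0}, compiled))
print(match({"metric": "logins.total", "speed": -1.0}, compiled))
```

The first violation triggers both rules, and the ordering by priority decides which one is reported.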

Assuming that the dynamics of the incidents in the system escalate in time, we can take into account the speed and acceleration of the degradation of the metrics (which correspond to, respectively, the severity and the predicted change in the severity of the incident).

Speed is an angular coefficient or a discrete derivative of a particular metric, which is calculated for every violation. The same applies to acceleration or the second order derivative. These values could be set in the rules. The cumulative first and second order derivatives can be used in the overall assessment of an incident.
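As a small worked example, the discrete derivatives of a degrading metric can be computed as differences between consecutive points. The helper name and the sample values are assumptions for illustration; the time step is taken to be one unit between measurements.

```python
def discrete_derivative(values, step=1.0):
    """Differences between consecutive points divided by the time step."""
    return [(b - a) / step for a, b in zip(values, values[1:])]

metric = [100, 96, 90, 82]                  # a metric that keeps falling
speed = discrete_derivative(metric)         # first order: how fast it falls
acceleration = discrete_derivative(speed)   # second order: is the fall speeding up?

print(speed)              # [-4.0, -6.0, -8.0]
print(acceleration)       # [-2.0, -2.0]
print(sum(speed), sum(acceleration))  # cumulative values for the incident
```

A negative speed with a negative acceleration means the metric is not only degrading but degrading faster with every step.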

All Rule Engine computations are performed in memory, so the average time of checking one violation message against the rules is a bit less than 1 ms in the case of our configuration. It means that we can check more than 1,000 messages per second.

In addition, every Rule Engine process is independent, so it is very easy to shard it. We can create a hash from the service name and send the message to a separate sharded topic that is consumed by its own Rule Engine process. If necessary, we can process tens or hundreds of thousands of metrics per second like this.
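A minimal sketch of such routing is shown below. It uses `hashlib` rather than the built-in `hash()`, because the latter is randomized per process for strings and producers must agree on the shard. The topic prefix and the function name are hypothetical.

```python
import hashlib

def shard_topic(service_name, shard_count, prefix="/topic/violations"):
    """Map a service name to one of shard_count topics, stable across processes."""
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{prefix}.{shard}"

for name in ["payments", "logins", "sessions"]:
    print(name, "->", shard_topic(name, shard_count=4))
```

Every message for a given service lands on the same topic, so one Rule Engine process always sees the full history for that service.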

The Supervisor is the component that routes alerts to different consumers, removes old, stale alerts, and monitors the overall system health.

So now we have a working back end, but we still need a user interface for operating the system.

I wanted to stay within one stack as much as possible and not touch JavaScript at all. Therefore, Dash (dash.plot.ly) seemed like a great choice.

It was written on top of Flask, Plotly.js, and React.js and is ideal for building data visualization apps with highly customizable user interfaces in pure Python (https://dash.plot.ly/introduction).

The following components of our system serve the WebUI:

Worker Manager — a component that manages a particular Worker or site.

The set of models is displayed with weights and various other parameters. In the picture, you can see that two models exhibited some spikes, which affect the overall resulting model. Setting the weight of such a model to 0 reduces the influence of accidental factors, but plain averaging, the default, works pretty well too.
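The effect of the weights can be illustrated with a simple weighted average over models. This is a sketch under assumptions: the function name and the toy numbers are invented, and each model is reduced to a short list of expected values.

```python
def combine_models(models, weights):
    """Point-by-point weighted average of several models of equal length."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one model must have a non-zero weight")
    return [
        sum(w * model[i] for model, w in zip(models, weights)) / total
        for i in range(len(models[0]))
    ]

models = [
    [10, 10, 10],
    [10, 40, 10],  # this week contained an accidental spike
    [10, 10, 10],
]
print(combine_models(models, [1, 1, 1]))  # the spike leaks into the result
print(combine_models(models, [1, 0, 1]))  # the spiky model is switched off
```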

The replay functionality makes it possible to test any individual Worker or the whole site against any historical data, which drastically helps with understanding the system behavior or tuning the models.

There is also empty space in the WebUI for one future extension for generating models.

Report is a component that builds on-demand dynamic reports, where all the metrics that are related to the rules and all the alerts are recorded. When an alert is fired by the Rule Engine, a report link is generated. It is possible to receive a live stream of this data.

We have also integrated Grafana into our system, which means that for each alert, a link to the Grafana Dashboards is also generated.

A word of caution, however: due to the event-driven nature of the system, alerts are generated every time violations are detected and rules are triggered. This means that if we have some downtime, every heartbeat of the system will result in a lot of alerts, and we need deduplication and visualization for that. We have our own deduplication and routing component called Supervisor. However, that component was created for other purposes. My colleague found a brilliant solution. Meet Alerta (http://alerta.readthedocs.io/en/latest/).

According to its product information page:

The alerta monitoring system is a tool used to consolidate and de-duplicate alerts from multiple sources for quick ‘at-a-glance’ visualisation. With just one system you can monitor alerts from many other monitoring tools on a single screen.

It is a brilliant instrument that can easily be integrated with other components and embedded into our WebUI. The Alerta WebUI code is clearly written in Angular, and we made our own improvements to the UI/UX of the Alerta front end. There are a lot of existing integrations with Slack, HipChat, Email, etc., and you can easily add your own integrations by writing the necessary hooks. What makes this solution particularly suitable for us is that it is possible to write all the alerts from Alerta back to InfluxDB and display the statistics in Grafana.
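To show why de-duplication matters for an event-driven system, here is a generic sketch of the idea. It is not Alerta's implementation: the class, the choice of (site, rule) as the key, and the stored fields are assumptions for illustration only.

```python
class AlertDeduplicator:
    """Collapses repeated alerts for the same site and rule into one entry."""

    def __init__(self):
        self.open_alerts = {}  # (site, rule) -> summary of the alert

    def receive(self, site, rule, timestamp):
        """Return True for a new alert and False for a repeat of an open one."""
        key = (site, rule)
        alert = self.open_alerts.get(key)
        if alert is None:
            self.open_alerts[key] = {
                "first_seen": timestamp,
                "last_seen": timestamp,
                "count": 1,
            }
            return True
        alert["last_seen"] = timestamp
        alert["count"] += 1
        return False

dedup = AlertDeduplicator()
# During downtime the same rule fires on every heartbeat.
notifications = [dedup.receive("site-a", "payments_down", t) for t in range(5)]
print(notifications)
print(dedup.open_alerts[("site-a", "payments_down")])
```

Only the first event would notify a human; the repeats just update the counter and the last-seen time.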

Conclusion

It took 6 months from our first meeting till the product launch. Our system detects problems much earlier than the solution from Hewlett-Packard. Furthermore, it also detects some problems that were previously hidden from us. We are constantly expanding the system and adding new functionality. It has been an amazing adventure that allowed us to touch upon nearly every field of computer science and software engineering and this adventure is still unfolding.

I would like to thank Andrew Clegg for sharing his experience, InfluxData for their great support and amazing product, the Alerta team and community for their efforts in creating a great open source product, and everybody in Playtech who was involved in this project.