DevOps: Before/As it happens (Monitoring)

Rizbe
Published in Axial Engineering
Nov 24, 2016 · 6 min read

You’re having a great week: you pushed out a couple of solid changes, got to work on some requested features, and made it almost the entire week without anything bad happening. You walk in on Friday expecting the day to be smooth and short, but something horrible happens: your entire platform is down and you have no idea why. That’s when you need to rethink whether your monitoring stack is right for you.

When I first joined Axial, we were using Zabbix. The first time I opened the web interface, it looked like something built in the early 2000s. I started to explore the features, and it seemed like a system that tried to do it all. The metric collection was not the best, the checks were not easy to understand, and it seemed way too complex. At that point, I had years of experience with system monitoring using Nagios, Icinga2, and Sensu. Since Axial has its entire infrastructure in AWS, I wanted to reevaluate our monitoring stack to see what would fit us best. The main criteria I was looking at were: is it easy to scale, how easy is it to write checks, and can we automate it?

How Sensu Works

Thinking about those three questions, the answer seemed obvious: Sensu had everything we wanted. Out of the gate, it is very lightweight and fast, and it scales easily using RabbitMQ as a message bus and Redis as an in-memory data store. Sensu can be deployed and automated with tools like Ansible, Chef, Puppet, and many others. You can also reuse just about any check from a previous monitoring system, or write your own in any language. The basic idea of every check is to test something and, depending on its state, exit with a status code of 0 (OK), 1 (warning), 2 (critical), or 3 (unknown). Here is an example check written in bash:

#!/bin/bash

# Check the state
rand=$RANDOM # 0 to 32k

# Report how dire the situation is
if [ "$rand" -gt 22000 ]; then
  echo "Ok: random number generated was high enough ($rand)"
  exit 0
elif [ "$rand" -gt 11000 ]; then
  echo "Warning: random number generated was $rand"
  exit 1
else
  echo "Critical: random number generated was $rand"
  exit 2
fi

echo "Unknown: How did I get here?"
exit 3 # or higher
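
On the Sensu side, a check script like this gets wired up with a small JSON definition that tells the server which command to run, how often, and on which clients. Here is a minimal sketch of how that definition might be dropped into place; the file path, script location, subscription name, and handler are placeholders, so adjust them for your own setup.

# Sketch only: install a check definition on a Sensu server/client.
# Paths, the "all" subscription, and the "default" handler are assumptions.
cat > /etc/sensu/conf.d/check_random_number.json <<'EOF'
{
  "checks": {
    "check-random-number": {
      "command": "/etc/sensu/plugins/check-random-number.sh",
      "subscribers": ["all"],
      "interval": 60,
      "handlers": ["default"]
    }
  }
}
EOF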

Sensu works great for us because all of our infrastructure runs Linux and lives in the cloud. Regardless of which monitoring system works best for you, let’s talk about how to write checks. Before writing any, you should audit your infrastructure to determine what’s critical and what isn’t. Once the audit is complete, look at each critical component and determine how it could break and how that would impact your organization.

Let’s talk about a critical infrastructure piece at Axial and how I determined what checks were needed. We send all of our logs from all our critical servers to Graylog2. Graylog’s backend is Elasticsearch, which is where the actual logs are stored. If our Elasticsearch cluster went down, it would be very bad and we might lose a ton of log data. Before I can write any checks, I need to determine how the Elasticsearch cluster can break.

The first thing that comes to mind is cluster health: if it’s green, then at a high level everything is going as expected. But why stop there? Let’s also check the health of each individual node, so that if the cluster turns yellow or red we can tell which node might be the cause. We can look at file descriptors to make sure we aren’t running out. We can check how many shards we have, because too many shards can kill an Elasticsearch cluster. There is more we could check on the Elasticsearch side, but let’s consider the system as well: we want to verify that memory, CPU, and disk are also fine across the cluster.
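
To make that concrete, here is a rough sketch of what a cluster-health check might look like in bash. It assumes Elasticsearch answers on localhost:9200 (the host and port are placeholders) and simply maps green, yellow, and red onto the same exit codes as the example above.

#!/bin/bash
# Sketch: report Elasticsearch cluster health as a Sensu-style check.
# Assumes Elasticsearch is reachable at the URL below; adjust as needed.
ES_URL="${1:-http://localhost:9200}"

# Fetch cluster health and pull out the "status" field (green/yellow/red)
health=$(curl -s "$ES_URL/_cluster/health")
status=$(echo "$health" | grep -o '"status":"[a-z]*"' | cut -d'"' -f4)

case "$status" in
  green)
    echo "Ok: cluster status is green"
    exit 0
    ;;
  yellow)
    echo "Warning: cluster status is yellow"
    exit 1
    ;;
  red)
    echo "Critical: cluster status is red"
    exit 2
    ;;
  *)
    echo "Unknown: could not determine cluster status from $ES_URL"
    exit 3
    ;;
esac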

Now that we’ve thought of a few ways Elasticsearch can break, we can write checks for those things. The best part of Sensu is the huge community behind it: a lot of checks have already been written for popular services such as Elasticsearch, Nginx, and Redis. If a check does not exist, you can easily write your own in any language to cover that particular scenario. I could talk about Sensu for days, but let’s discuss one more aspect of monitoring.

A tool like Sensu will alert you when something happens, but it’s not very good at showing you trends of things that have already happened. What am I talking about? Picture this: your main web server sits at about 40% utilization, and you have set up Sensu to warn at 80% and go critical at 85%. Over the past six weeks, the server climbs from 40% to about 75% utilization, but you have no idea because nothing has triggered Sensu yet. Clearly something has changed, and this is how issues tend to creep up on you: things build up while you’re unaware. So how does Axial deal with this problem? We use Telegraf, InfluxDB, and Grafana.

So what do those three things do? Telegraf collects just about every type of metric from your system. It doesn’t stop at system metrics: you can set it up to collect application metrics as well, and if you’re running Nginx, you can collect just about every Nginx metric you can think of. Where do we store all these metrics? We put them into InfluxDB, a fast time series database. Now that we have all this data, what do we do with it? With Grafana, you can turn that metric data into dashboards with live data coming in. Using this method, we are able to see trends in our infrastructure and catch problems before they happen.
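
As a small example of what that enables, here is the kind of ad-hoc query you could run against InfluxDB’s HTTP API to see the slow CPU climb from the earlier scenario. It assumes Telegraf is writing its standard cpu measurement into a database named telegraf on localhost; both names are just defaults and may differ in your setup. In practice you would build a Grafana panel around the same query rather than curl it by hand.

# Sketch: daily average CPU usage over the last six weeks, assuming Telegraf's
# cpu input is writing into an InfluxDB database named "telegraf".
curl -G 'http://localhost:8086/query' \
  --data-urlencode "db=telegraf" \
  --data-urlencode 'q=SELECT 100 - mean("usage_idle") FROM "cpu" WHERE time > now() - 6w GROUP BY time(1d)'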

Grafana Dashboard with Live Metrics

The Axial monitoring stack consists of Sensu, Telegraf, InfluxDB, and Grafana, and it has served us well. The thing about monitoring is that you always have to reevaluate your infrastructure and determine whether you have all the proper checks in place. For the last six months, we have been doing a major push toward Docker and Kubernetes, and I frequently ask myself whether our current implementation is optimal for what we are doing. The answer is no; the stack itself might be fine, but the way we handle monitoring has to change because we are containerizing our entire infrastructure. That means a new set of challenges: rethinking how we monitor and how we collect information. If you spend your time thinking about how things will break, you can create an environment where you catch most issues before or as they happen. Also remember this: you can’t envision every scenario, so if something breaks and you weren’t prepared, that’s fine. Learn from it, and always write checks.
