Creating a Well-Designed and Highly-Available Monitoring Stack for Servers, Kubernetes Cluster and Applications

Or how to navigate on a multi-hull

Antoine Hamon
Nephely
10 min read · Mar 1, 2019


It’s no longer a secret to anyone: a well-performing and well-designed monitoring stack is essential for a good quality of service and for the good mood of operators. Always present but often neglected, it is nevertheless a key point of every infrastructure and application design, and consequently deserves our full attention. Not only should it be designed to cover every layer of your infrastructure, it also needs to be very reliable in order to be accessible whenever it’s needed the most: during a major outage (the crash of a VM or an AZ, for example).

What do I call a well-designed monitoring solution?

  • It should be highly available, even in the case of a major outage.
  • It shouldn’t lose any data (or as little as possible) in case of a failure of the monitoring system itself.
  • It should give good insight into each part of the infrastructure, from hardware to applications, and it should contain both metrics and logs.
  • It should gather data in a few places (not have it spread everywhere).
  • It should give some context to the monitored objects (what OS the VM runs, which VM the application runs on, which namespace it belongs to, what labels it has associated, …).
  • Since we are using Kubernetes (and maybe a cloud provider), we need it to be compatible with VM and application discovery.

Having these requirements, we can deduce some consequences:

  • The monitoring system should be deployed across multiple AZs (or be able to move freely between them).
  • Software that monitors Kubernetes cannot be installed inside Kubernetes itself.
  • Data must reside on a storage cluster, and/or be clustered. If you do not have a storage cluster, data should be replicated (covered later in this article).

Choosing applications for the monitoring stack

The tools I chose here are obviously not the only ones suitable for the need; I selected them because they are free, open-source, and well known to administrators (and quite a reference for most of them).

  • Metrics will be collected by Prometheus (I guess there is no need for an introduction). Prometheus is very convenient as it is able to connect to various cloud providers to enable VM discovery, and it is also able to discover applications inside Kubernetes (a minimal scrape configuration is sketched right after this list). By default Prometheus stores its data on disk, so we will have to configure it accordingly for high availability.
  • Monitoring alerts will be forwarded by AlertManager, Prometheus’ companion alerting service. It can forward alerts via email, Slack, PagerDuty, VictorOps, and others.
  • Logs will be aggregated by ElasticSearch, the world-famous search engine. As it can be deployed as a cluster across multiple AZs, it will be easy to make it highly available. It also offers the possibility to create an index per day (or per hour), which will make deletion easier. The ElasticSearch cluster will be monitored via a Prometheus exporter, and maintenance (index deletion) will be delegated to ElasticCurator. The latter will be installed inside the Kubernetes cluster, as it is easier to schedule tasks on a regular basis via a CronJob, and as it doesn’t really matter if one run gets skipped (which would only happen if your whole Kubernetes cluster stays down for longer than the interval at which ElasticCurator is scheduled).
  • Logs will be parsed and forwarded by both Fluentd and Fluentbit. Their names are as similar as their purposes, as they are actually developed by the same company. Both are comparable to Logstash, but I chose them because they can expose metrics in the Prometheus format. Fluentd will be used to forward OS syslogs, and Fluentbit will be used to forward container logs as it is cloud compatible: it can connect to Kubernetes for service discovery, and it adds tags to logs to specify the context (pod, deployment, namespace, labels, …), which will ease the link to metrics. Even if Fluentbit is able to forward syslog messages as well, we won’t be using it for that purpose, as it will be installed inside Kubernetes and thus may not work if Kubernetes breaks down.
  • The monitoring (metrics and logs) will be displayed on Grafana, the goal being to have a single dashboard containing both metrics and logs. Grafana can connect to multiple data-sources including Prometheus, AlertManager and ElasticSearch (obviously).
  • The OS and hardware will be monitored by a Prometheus exporter named NodeExporter. It gives metrics for the OS, CPU, RAM, disks, RAID, network, systemd, … It should be configured to check that every critical service is running: Fluentd, Docker (on Kubernetes VMs), Prometheus/AlertManager/Grafana (on monitoring VMs), and ElasticSearch on its cluster VMs (see the diagram below). The other services are deployed inside Kubernetes, which will itself make sure they are running.
  • Docker containers (thus the Kubernetes cluster and its applications) will be monitored by cAdvisor, which provides a lot of resource metrics (CPU, RAM, network, disks, …). It also ties the Kubernetes context and tags to metrics for linking purposes.
  • Kubernetes objects (deployments, statefulsets, daemonsets, cronjobs, …) will be monitored by kube-state-metrics. It is an optional component of Kubernetes (mainly used for horizontal pod autoscaling) and provides metrics in the Prometheus format. This will be used to monitor whether a replica is missing, a pod keeps restarting, node resource allocations have been exceeded, a job failed or hasn’t been scheduled, a deployment failed, …
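
To illustrate the application-discovery point from the Prometheus bullet above, here is a minimal sketch of a scrape job relying on Kubernetes service discovery. The job name and the ‘prometheus.io/scrape’ annotation convention are assumptions made for this example, not something imposed by the stack.

```yaml
# prometheus.yml (sketch): scrape pods discovered through the Kubernetes API.
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only keep pods that explicitly opt in via an annotation (assumed convention).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Carry the namespace and pod name over as labels, giving metrics their context.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```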

Now that the software has been chosen, let’s categorize it according to installation location:

  • Outside Kubernetes (regular services): Prometheus/AlertManager, Grafana, ElasticSearch cluster, Fluentd, NodeExporter and cAdvisor (actually containerized).
  • Inside Kubernetes: Prometheus/AlertManager, Fluentbit, kube-state-metrics and ElasticCurator (a sketch of the ElasticCurator CronJob follows this list).
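
Since ElasticCurator only has to run once in a while, a Kubernetes CronJob is enough. Here is a minimal sketch of what it could look like; the image, tag, schedule and ConfigMap name are placeholders, not a tested manifest.

```yaml
# CronJob running ElasticCurator every night to delete old indices (sketch).
apiVersion: batch/v1beta1          # batch/v1 on recent Kubernetes versions
kind: CronJob
metadata:
  name: elasticsearch-curator
spec:
  schedule: "0 1 * * *"            # every day at 01:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: curator
              image: curator:5.x   # placeholder image/tag
              args: ["--config", "/etc/curator/config.yml", "/etc/curator/actions.yml"]
              volumeMounts:
                - name: curator-config
                  mountPath: /etc/curator
          volumes:
            - name: curator-config
              configMap:
                name: curator-config   # holds config.yml and actions.yml
```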

And thus the whole solution can be designed as follows:

The ‘VMs’ section corresponds to every VM that is present in the infrastructure (monitoring’s, ElasticSearch cluster’s, and Kubernetes’)

Wait! Don’t we have two Prometheus servers?

That is correct, we will be using (at least) two Prometheus servers, and there are multiple reasons for this decision: first, we don’t want to overload Prometheus, as collecting and parsing metrics are quite resource-consuming tasks, especially when you have a ‘large’ number of applications and/or VMs to monitor. Second, according to our high-availability requirements, if we were using a single Prometheus server it couldn’t be installed inside Kubernetes, and therefore it would be a little more difficult to configure it for application discovery (not that much, actually). And third, since it doesn’t make sense to monitor applications inside Kubernetes when Kubernetes itself is facing an outage, why not take advantage of Kubernetes and deploy that Prometheus instance inside it.

If you are planning on hosting a lot of applications inside your Kubernetes cluster, and/or you have a large number of VMs, you should consider splitting metrics collection among multiple Prometheus servers and then use the Prometheus Federation to merge their data.
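
As a hedged illustration of that federation setup, the ‘merging’ Prometheus would scrape the /federate endpoint of the others; the target address and the match[] selectors below are assumptions.

```yaml
# prometheus.yml of the aggregating server (sketch): pull selected series
# from a downstream Prometheus through the federation endpoint.
scrape_configs:
  - job_name: 'federate-kubernetes'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="kubernetes-pods"}'
        - '{__name__=~"kube_.*"}'
    static_configs:
      - targets:
          - prometheus-kubernetes.example.internal:9090   # placeholder address
```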

Making data highly-available

There are multiple ways of achieving this, depending on your hosting environment (multi-AZ or not, access to a storage cluster or not), so I will try to cover the most common scenarios.

Whenever possible you should make sure that two VMs running the same service are not hosted by the same physical machine.

Prometheus data

As said earlier, Prometheus writes its data to disk by default; nevertheless it is compatible with multiple storage backends, so I definitely recommend choosing one that supports clustered deployments, like InfluxDB (the clustered version is not free) or PostgreSQL / TimescaleDB. You can also check out the Thanos project (GitHub repository), as it is designed to store Prometheus data on multiple object storages. If you don’t want to manage such a solution and prefer to keep Prometheus data on disk, you can still check the indications under the ‘Other data’ section.
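
If you go the storage-backend route, Prometheus ships its samples through the remote write/read API. A minimal sketch follows; the adapter URL is a placeholder for whichever backend (InfluxDB, a TimescaleDB adapter, …) you end up choosing.

```yaml
# prometheus.yml (sketch): send samples to a clustered backend and read them back.
remote_write:
  - url: "http://metrics-store.example.internal:9201/write"   # placeholder adapter URL
remote_read:
  - url: "http://metrics-store.example.internal:9201/read"    # placeholder adapter URL
    read_recent: true   # also query the backend for recent data
```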

ElasticSearch data

ElasticSearch clusters are composed of three types of nodes: master, data and ingest (tribe nodes are not useful here). When designing your cluster for high availability, you should consider having one replica of each type in each AZ, with a minimum of 3 AZs, thus 3 instances of each. If you deploy your cluster on 2 AZs it will not be highly available in case of an AZ outage. You should also have the same number of master and data nodes in each AZ, as otherwise there are chances your cluster won’t function correctly: if an AZ goes down while containing more masters than the other AZs, the ElasticSearch cluster may not find a consensus to elect a new master, and if it was containing more data nodes than the other AZs you may not have access to some data. However, having the same number of data nodes in each AZ doesn’t actually prevent some data from becoming inaccessible during an AZ outage (as ElasticSearch clusters are not AZ-aware by default), but it lowers the risk. Besides, there is little chance that your infrastructure needs more than 3 master nodes, so having one per AZ should be sufficient.

If you are deploying this stack on a single AZ, you still need at least 3 master nodes so they can reach a consensus when electing a master, but you don’t need any specific number of data and ingest nodes; just make it fit your needs, with a minimum of two nodes for high availability.

(more information in the official documentation)
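
To make the node roles concrete, here is a sketch of an elasticsearch.yml for a dedicated master node, using the 6.x-era settings contemporary with this article (7.x replaced minimum_master_nodes with a different discovery mechanism). Cluster name, node names and hosts are placeholders; data and ingest nodes would simply flip the three booleans.

```yaml
# elasticsearch.yml (sketch) for a dedicated master node in a 3-AZ cluster.
cluster.name: monitoring-logs
node.name: master-az1
node.master: true
node.data: false
node.ingest: false
# With 3 master-eligible nodes (one per AZ), a quorum of 2 avoids split-brain.
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["master-az1", "master-az2", "master-az3"]
```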

Other data

The last component that stores data is Grafana (dashboards, users/rights and configuration). If your deployment is single-AZ then you shouldn’t face any issue; just configure your VM’s Cloud-Init script to mount the volume whenever the VM gets re-created.
If you are planning on deploying your stack multi-AZ, things can become a little more complicated. You basically have two options: the ‘hard’ way, or the ‘dirty’ way.

The ‘hard’ way consists in having a storage cluster that can be accessed from every AZ of your deployment (quick reminder that AWS EBS is not, as blocks are tied to a single AZ). Deploying Ceph would be a good choice to address this issue, but it will definitely be overkill if you are not planning on using it for anything else (Prometheus data?).

The ‘dirty’ option consists in creating a workaround to access your block storage data from every AZ. The idea here is to regularly take a snapshot of your volume and make it accessible from every AZ, on an object storage for instance (EBS snapshots are stored on S3 in the case of AWS). Then configure the Cloud-Init script to create a new volume from the latest snapshot and attach it to the newly created VM. Note that this method comes with some data loss whenever the VM crashes, as you will miss the data that has been created since the latest snapshot. But in the case of Grafana, as its data shouldn’t be updated that often, this is hopefully a ‘suitable’ solution. Evidently, the more frequent the snapshots, the less data will be lost.
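
As an illustration of that ‘dirty’ option on AWS, here is a sketch of a Cloud-Init user-data that rebuilds the Grafana volume from the latest snapshot at boot. The snapshot tag, region, device name and mount point are assumptions, and error handling is left out to keep the sketch short.

```yaml
#cloud-config
runcmd:
  - export REGION=eu-west-1   # placeholder region
  - export AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
  - export INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  # Most recent snapshot tagged Name=grafana-data (the tag is an assumption).
  - export SNAP=$(aws ec2 describe-snapshots --region $REGION --filters Name=tag:Name,Values=grafana-data --query 'Snapshots | sort_by(@, &StartTime)[-1].SnapshotId' --output text)
  - export VOL=$(aws ec2 create-volume --region $REGION --availability-zone $AZ --snapshot-id $SNAP --query VolumeId --output text)
  - aws ec2 wait volume-available --region $REGION --volume-ids $VOL
  - aws ec2 attach-volume --region $REGION --volume-id $VOL --instance-id $INSTANCE_ID --device /dev/xvdf
  - sleep 10 && mount /dev/xvdf /var/lib/grafana   # simplified; wait for the device properly in real life
```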

Accessing data

When VMs are members of an AutoScalingGroup, we cannot guess their IP addresses in advance, which makes it impossible to configure our services to communicate with their backends or cluster members (especially the ElasticSearch cluster, the Prometheus servers, the Prometheus storage backend if any, and Grafana). To solve this issue you should access your metrics data and other applications via a load-balancer, and configure it to be updated according to the status and members of the AutoScalingGroup.

Going further

I didn’t want to point it out earlier, but we actually have two ‘SPOFs’ in our stack: nothing extracts metrics from the Prometheus servers (only the service itself is monitored by Kubernetes or NodeExporter), and nothing monitors that users can access your application from the Internet (and that operators can access Grafana, if deployed on a public cloud). To address the latter issue you should deploy a service that checks your application’s connectivity. It will have to be installed outside of your deployment and access your application via the Internet (as opposed to the local network). Regarding the Prometheus issue, Prometheus is able to extract and process metrics from itself, which might be sufficient, especially for the one sitting inside Kubernetes (although this does not address the high-availability issue), but I actually recommend deploying another Prometheus service and configuring the two to check one another.
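
Here is a minimal sketch of that ‘check one another’ setup, with each Prometheus scraping its peer and alerting when it disappears; the host name, job name and alert name are assumptions. The two snippets below belong to two different files loaded by the same server.

```yaml
# prometheus.yml (sketch): scrape the other Prometheus server.
scrape_configs:
  - job_name: 'prometheus-peer'
    static_configs:
      - targets: ['prometheus-peer.example.internal:9090']   # placeholder address

# rules file (sketch): fire when the peer has been unreachable for 5 minutes.
groups:
  - name: meta-monitoring
    rules:
      - alert: PrometheusPeerDown
        expr: up{job="prometheus-peer"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "The peer Prometheus server has been unreachable for 5 minutes"
```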

Some advices for the Grafana dashboard

Creating a well-performing monitoring dashboard is crucial; the need can be summed up as follows: the better the monitoring, the shorter the response time to failures. Here are some pieces of advice that go in that direction:

  • Since we designed an architecture that enables us to access all the information from Grafana, why not take advantage of it and create a single dashboard? Alerts and key figures at the top, followed by one section per layer of our infrastructure, containing every metric and log needed to investigate that layer.
  • When configuring alerts, you should consider having both short-term and long-term alerts, with high priority for the former and lower priority for the latter. Let me give you an example: let’s say you have an alert that is triggered whenever a container restarts more than 3 times within 15 minutes. This alert should have a high priority, as it shows that your application is facing a major issue and thus affecting your service. But you should also have an alert if a container restarts more than 3 times within a week (or a longer period), as this is definitely not something that should happen. This alert will have a lower priority than the previous one, as it only affects the service for a small amount of time and does not occur too frequently (both rules are sketched right after this list).
  • Last, to be in line with SRE practices you should monitor your SLOs and, conversely, your error budget over different time frames (weekly, monthly, annually). This will help you plan your maintenance and feature deliveries (in order not to break your SLOs, and therefore your SLAs).
    Do not forget: a 100% availability SLA is not a reachable target; too high an SLA induces slower delivery of features, and too low an SLA means unhappy users… As always, everything is about balance.
    To get more information about this practice, I recommend you watch the SRE video playlist created by Google Cloud Platform.
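
To make the short/long alert example above concrete, here is a sketch of the two rules, based on the kube_pod_container_status_restarts_total counter exposed by kube-state-metrics; the alert names and severity labels are assumptions.

```yaml
# Prometheus rules file (sketch): same condition on two time windows.
groups:
  - name: container-restarts
    rules:
      - alert: ContainerRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: critical   # short window, high priority
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15 minutes"
      - alert: ContainerRestartingOverTheWeek
        expr: increase(kube_pod_container_status_restarts_total[7d]) > 3
        labels:
          severity: warning    # long window, lower priority
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times this week"
```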

Here it is, my recommended monitoring architecture!
