How we monitor Kubernetes

After almost a year (and lots of failures), here is how we monitor our production Kubernetes clusters. We hope it gives you some insight into monitoring your own clusters (if not, please carry on, this was not the article you were looking for).


Our goal was simple: to monitor every cluster-related Kubernetes metric we could. This includes deployment counts, pod health, disk usage, cluster CPU/memory, unhealthy container counts, etc.
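
To make a couple of those metrics concrete, here is a minimal sketch of counting pods per phase and containers that are not ready. It assumes you already have the pod list as JSON (for example from `kubectl get pods -o json`); the sample data and the `summarize_pods` helper are made up for illustration and are not part of any of the tools discussed below.

```python
import json

# Hypothetical, trimmed sample shaped like the output of `kubectl get pods -o json`.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "web-1"},
     "status": {"phase": "Running",
                "containerStatuses": [{"name": "web", "ready": true}]}},
    {"metadata": {"name": "web-2"},
     "status": {"phase": "Running",
                "containerStatuses": [{"name": "web", "ready": false}]}},
    {"metadata": {"name": "job-1"},
     "status": {"phase": "Pending"}}
  ]
}
""")


def summarize_pods(pod_list):
    """Count pods per phase and containers that report ready == false."""
    phases = {}
    unready_containers = 0
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        phases[phase] = phases.get(phase, 0) + 1
        # Pods that have not been scheduled yet may have no containerStatuses.
        for container in status.get("containerStatuses", []):
            if not container.get("ready", False):
                unready_containers += 1
    return {"phases": phases, "unready_containers": unready_containers}


summary = summarize_pods(SAMPLE)
print(summary)
```

A real monitoring stack collects this continuously and keeps the history, which is exactly what the combinations below are for.
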
Kubernetes (Greek for “helmsman” or “pilot”) is a great orchestration tool first developed and used by Google (as far as we know) as a descendant (maybe kind of) of Google Borg. What does it orchestrate? Well, it orchestrates your Docker containers and lets you manage them all easily (I guess that depends on how much you suffered along the way) and quickly.


Grafana + Heapster + InfluxDB

First, we started monitoring with the Grafana + Heapster + InfluxDB combination.
This is maybe one of the easiest ways to start monitoring your cluster, if you use the Kubernetes dashboard that is. Heapster itself is a very basic metrics collector: it gathers CPU and memory usage, and the Kubernetes dashboard shows that usage for a given section (overall, deployments, individual pods, nodes, etc.).

Heapster requires a database to store data in and fetch it from when needed. This is what InfluxDB actually is: just a time series DB (warning, oversimplification: it’s actually much, much more than that, but for the sake of this article we’ll leave it there).
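
To give a rough idea of what a single data point looks like when it lands in InfluxDB, here is a small sketch that formats one point in InfluxDB’s line protocol (measurement, tags, fields, nanosecond timestamp). The measurement name, tag names and the `to_line_protocol` helper are our own illustration, not what Heapster actually writes, and the sketch only handles float fields.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp.

    Assumption: every field value is a float, so no integer or string
    field syntax is handled here.
    """
    tag_part = "".join(
        "," + _escape(key, ", =") + "=" + _escape(str(value), ", =")
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        _escape(key, ", =") + "=" + repr(float(value))
        for key, value in sorted(fields.items())
    )
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical point: CPU usage of one pod at a fixed timestamp.
point = to_line_protocol(
    "cpu_usage",
    {"pod": "web-1", "namespace": "default"},
    {"value": 0.25},
    1500000000000000000,
)
print(point)
```

Every point carries its own timestamp and tags, which is what lets Grafana later slice the same measurement by pod, namespace or node.
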

After setting up Heapster and InfluxDB, and of course after some time, you start to dislike them a bit. Not that they have problems or anything, but they start to fall short of your needs (errr.., your monitoring needs that is).

This is where Grafana comes in (no screenshot for you because we already gave one at the beginning of this article, and don’t think of that as some kind of standard screen; there are tons of different dashboards). You can think of Grafana as a holy grail. You can put pretty much any kind of metric in it and it turns them into beautiful, human-readable, easy-to-comprehend graphs (the necessary data comes from that little DB, remember?). When I say pretty much any kind of metric, I mean it: there are people out there who use Grafana to monitor bitcoin prices or stock market data in real time. Seriously, you should check them out. Of course, you can also set alerts on the metrics that are important to you.

  • Pros:
    Easy to install.
    No limits on what you can monitor as long as you provide data.
    Nice community.
    Most stuff is free and also open source.
  • Cons:
    Can be painful when it comes to creating or changing specific dashboard metrics to fit your needs.
    A few bugs here and there.
    Lacks its own logs, so you can get lost when trying to solve a problem within its own configuration.
    Premium tools and support are way too expensive.

Sematext’s SPM

After the GrafHeapIn combination we tried another solution, a hosted one: Sematext’s SPM. At first we loved it. It was so easy to monitor stuff and keep logs of, well, pretty much everything. After some time, not so much. We found various bugs and worked with their support to solve them, but in the end the juice wasn’t worth the squeeze. It became a little painful and expensive along the way (not to mention the slow response times from support), but that product is still somewhat of a good memory for us.

  • Pros:
    Great, easy-to-use pre-configured dashboards (of course you can create your own).
    Easy to add tons of members from your company if you like.
    Very good integration list (so far unrivalled).
  • Cons:
    Slow tech support.
    Pricing policy.
    Buggy UI.

Grafana + Prometheus

This is the most popular solution, I guess. There are tons of shared dashboards for Grafana. The easiest way to set up monitoring on Kubernetes so far (not counting SaaS solutions) is to use Grafana. Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. There are significant differences between Prometheus and InfluxDB, and both systems are geared towards slightly different use cases. Therefore, you should not compare them as if they were the same; they are not, and we’ll leave it there.

Where InfluxDB is better:

  • If you’re doing event logging.
  • The commercial option offers clustering for InfluxDB, which is also better for long-term data storage.
  • Eventually consistent view of data between replicas.

Where Prometheus is better:

  • If you’re primarily doing metrics.
  • More powerful query language, alerting, and notification functionality.
  • Higher availability and uptime for graphing and alerting.

  • Pros:
    Very nice in depth metrics analysis
  • Cons: 
    If you don’t know how to write queries to fetch data, don’t bother setting this up.
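
To give a feel for what “writing a query” means here, below is a hedged sketch that builds a request URL for the Prometheus HTTP API (`/api/v1/query`) with a PromQL expression summing the container CPU usage rate per namespace. The server address is a placeholder, and the metric and label names assume the usual cAdvisor metrics are being scraped, so adjust them to whatever your cluster actually exposes.

```python
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with your own.
PROMETHEUS_URL = "http://prometheus.example:9090"

# PromQL: per-namespace CPU usage rate over the last 5 minutes.
# Assumes cAdvisor's container_cpu_usage_seconds_total metric is available.
CPU_BY_NAMESPACE = "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)"


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


url = instant_query_url(PROMETHEUS_URL, CPU_BY_NAMESPACE)
print(url)
```

Grafana sends the same kind of expression for you; the point is only that you need to be comfortable reading and writing the part stored in `CPU_BY_NAMESPACE`.
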

Grafana + kubernetes-app plugin + Graphite

This combination comes with prepared dashboards that can do pretty much anything you need to monitor Kubernetes metrics. Graphite is roughly the same thing as Prometheus, but if you want a clustered solution that can hold historical data long term, Graphite may be a better choice. Also, it is easier to fetch data from Graphite and easier to write queries for it (I guess this depends on personal experience with them).

Graphite focuses on being a passive time series database with a query language and graphing features. Any other concerns are addressed by external components.
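
“Passive” means Graphite waits for something else to push data to it. As a minimal sketch, assuming a Carbon listener that accepts the plaintext protocol on its default port 2003, a metric is just a line of `path value timestamp`. The metric path and the helper names below are made up for illustration.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_lines(lines, host, port=2003):
    """Push already formatted lines to a Carbon plaintext listener.

    Not called in this sketch; host and port are whatever your setup uses.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric path for one pod's CPU usage at a fixed timestamp.
line = graphite_line("k8s.prod.web-1.cpu_usage", 0.25, 1500000000)
print(line, end="")
```

Everything else (collecting the value, deciding when to send it, alerting on it) is left to external components, which is the contrast with Prometheus.
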

Prometheus is a full monitoring and trending system that includes built-in and active scraping, storing, querying, graphing, and alerting based on time series data. It has knowledge about what the world should look like (which endpoints should exist, what time series patterns mean trouble, etc.), and actively tries to find faults.

Our Winner

We chose the Grafana + kubernetes-app plugin + Graphite combination. This is a ready-to-use solution for Kubernetes. You can monitor whatever you like in your cluster with ease. It keeps nice historical data, makes it pretty simple to write queries (not so deep, but you get the idea of what is happening or happened at a given time) and has great support from the community. Also, if you have some extras in your company cookie jar, you might try the hosted version of this as well (we don’t, so we are hosting it on-prem).


Kubernetes cluster
Helm (if you choose the easy way)