Scaling out Grafana with Kubernetes and AWS

When production workloads serve multiple customers with millions of users, monitoring those workloads becomes almost as important as the workloads themselves.

I don’t believe that monitoring is necessarily exclusive to technical operations teams. Nowadays it’s easier than ever to get the data you are interested in: managers might want access to billing dashboards and alerts; Java developers may want to measure how many times the garbage collector was triggered and what the state of the JVM heap was when that happened; operations teams may want a global view of the production platform, including blackbox and whitebox monitoring, system metrics, databases, etc.

But that does not mean that you have to build an expensive, over-sized and over-redundant platform for a few users. You just need to be prepared to scale horizontally without pain when the moment comes.


Grafana is an open-source tool for visualizing data that comes from multiple data sources: Elasticsearch, Graphite, Prometheus, and many more. The Grafana documentation has the complete list.

If you deploy the official Grafana Docker image in your Kubernetes cluster, it works out of the box with the default parameters. It stores sessions and plugins on disk and uses a SQLite database to store all dashboards and users.

The problem is that the default architecture will not scale. State is fully coupled with the application, and we need to decouple it before we can deploy multiple replicas of Grafana. So let’s list how we can move the state out of Kubernetes using AWS services:

  • Grafana database: Grafana supports MySQL and Postgres, so we can use RDS for this.
  • Grafana sessions: Grafana recommends using Redis or Memcached to store user sessions. Fortunately, AWS provides ElastiCache, which supports both of these in-memory key/value store engines with multiple options for clustering.
  • Grafana plugins: Grafana needs to store plugins on the filesystem. If we install a plugin, the other Grafana containers just need to be able to load it. Any shared filesystem can serve this purpose, and AWS provides EFS for it. To mount EFS storage as Persistent Volumes in Kubernetes, there’s an external storage plugin called EFS Provisioner. The provisioner’s repository includes an example of how to configure it.
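For the plugins volume, the EFS Provisioner exposes the shared filesystem to the cluster through a storage class. A minimal sketch of what that class could look like, assuming the provisioner was deployed under the name `example.com/aws-efs` (the value used in the project's example manifests; yours may differ) and assuming we call the class `aws-efs`:

```yaml
# Hypothetical storage class backed by the EFS Provisioner.
# "aws-efs" is an assumed class name; "example.com/aws-efs" must match
# whatever provisioner name your EFS Provisioner deployment was given.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aws-efs
provisioner: example.com/aws-efs
```

Any claim that references this class gets a directory on the same EFS filesystem, which is what lets several Grafana pods see the same plugins.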


We automate everything, from infrastructure and provisioning to deployment. Our stack is based on Kubernetes on top of AWS, where we deploy containerized monitoring tools, automated operations jobs (Jenkins Slaves), and more. (If you are curious about how to create a fully automated private Kubernetes cluster, you can check out this other article I wrote.) Terraform manages the infrastructure components: VPCs, subnets, EC2 instances, RDS, ElastiCache, EFS, etc.

We use Ansible for provisioning and for coordinating the deployment; it acts as the link between the infrastructure and the container deployment itself. We look up the resources (EFS, ElastiCache, RDS) by filtering on specific tags, and once we have identified each resource we render a Jinja2 template that represents the Kubernetes Grafana deployment.

We are so crazy about automation that even for the EFS provisioner we launch a micro instance, mount the EFS filesystem, create the base directory if it does not exist, terminate the instance, and then deploy the EFS provisioner pointing to the newly allocated filesystem.

But to stay within the scope of this article, I will skip all the AWS resource creation and deployment coordination and focus only on the Kubernetes deployment manifests.

Assuming you have already deployed the EFS Provisioner creating the new storage class, requesting storage from EFS is as easy as:
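A claim along these lines would do it. This is a minimal sketch, not necessarily the exact manifest we run: the claim name `grafana-plugins` and the storage class name `aws-efs` are assumptions and must match your own setup.

```yaml
# Hypothetical claim for the shared plugins volume.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: grafana-plugins          # assumed name, referenced later by the Deployment
spec:
  storageClassName: aws-efs      # assumed name of the EFS Provisioner storage class
  accessModes:
    - ReadWriteMany              # every Grafana replica mounts the same volume
  resources:
    requests:
      storage: 10Gi
```

`ReadWriteMany` is the important part: it allows pods on different nodes to mount the volume at the same time, which a regular EBS-backed volume would not.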

10 gigabytes is more than enough for plugins. We could get by with less capacity, but for $3 a month I just didn’t want to spend more time thinking about it ;)

Now, we can deploy several replicas of Grafana:
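A deployment along these lines illustrates the idea. Treat it as a sketch under stated assumptions rather than our exact manifest: it assumes a Grafana 5.x image (the `GF_SESSION_*` settings belong to that generation of Grafana), a MySQL database on RDS, and Redis on ElastiCache. The hostnames, the `grafana-db` secret and the `grafana-plugins` claim are hypothetical placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  labels:
    app: grafana
spec:
  replicas: 3
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:5.4.3   # assumed version
          ports:
            - containerPort: 3000
          env:
            # Database: point Grafana at RDS instead of the local SQLite file.
            - name: GF_DATABASE_TYPE
              value: mysql
            - name: GF_DATABASE_HOST
              value: "grafana-db.example.internal:3306"   # placeholder RDS endpoint
            - name: GF_DATABASE_NAME
              value: grafana
            - name: GF_DATABASE_USER
              value: grafana
            - name: GF_DATABASE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-db       # hypothetical Secret
                  key: password
            # Sessions: keep them in Redis instead of on local disk.
            - name: GF_SESSION_PROVIDER
              value: redis
            - name: GF_SESSION_PROVIDER_CONFIG
              value: "addr=grafana-cache.example.internal:6379,pool_size=100"   # placeholder ElastiCache endpoint
          volumeMounts:
            # Plugins: shared EFS-backed volume.
            - name: grafana-plugins
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-plugins
          persistentVolumeClaim:
            claimName: grafana-plugins   # hypothetical claim backed by the EFS storage class
```

Environment variables of the form `GF_<SECTION>_<KEY>` override the corresponding entries in `grafana.ini`, so the image itself stays untouched and the same settings could also be supplied through a mounted configuration file.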

These are the important changes from the default configuration:

  • Database configuration
  • Session Store configuration
  • Mounting the persistent storage. By default, both sessions and plugins live under /var/lib/grafana; since we are storing sessions in Redis, only plugins will be there.


You may not need high availability right now, but it’s important not to block yourself for the day you do. You don’t have to make it super expensive at the beginning: you can start with small RDS and cache instances, a small storage allocation, etc. The important thing is to be able to scale out easily when needed.

Thanks for reading it!
