Scaling out Grafana with Kubernetes and AWS


When production workloads serve multiple customers with millions of users, monitoring those workloads becomes almost as important as the workloads themselves.

I don’t believe that monitoring is necessarily exclusive to technical operations teams. Nowadays it’s easier than ever to get the data that interests you: managers might want access to billing dashboards and alerts; Java developers may want to measure how many times the garbage collector was triggered and what the status of the JVM heap was when that happened; operations teams may want a global view of the production platform, including blackbox and whitebox monitoring, system metrics, databases, and more.

But that does not mean that for a few users you have to build an expensive, over-sized, over-redundant platform. You just need to be prepared to scale horizontally without pain when the moment comes.


Grafana

Grafana is an open-source tool for visualizing data from multiple data sources: Elasticsearch, Graphite, Prometheus, and many more. (Find the complete list here: http://docs.grafana.org/features/datasources/)

If you deploy Grafana’s official Docker image in your Kubernetes cluster, it works out of the box with the default parameters. It stores sessions and plugins on disk and uses an SQLite database to store all dashboards and users.

The problem is that this default architecture will not scale. State is fully coupled to the application, and we need to decouple it in order to deploy multiple replicas of Grafana. So let’s list how we can move the state out of Kubernetes using AWS services:

  • Grafana database: Grafana supports MySQL and PostgreSQL, so we can use RDS for this.
  • Grafana sessions: Grafana recommends using Redis or Memcached to cache user sessions. Fortunately, AWS provides ElastiCache, which supports both of these in-memory key/value store engines with multiple clustering options.
  • Grafana plugins: Grafana needs to store plugins on the filesystem. If we install a plugin, the other Grafana containers just need to be able to load it. Any shared filesystem could serve this purpose, and AWS provides EFS to solve it. To mount this storage as Persistent Volumes from EFS in Kubernetes, there’s an external storage plugin called EFS Provisioner. Here there’s a nice example of how to configure it.
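For reference, the storage class that the manifests later in this article refer to (aws-efs) can be declared roughly like this. This is a sketch based on the upstream EFS Provisioner example; the provisioner string must match whatever PROVISIONER_NAME you configured in the EFS Provisioner deployment, so treat example.com/aws-efs as a placeholder:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aws-efs
# Placeholder: must match the PROVISIONER_NAME of your EFS Provisioner
provisioner: example.com/aws-efs
```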

Deployment

We automate everything, from infrastructure and provisioning to deployment. Our stack is based on Kubernetes on top of AWS to deploy containerized monitoring tools, automated operations jobs (Jenkins slaves) and more. (If you are curious about how to create a fully automated private Kubernetes cluster, you can check out another article I wrote.) Terraform manages the infrastructure components: VPCs, subnets, EC2 instances, RDS, ElastiCache, EFS, etc.

We use Ansible for provisioning and for coordinating the deployment, acting as the link between the infrastructure and the container deployment itself. We describe the resources (EFS, ElastiCache, RDS) by filtering on specific tags, and once we have identified each resource we render a Jinja2 template that represents the Kubernetes Grafana deployment.
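As a rough illustration of that pattern (the identifiers, filters and file paths here are assumptions, not our actual playbook), the discover-then-render step can look something like this:

```yaml
# Hypothetical sketch: look up the tagged AWS resources, then render the
# Kubernetes manifest from a Jinja2 template that expects their endpoints.
- name: Discover the RDS instance backing Grafana
  community.aws.rds_instance_info:
    filters:
      "db-instance-id": grafana-db        # placeholder identifier
  register: grafana_rds

- name: Discover the ElastiCache clusters
  community.aws.elasticache_info:
  register: grafana_cache

- name: Render the Grafana Deployment manifest
  ansible.builtin.template:
    src: grafana-deployment.yaml.j2       # template containing RDS_ENDPOINT etc.
    dest: /tmp/grafana-deployment.yaml
```

The rendered manifest is then applied to the cluster, which is what ties the Terraform-managed infrastructure to the Kubernetes deployment below.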

We are so crazy about automation that even for the EFS Provisioner we launch a micro instance, mount the EFS filesystem, create the base directory if it doesn’t exist, kill the instance, and then deploy the EFS Provisioner pointing to the newly allocated filesystem.

But to keep within the scope of this article, I will skip the AWS resource creation and deployment coordination and focus only on the Kubernetes deployment manifests.

Assuming you have already deployed the EFS Provisioner and created the new storage class, requesting storage from EFS is as easy as:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: monitoring
  name: grafana-persistent-storage
  annotations:
    volume.beta.kubernetes.io/storage-class: "aws-efs"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

10 GiB is more than enough for plugins; we could even get by with less capacity, but for $3 a month I just didn’t want to spend more time thinking about it ;)

Now, we can deploy several replicas of Grafana:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
    component: core
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 6
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
        - image: grafana/grafana
          name: grafana
          resources:
            # keep request = limit to keep this container in the guaranteed QoS class
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
          env:
            # The following env variables set up basic auth with the default admin user and admin password.
            - name: GF_SERVER_DOMAIN
              value: "grafana.example.com"
            - name: GF_SERVER_ROOT_URL
              value: "/"
            - name: GF_AUTH_BASIC_ENABLED
              value: "true"
            - name: GF_AUTH_ANONYMOUS_ENABLED
              value: "false"
            - name: GF_SESSION_PROVIDER
              value: redis
            - name: GF_SESSION_PROVIDER_CONFIG
              value: addr=ELASTICACHE_ENDPOINT,pool_size=100,prefix=grafana
            - name: GF_DATABASE_TYPE
              value: postgres
            - name: GF_DATABASE_HOST
              value: RDS_ENDPOINT
            - name: GF_DATABASE_NAME
              value: grafanadb
            - name: GF_DATABASE_USER
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: grafanadb-user
            - name: GF_DATABASE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: grafanadb-password
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: grafana-admin-password
          readinessProbe:
            httpGet:
              path: /login
              port: 3000
          volumeMounts:
            - name: grafana-persistent-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-persistent-storage
          persistentVolumeClaim:
            claimName: grafana-persistent-storage

These are the important changes from default configuration:

  • Database configuration
    GF_DATABASE_TYPE # We chose postgres, but you could use MySQL instead
    GF_DATABASE_HOST # RDS endpoint
    GF_DATABASE_USER # grafanadb user, fetched from a Kubernetes secret
    GF_DATABASE_PASSWORD # grafanadb password, fetched from a Kubernetes secret
  • Session store configuration
    GF_SESSION_PROVIDER # We chose redis, but memcached is also an option
    GF_SESSION_PROVIDER_CONFIG # Redis config, including the ElastiCache endpoint
  • Mounting the persistent storage. Under /var/lib/grafana Grafana can keep both sessions and plugins; since we are storing sessions in Redis, only plugins will live there.
  [...]
          volumeMounts:
            - name: grafana-persistent-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-persistent-storage
          persistentVolumeClaim:
            claimName: grafana-persistent-storage
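The deployment above also assumes a grafana-credentials Secret already exists in the monitoring namespace. It isn’t shown in the manifests here, but a minimal sketch of what it needs to contain would be the following (the values are obviously placeholders; in practice they come from your secret management):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-credentials
  namespace: monitoring
type: Opaque
stringData:
  # Placeholder values: keys must match the secretKeyRef entries above.
  grafanadb-user: grafana
  grafanadb-password: changeme
  grafana-admin-password: changeme
```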

Conclusion

You may not need High Availability yet, but it’s important not to block yourself for when you do. You don’t have to make it expensive at the beginning: you can start with small RDS and cache instances, a small storage allocation, etc. The important thing is to be able to scale out easily when needed.

Thanks for reading!