Scaling out Grafana with Kubernetes and AWS


When production workloads serve multiple customers with millions of users, monitoring those workloads becomes almost as important as the workloads themselves.

I don’t believe that monitoring is necessarily exclusive to technical operations teams. Nowadays it’s easier than ever to get the data that interests you: managers might want access to billing dashboards and alerts; Java developers may want to measure how many times the garbage collector was triggered and what the status of the JVM heap was when that happened; operations teams may want a global view of the production platform, including blackbox and whitebox monitoring, system metrics, databases, and more.

But that does not mean that for a few users you have to build an expensive, over-sized, over-redundant platform. You just need to be prepared to scale horizontally without pain when the moment comes.


Grafana

Grafana is an open-source tool for visualizing data from multiple data sources: Elasticsearch, Graphite, Prometheus, and many more. (Find the complete list here: http://docs.grafana.org/features/datasources/)

If you deploy Grafana’s official Docker image in your Kubernetes cluster, it works out of the box with the default parameters. It stores sessions and plugins on disk and uses an SQLite database to store all dashboards and users.

The problem is that this default architecture will not scale. State is fully coupled to the application, and we need to decouple it in order to deploy multiple replicas of Grafana. So let’s list how we can move the state out of Kubernetes using AWS services:

  • Grafana database: Grafana supports MySQL and PostgreSQL, so we can use RDS for this.
  • Grafana sessions: Grafana recommends using Redis or Memcached to cache user sessions. Fortunately, AWS provides ElastiCache, which supports both of these in-memory key/value store engines with multiple clustering options.
  • Grafana plugins: Grafana needs to store plugins on the filesystem. If we install a plugin, the other Grafana containers just need to be able to load it. Any shared filesystem could serve this purpose, and AWS provides EFS to solve it. To mount this storage as Persistent Volumes from EFS in Kubernetes, there’s an external storage plugin called EFS Provisioner. Here there’s a nice example of how to configure it.
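For reference, the storage class that the manifests later in this article refer to (aws-efs) can be declared roughly like this. This is a sketch based on the upstream EFS Provisioner example; the provisioner string must match whatever PROVISIONER_NAME you configured in the EFS Provisioner deployment, so treat example.com/aws-efs as a placeholder:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aws-efs
# Placeholder: must match the PROVISIONER_NAME of your EFS Provisioner
provisioner: example.com/aws-efs
```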

Deployment

We automate everything, from infrastructure and provisioning to deployment. Our stack is based on Kubernetes on top of AWS to deploy containerized monitoring tools, automated operations jobs (Jenkins slaves) and more. (If you are curious about how to create a fully automated private Kubernetes cluster, you can check out another article I wrote.) Terraform manages the infrastructure components: VPCs, subnets, EC2 instances, RDS, ElastiCache, EFS, etc.

We use Ansible for provisioning and for coordinating the deployment, acting as the link between the infrastructure and the container deployment itself. We describe the resources (EFS, ElastiCache, RDS) by filtering on specific tags, and once we have identified each resource we render a Jinja2 template that represents the Kubernetes Grafana deployment.
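As a rough illustration of that pattern (the identifiers, filters and file paths here are assumptions, not our actual playbook), the discover-then-render step can look something like this:

```yaml
# Hypothetical sketch: look up the tagged AWS resources, then render the
# Kubernetes manifest from a Jinja2 template that expects their endpoints.
- name: Discover the RDS instance backing Grafana
  community.aws.rds_instance_info:
    filters:
      "db-instance-id": grafana-db        # placeholder identifier
  register: grafana_rds

- name: Discover the ElastiCache clusters
  community.aws.elasticache_info:
  register: grafana_cache

- name: Render the Grafana Deployment manifest
  ansible.builtin.template:
    src: grafana-deployment.yaml.j2       # template containing RDS_ENDPOINT etc.
    dest: /tmp/grafana-deployment.yaml
```

The rendered manifest is then applied to the cluster, which is what ties the Terraform-managed infrastructure to the Kubernetes deployment below.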

We are so crazy about automation that even for the EFS Provisioner we launch a micro instance, mount the EFS filesystem, create the base directory if it doesn’t exist, kill the instance, and then deploy the EFS Provisioner pointing to the newly allocated filesystem.

But to keep within the scope of this article, I will skip the AWS resource creation and deployment coordination and focus only on the Kubernetes deployment manifests.

Assuming you have already deployed the EFS Provisioner and created the new storage class, requesting storage from EFS is as easy as:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: monitoring
  name: grafana-persistent-storage
  annotations:
    volume.beta.kubernetes.io/storage-class: "aws-efs"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

10 GiB is more than enough for plugins; we could even get by with less capacity, but for $3 a month I just didn’t want to spend more time thinking about it ;)

Now, we can deploy several replicas of Grafana:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
    component: core
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 6
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
        - image: grafana/grafana
          name: grafana
          resources:
            # keep request = limit to keep this container in the guaranteed QoS class
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
          env:
            # The following env variables set up basic auth with the default admin user and admin password.
            - name: GF_SERVER_DOMAIN
              value: "grafana.example.com"
            - name: GF_SERVER_ROOT_URL
              value: "/"
            - name: GF_AUTH_BASIC_ENABLED
              value: "true"
            - name: GF_AUTH_ANONYMOUS_ENABLED
              value: "false"
            - name: GF_SESSION_PROVIDER
              value: redis
            - name: GF_SESSION_PROVIDER_CONFIG
              value: addr=ELASTICACHE_ENDPOINT,pool_size=100,prefix=grafana
            - name: GF_DATABASE_TYPE
              value: postgres
            - name: GF_DATABASE_HOST
              value: RDS_ENDPOINT
            - name: GF_DATABASE_NAME
              value: grafanadb
            - name: GF_DATABASE_USER
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: grafanadb-user
            - name: GF_DATABASE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: grafanadb-password
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: grafana-admin-password
          readinessProbe:
            httpGet:
              path: /login
              port: 3000
          volumeMounts:
            - name: grafana-persistent-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-persistent-storage
          persistentVolumeClaim:
            claimName: grafana-persistent-storage

These are the important changes from default configuration:

  • Database configuration
    GF_DATABASE_TYPE # We chose postgres, but you could use MySQL instead
    GF_DATABASE_HOST # RDS endpoint
    GF_DATABASE_USER # grafanadb user, fetched from a Kubernetes secret
    GF_DATABASE_PASSWORD # grafanadb password, fetched from a Kubernetes secret
  • Session store configuration
    GF_SESSION_PROVIDER # We chose redis, but memcached is also an option
    GF_SESSION_PROVIDER_CONFIG # Redis config, including the ElastiCache endpoint
  • Mounting the persistent storage. Under /var/lib/grafana Grafana can keep both sessions and plugins; since we are storing sessions in Redis, only plugins will live there.
  [...]
          volumeMounts:
            - name: grafana-persistent-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-persistent-storage
          persistentVolumeClaim:
            claimName: grafana-persistent-storage
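The deployment above also assumes a grafana-credentials Secret already exists in the monitoring namespace. It isn’t shown in the manifests here, but a minimal sketch of what it needs to contain would be the following (the values are obviously placeholders; in practice they come from your secret management):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-credentials
  namespace: monitoring
type: Opaque
stringData:
  # Placeholder values: keys must match the secretKeyRef entries above.
  grafanadb-user: grafana
  grafanadb-password: changeme
  grafana-admin-password: changeme
```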

Conclusion

You may not need High Availability yet, but it’s important not to block yourself for when you do. You don’t have to make it expensive at the beginning: you can start with small RDS and cache instances, a small storage allocation, etc. The important thing is to be able to scale out easily when needed.

Thanks for reading!