DevOps — Setting up ELK
In this update from DevOps, we’re going to talk about logging and the solution we desperately needed so that devs could work out what was happening within the platform. If you haven’t already, do check out our previous blog post, which starts off our journey with Postgres Clustering.
Logging is a vital area for our development team: they need to be able to trace and debug problems as they occur, and we need to ensure those logs are always available to them.
Around a month ago we began looking at the ELK stack and how we could utilise it within our Kubernetes platform. We wanted to establish one location for all the pods’ stdout/stderr logs, ensure devs had full access to them at all times, and provide granularity in visualising the data.
The first deployment took a couple of weeks as we figured out the stack and got used to the tools and how they all interlink. We first launched the Elasticsearch Docker image into Kubernetes as a single pod, alongside a single Kibana pod and a Fluentd pod doing the log gathering. This worked well for a couple of weeks. However, both Elasticsearch and Fluentd ended up killing the entire node due to their memory consumption.
This was a failure on my part, as I was missing a number of things. First, in the Elasticsearch deployment I hadn’t included any resource limitations or the following Java heap space values. The pod kept crashing because it had used its entire heap.
- name: ES_JAVA_OPTS
value: "-XmsVALUEm -XmxVALUEm"
The problem with Fluentd was that I hadn’t specified the resource limits and requests as above. So a node running 32 GB of RAM, which Fluentd was deployed on, was dying because it ran out of memory, as I hadn’t limited Fluentd’s allocated amount.
The benefit of limiting a pod’s resources is that when it reaches its limit, the pod itself dies rather than the entire node. When running in production, you’d rather lose one pod than an entire node’s worth.
The following image represents the second way in which I looked at deploying the stack:
It seemed quite straightforward, but in the end it was a lot of pods on a cluster where pod space was quite limited.
The solution we managed to get working is a single Elasticsearch pod. We’ve been sure to limit its CPU and memory allocations and increase the Java heap size to ensure the pod can cope.
We then have a standard deployment of Kibana that runs on half a gig of RAM and 250m of CPU. It handles the load well.
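In deployment terms that amounts to a resources block along these lines (a sketch matching the figures above):

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "250m"
```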
Finally, Fluentd is deployed as a DaemonSet on each node, forwarding our logs into Elasticsearch, and runs restricted to 200 MB. We have a ConfigMap in place so that we can update the config more easily.
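A minimal sketch of that DaemonSet, assuming a Fluentd image with the Elasticsearch output plugin and a ConfigMap named `fluentd-config` (both names are illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluentd  # an image with the Elasticsearch output plugin
          resources:
            limits:
              memory: "200Mi"  # the restriction mentioned above
          volumeMounts:
            - name: config
              mountPath: /fluentd/etc
            - name: varlog
              mountPath: /var/log
      volumes:
        - name: config
          configMap:
            name: fluentd-config
        - name: varlog
          hostPath:
            path: /var/log  # so Fluentd can tail the container logs on the node
```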
X-Pack was another huge problem during this entire process, as the newer Docker images all contain X-Pack by default. I couldn’t for the life of me find a way to disable X-Pack without fully removing it from the image.
X-Pack offers some great features that could be really useful, but the cost of using this plugin is too great for the level of logging currently required for our team.
In the end we created a fresh Docker image, based on the original Elasticsearch and Kibana, but removed the X-pack plugin.
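The Dockerfiles boiled down to removing the plugin on top of the official image; something along these lines (the version tag is an example, pin it to whichever release you run — X-Pack shipped as a removable plugin in the 5.x/6.x images):

```dockerfile
# Elasticsearch with the X-Pack plugin stripped out
FROM docker.elastic.co/elasticsearch/elasticsearch:6.2.4
RUN bin/elasticsearch-plugin remove x-pack
```

The Kibana image is the equivalent: base it on the official Kibana image and run `bin/kibana-plugin remove x-pack`.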
We also needed to think of a way to capture logs from sources external to Kubernetes, e.g. dev servers yet to be decommissioned and logs from services that the infra department requires.
We also launched a Graylog instance so that logs can be forwarded over UDP into Elasticsearch. This was the simplest way for us, as Fluentd seemed unable to take multiple `<source>` tags to process input coming from both pods and external sources.
Graylog uses Elasticsearch too, which in turn means those logs show up in Kibana. Graylog of course has its own way of handling logs and visualising the data, but we thought we’d keep things consistent and view the data through Kibana for the time being.
The next challenge is to create a multitude of visualisations that the dev team and infra team can explore and use to their advantage. This could range from quick samples of how many times a specific log level (info, warn, error, etc.) occurs, to detailed representations of how many connections a service has reported or lost.
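As a starting point for the log-level counts, a single terms aggregation against the log index does the job. This sketch assumes the level is indexed under a `level` keyword field, which will depend on your Fluentd record format; POST it to your log index’s `_search` endpoint:

```json
{
  "size": 0,
  "aggs": {
    "log_levels": {
      "terms": { "field": "level.keyword" }
    }
  }
}
```

The response buckets then give one document count per level, which maps straight onto a Kibana bar or pie visualisation.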
This is an ongoing project and I’m sure people will agree there are many ways we can improve our deployment. I’m currently exploring how we can return to the cluster setup described above (probably the more practical way of deploying ELK), as our current solution probably won’t scale very well.
We’d be interested in any of your insights or suggestions, as we’re always looking to improve our deployments for logging and monitoring. Feel free to leave us a comment below.