Log aggregation with ElasticSearch, Fluentd and Kibana stack on ARM64 Kubernetes cluster

This article was updated on 18/Jan/2019 to reflect the latest repository changes: images bumped to 6.5.4 and support for a multi-arch cluster composed of X64 (Intel) and ARM64 hosts. The project keeps evolving, so it might be newer than what is described here.

A while back, as a proof of concept, I set up a full log aggregation stack for Kubernetes with ElasticSearch, Fluentd and Kibana on ARM64 SBCs using my Rock64 Kubernetes cluster. This deployment was based on this great project by Paulo Pires, with some adaptations.

Since then, Pires has discontinued the project, but my fork contains all the latest files: the cluster manifests, image Dockerfiles and build script, a Kibana dashboard and detailed information in the Readme: https://github.com/carlosedp/kubernetes-elasticsearch-cluster.

Typical stack architecture

Images

All included images that depend on Java were built using OpenJDK 1.8.0_181. I recently wrote a post testing the performance of multiple JVMs on ARM64 and found this version provides the best performance. Previously I used Oracle Java in a custom Docker image that is still in the repository.

The project is composed of the ElasticSearch base image and the ES image with a custom configuration for the Kubernetes cluster. The Dockerfiles are a little long so I won't paste them here, but you can check them in the repository.

I also created an image for ElasticSearch Curator, a tool that cleans up ES indexes older than a set number of days, as well as the Kibana image.

The Fluentd image was based on the Kubernetes addons image found here with minor adaptations.

All the image sources can be found in the repo and are pushed to my Dockerhub account. Having separate images for these steps makes it easier to update individual components. The images are all multi-architecture, with X64 (Intel) and ARM64 variants and a manifest that points to both.

Deployment

The deployment is done in the "logging" namespace and there is a script to automate the setup and tear-down. You can either use it or deploy manually to follow the process step by step. The standard manifests at the root level deploy a three-node cluster where all nodes perform all roles (Master, Ingest and Data). I found this to be better tailored to SBCs like the ones I use.

There is also a separate set of manifests to deploy a four-node cluster where each node has its own role (1 Ingest, 1 Master and 2 Data). To deploy this option, use the included deploy script in the separate-roles directory.

First create the namespace and an alias to ease the deployment commands:

kubectl create namespace logging
alias kctl='kubectl --namespace logging'

Full roles

To deploy the simpler version, there is a set of manifests in the root dir that can be installed with:

kctl apply -f es-full-svc.yaml
kctl apply -f es-full-statefulset.yaml

Separate Roles

To deploy with separate roles, here is a more detailed walkthrough of each step.

Services and Ingresses

There are three services in the stack: one for the Kibana web interface, one for the ElasticSearch API on port 9200 and one for ElasticSearch internal node communication on port 9300.

To accompany the services, there are two ingresses (actually three in my cluster) that allow external access to the stack: one for the Kibana web interface, another for Kibana external access (in my case, since I use two Traefik ingress controllers, one for the internal domain and another for the external, internet-valid domain) and the last one for the ElasticSearch API. Adjust the domains and which ingresses you need according to your environment.
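As an illustration, an ingress for the Kibana web interface could look roughly like this (a sketch only; the hostname, annotation and service name/port are assumptions, check the manifests in the repo for the real definitions):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kibana-ingress
  namespace: logging
  annotations:
    kubernetes.io/ingress.class: traefik
spec:
  rules:
  - host: kibana.internal.domain.com
    http:
      paths:
      - backend:
          serviceName: kibana
          servicePort: 5601      # Kibana's default port; adjust to the service definition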

Start by deploying the services for ES:

kctl apply -f ./separate-roles/es-master-svc.yaml
kctl apply -f ./separate-roles/es-ingest-svc.yaml
kctl apply -f ./separate-roles/es-data-svc.yaml

ElasticSearch

In ES you have the master nodes that control and coordinate the cluster, the data nodes responsible for storing, searching and replicating the data, and the ingest (client) nodes responsible for the API calls and for ingesting data from the collectors. More details on each role here.
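In elasticsearch.yml terms, these roles map to three boolean settings (in this project they end up being driven by environment variables in the manifests, so this is just for orientation):

# Role flags in elasticsearch.yml (ES 6.x); a dedicated master node would use:
node.master: true
node.data: false
node.ingest: false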

Master node

ElasticSearch master nodes can be deployed as a single instance or as a quorum with multiple nodes. This is controlled by the "replicas" parameter in the deployment manifest and must also be adjusted in the NUMBER_OF_MASTERS environment variable in the manifest, according to the documentation. In my deployment I have only one master with one requested CPU and a maximum of 512MB of memory, limited through Java's Xms and Xmx parameters. The master also has a PersistentVolume that, in my case, uses a StorageClass backed by an NFS server. More info in my Kubernetes cluster post linked at the start of the article.

A good explanation by Paulo on the replicas and masters is quoted here:

Why does NUMBER_OF_MASTERS differ from number of master-replicas?
The default value for this environment variable is 2, meaning a cluster will need a minimum of 2 master nodes to operate. If a cluster has 3 masters and one dies, the cluster still works. Minimum master nodes are usually n/2 + 1, where n is the number of master nodes in a cluster. If a cluster has 5 master nodes, one should have a minimum of 3, less than that and the cluster stops. If one scales the number of masters, make sure to update the minimum number of master nodes through the Elasticsearch API as setting environment variable will only work on cluster setup.
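For orientation, the relevant fragment of the master StatefulSet looks roughly like the following (a sketch with illustrative values; the real manifest is separate-roles/es-master-statefulset.yaml in the repository):

# Illustrative fragment only; see separate-roles/es-master-statefulset.yaml
spec:
  replicas: 1                      # number of master nodes
  template:
    spec:
      containers:
      - name: es-master
        env:
        - name: NUMBER_OF_MASTERS  # minimum masters, usually n/2 + 1
          value: "1"
        - name: ES_JAVA_OPTS
          value: "-Xms512m -Xmx512m"
        resources:
          requests:
            cpu: "1"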

To deploy the Master StatefulSet and the ConfigMap, use:

kctl apply -f es-configmap.yaml
kctl apply -f ./separate-roles/es-master-statefulset.yaml

You can check that the Master was deployed and look at its logs before proceeding:

$ kctl rollout status statefulset es-master
statefulset "es-master" successfully rolled out

# List the pods and check the master node logs
kctl get pods
kctl logs es-master-0

The configuration can be overridden with the elasticsearch-configmap that is shared among all nodes.
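To give an idea of what goes in there, an override ConfigMap could look like this (illustrative content only; the actual settings live in the repository's es-configmap.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-configmap
  namespace: logging
data:
  elasticsearch.yml: |
    cluster.name: myesdb
    network.host: 0.0.0.0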

Ingest Node

The ingest nodes are the ones that receive the API calls and ingest the data into the cluster. I deployed only one node, but this can be adjusted with the "replicas" parameter in the manifest.

The node has 2 requested CPUs, 1GB of memory and its persistent data stored in the StorageClass. The manifest format is similar to the Master node's.

Deploy it with:

kctl apply -f ./separate-roles/es-ingest-statefulset.yaml
kctl rollout status statefulset es-ingest

You can check the node logs with the kctl logs es-ingest-0 command until you see lines indicating the node has joined the cluster.

Data Node

The Data nodes are responsible for the heavy lifting in the cluster. Since they replicate data, I've deployed them as a StatefulSet with two nodes, with their data in the StorageClass.

ElasticSearch reports cluster health using three colors: red, yellow and green. If your Data nodes have no replication, ES will mark your cluster and your indexes (where the data is stored) as yellow. This post has a nice explanation of this. More details on checking your cluster health below.
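You can also check the health color per index through the _cat API, for example listing only the indexes that are still yellow (the hostname is an example):

curl -s "http://elasticsearch.internal.domain.com/_cat/indices?v&health=yellow"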

The Data nodes are allocated with 1 requested CPU, a 3-CPU limit and 1GB of Java memory. The StatefulSet also has liveness and readiness probes to control rollouts and restart the pods in case of problems, and anti-affinity rules to avoid running the Data nodes on the same server.
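A rough sketch of how the anti-affinity and probes are expressed in a StatefulSet (illustrative labels and timings; see separate-roles/es-data-statefulset.yaml for the actual configuration):

# Illustrative fragment only
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                role: data
            topologyKey: kubernetes.io/hostname
      containers:
      - name: es-data
        livenessProbe:
          tcpSocket:
            port: 9300           # ES transport port
          initialDelaySeconds: 60
        readinessProbe:
          tcpSocket:
            port: 9300
          initialDelaySeconds: 30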

To deploy, use:

kctl apply -f ./separate-roles/es-data-statefulset.yaml
until kctl rollout status statefulset es-data > /dev/null 2>&1; do sleep 1; printf "."; done

Check the node logs with kctl logs es-data-0 and kctl logs es-data-1 until you see lines indicating both nodes have joined the cluster.

Cluster Status

First check all elements were deployed correctly in Kubernetes:

$ kubectl get svc,ingresses,deployments,pods,statefulsets -l component=elasticsearch
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/elasticsearch NodePort 10.111.92.223 <none> 9200:32658/TCP 30h
service/elasticsearch-discovery ClusterIP 10.96.98.85 <none> 9300/TCP 30h

NAME READY STATUS RESTARTS AGE
pod/es-full-0 1/1 Running 0 28h
pod/es-full-1 1/1 Running 2 24h
pod/es-full-2 1/1 Running 0 30h

NAME READY AGE
statefulset.apps/es-full 3/3 30h

After this, you can query the ElasticSearch cluster for its status:

$ curl http://elasticsearch.internal.domain.com
{
"name" : "es-full-0",
"cluster_name" : "myesdb",
"cluster_uuid" : "k1j8Cqw6TySMQkS6MRuYMg",
"version" : {
"number" : "6.5.4",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "d2ef93d",
"build_date" : "2018-12-17T21:17:40.758843Z",
"build_snapshot" : false,
"lucene_version" : "7.5.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}

And health:

$ curl http://elasticsearch.internal.domain.com/_cluster/health?pretty
{
"cluster_name": "myesdb",
"status": "green",
"timed_out": false,
"number_of_nodes": 3,
"number_of_data_nodes": 3,
"active_primary_shards": 24,
"active_shards": 27,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 100
}

And check each node status and load:

$ curl http://elasticsearch.internal.carlosedp.com/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.36.0.4 51 91 26 4.13 5.52 4.51 mdi - es-full-2
10.32.0.15 40 85 19 0.63 1.22 1.32 mdi * es-full-0
10.46.0.11 25 90 15 1.87 1.90 2.03 mdi - es-full-1

This shows that you have a healthy cluster (all green) and that the nodes "found each other". Now on to the ingest and upkeep chains.

Cluster Manager — Cerebro

To manage the cluster, I've built and deployed Cerebro, a tool that evolved from the kopf plugin. The manifest contains the deployment, the service and the ingress to access its web interface. Adjust it to your own domain.

Snapshot of my full-role cluster ingesting logs
kctl apply -f cerebro.yaml
kctl apply -f cerebro-external-ingress.yaml

Curator

Deploy ElasticSearch Curator to keep your cluster clean and tidy. You can set the number of days it will keep your data in the ConfigMap before applying it.

Curator is deployed as a Kubernetes CronJob that runs every night; the schedule can be adjusted in the es-curator-cronjob.yaml manifest. The retention period is set in the ConfigMap:

...
unit: days
unit_count: 10
...
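The nightly schedule itself is a standard CronJob field; a fragment like the following (illustrative name and schedule) is what you would change in es-curator-cronjob.yaml:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: es-curator
  namespace: logging
spec:
  schedule: "0 1 * * *"    # run every night at 01:00; adjust as needed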

Deploy with:

kctl apply -f es-curator-configmap.yaml
kctl apply -f es-curator-cronjob.yaml

Fluentd

Finally, deploy Fluentd, the daemon that runs on all your nodes and collects the logs from all deployed containers.

There are many more options, like collecting data from the node itself, but that would require additional configuration. This deployment uses the configs provided by the Kubernetes add-on repository.
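As a minimal illustration of what that configuration does (a heavily simplified sketch; the real fluentd-configmap.yaml carries the full add-on config and different names), container logs are tailed from the node and tagged for later enrichment:

# Simplified, illustrative ConfigMap fragment
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  containers.input.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
      </parse>
    </source>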

kctl apply -f fluentd-configmap.yaml
kctl apply -f fluentd-daemonset.yaml

After deploying this, you can see the data being ingested into the cluster with the command curl -s "http://elasticsearch.internal.domain.com/_cat/indices?v&s=index:desc"

Ingested data — Indexes

I’ve set a watch command to keep monitoring the ingest progress:

watch -n 5 'curl -s "http://elasticsearch.internal.carlosedp.com/_cat/nodes?v"; echo "\n\n"; curl -s "http://elasticsearch.internal.carlosedp.com/_cat/indices?v&s=index:desc"|head -30'

This might take some time depending on the amount of logs you have; in my case it took more than 2 hours. The data nodes also replicate the data between themselves, so in the beginning you might see the health of each index as "yellow", but after a while they should become "green".

Kibana

Finally, Kibana: after all, you want to see your data. I haven't delved too deeply into customizing it or studying the Lucene query syntax, but there are lots of good posts around.

Deploy it using:

kctl apply -f kibana-configmap.yaml
kctl apply -f kibana-svc.yaml
kctl apply -f kibana-deployment.yaml
kctl apply -f kibana-external-ingress.yaml

Then access Kibana at the URL defined in your ingress configuration (for example http://kibana.internal.domain.com).

First you need to tell Kibana how to identify your index patterns. The application helps you with this in the "Discover" menu, but it's just a matter of creating a new index pattern with "logstash*" as the index name and @timestamp as the time series field.
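If you prefer to script this step, Kibana 6.x also exposes a saved objects API that can create the same index pattern (the hostname below is an example matching your Kibana ingress):

curl -XPOST "http://kibana.internal.domain.com/api/saved_objects/index-pattern" \
  -H "Content-Type: application/json" -H "kbn-xsrf: true" \
  -d '{"attributes": {"title": "logstash*", "timeFieldName": "@timestamp"}}'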

You can also see and search the logs in "Discover" and add a couple of columns to ease visualization, as I did with the node name, pod name and the log message itself:

This is a simple dashboard I created showing the amount of logs ingested and the distribution between all pods; below it I filter some errors and count them.

Showing the logs and error logs:

Also a heat-map:

To create this, I built a few visualizations and assembled them into the dashboard:

I've exported my dashboard and visualizations and uploaded them to the repository. You can import them in Management -> Saved Objects -> Import. I've never tried importing them, so YMMV.

Conclusion

This stack gives you a good overview and a pretty functional log aggregation platform. Just be aware that, due to CPU and memory limitations on these SBCs, the response times for queries might be a little high (I had to adjust Kibana's timeout configuration to handle this).
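For reference, the Kibana timeout I mention is most likely the elasticsearch.requestTimeout setting (30000 ms by default), which can be raised in kibana.yml, for example through the kibana-configmap. An illustrative value:

# kibana.yml fragment; 120000 ms is an arbitrary example value
elasticsearch.requestTimeout: 120000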

Please send me feedback, corrections and suggestions on http://twitter.com/carlosedp, and open issues or pull requests on the project site at https://github.com/carlosedp/kubernetes-elasticsearch-cluster.