Preparing your logging stack for 10x scale using ELK & Kafka on Kubernetes

By Jatin Gupta (Platform Engineer)

UC Blogger
Urban Company – Engineering
6 min read · Mar 13, 2023


Introduction

At Urban Company, 120+ microservices, along with standalone sources such as databases and other stateful applications, pipeline their logs to an Elasticsearch cluster.

These workloads have high search and ingestion throughput. During peak hours, the ingestion rate goes up to 15M events per minute.

Previous Architecture at UC

Earlier, we had a single Elasticsearch cluster deployed on standalone EC2 servers, configured and maintained manually.

Log pipeline journey:

  • All microservices and standalone sources sent their logs to Logstash via Filebeat (a minimal Filebeat output sketch follows this list).
  • Logstash maintained a queue at its end, applied some transformations, and sent the logs to Elasticsearch.
  • Kibana queried Elasticsearch to display the logs.
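For reference, the Filebeat side of that old pipeline looked roughly like the sketch below. This is a minimal illustration only; the log path and the Logstash endpoint are hypothetical, not our actual configuration.

filebeat.inputs:
  - type: log                                  # tail application log files; path is illustrative
    paths:
      - /var/log/app/*.log
output.logstash:
  hosts: ["logstash.logging.internal:5044"]    # hypothetical Logstash endpoint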

Challenges in the previous architecture

The following were the major challenges faced in the above architecture:

Maintenance

  • Version upgrades, addition/deletion of nodes in the cluster, and disk addition activities were manual.
  • These activities required connecting to the EC2 servers via SSH, which poses a stability risk.

High Availability

  • Our ELK stack is largely deployed on spot instances, and it used to take ~15–20 minutes to recover from a spot replacement.

Scalability

  • A high ingestion rate from a single microservice could overwhelm the cluster and create a lag in Elasticsearch of up to 3 hours.

Design choices while moving to Kubernetes

During this journey, we compared different options as explained below.

ECK (Elastic Cloud on Kubernetes) or Helm charts?

  • We went ahead with ECK because it builds in automation, a self-healing mechanism, and best practices, compared to the Helm chart.
  • The operator framework is built on a reconciliation loop that continuously checks whether the cluster is in the desired state.
  • ECK provides a code-controlled way to handle all maintenance tasks.
  • Version upgrades are done in a rolling fashion with an auto-revert option.

Should we introduce a message queue between Filebeat and Logstash?

  • Earlier, we used Logstash itself to maintain persistent queues in case the Elasticsearch cluster became unavailable.
  • We introduced Kafka as a queue between Filebeat and Logstash.
  • We used the Strimzi operator for deploying Kafka on Kubernetes, as it is aligned with ECK in solving the same class of operational problems while managing Kafka (a sample Kafka resource sketch follows this list).
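For illustration, a Strimzi-managed Kafka cluster is declared with a single custom resource, roughly as below. This is a minimal sketch; the name, namespace, replica counts, storage sizes, and listener settings are illustrative and not our production values.

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: logging-kafka                # hypothetical name
  namespace: logging                 # hypothetical namespace
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: persistent-claim
      size: 100Gi                    # illustrative size
      class: gp3                     # illustrative storage class
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}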

How to manage spot instances?

  • We moved from spot.io to Karpenter for provisioning the spot nodes in our Kubernetes cluster and for handling node autoscaling (a sample Provisioner sketch follows this list).
  • This allowed us to scale faster and reduced our provisioner licensing cost to zero.
  • We use the node-termination-handler to handle spot interruptions gracefully. It helps reschedule pods on other nodes as soon as a spot replacement event is fired by AWS.
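For illustration, a Karpenter Provisioner (the v1alpha5 API current at the time of writing) that targets spot capacity looks roughly like the sketch below. The name, limits, and requirements are illustrative, and the referenced AWSNodeTemplate is omitted.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: logging-spot                 # hypothetical name
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]               # provision spot capacity
    - key: kubernetes.io/arch
      operator: In
      values: ["arm64", "amd64"]     # allow Graviton and x86 nodes
  limits:
    resources:
      cpu: "200"                     # illustrative cap on total provisioned CPU
  ttlSecondsAfterEmpty: 60           # scale empty nodes down quickly
  providerRef:
    name: logging-spot               # AWSNodeTemplate with subnets/AMIs (omitted)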

New architecture on Kubernetes

  • Filebeat runs alongside all microservice containers and pushes the logs to Kafka (see the output sketch after this list).
  • Kafka is introduced between Filebeat and Logstash to take over the queuing workload.
  • Each team has its own Elasticsearch and Logstash cluster.
  • A single Kibana is exposed across all the clusters for ease of querying.
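The Filebeat-to-Kafka leg of the new pipeline is a small change to the output section. A minimal sketch, in which the container log path, bootstrap address, and topic name are illustrative:

filebeat.inputs:
  - type: container                  # read container stdout/stderr logs
    paths:
      - /var/log/containers/*.log
output.kafka:
  hosts: ["logging-kafka-kafka-bootstrap.logging:9092"]   # hypothetical Strimzi bootstrap service
  topic: "app-logs"                  # hypothetical topic
  required_acks: 1
  compression: gzip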

How did we solve the above challenges in the new setup?

Maintenance challenges

Problem: Version upgrades, disk addition, addition/deletion of nodes, etc. were manual and had to be planned activities.

Solution: Thanks to ECK, all these changes are now code-driven with a proper review process.

Sample Elasticsearch deployment file:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: product-growth
  namespace: product-elk
spec:
  version: 7.17.3
  volumeClaimDeletePolicy: DeleteOnScaledownOnly
  http: ...
  nodeSets:
    - name: mada-nodes
      count: 3
      config:
        node.roles: ["master", "ingest", "data"]
      podTemplate:
        metadata:
          labels:
            team: growth
        spec:
          initContainers: ...
          affinity:
            nodeAffinity: ...
          containers:
            - name: elasticsearch
              resources: ...
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 60Gi
            storageClassName: es-gp3-sc

Example: increasing the disk size of an Elasticsearch cluster is now a one-line change to the manifest.
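A minimal sketch of the relevant portion of the manifest above, with the new size purely illustrative; assuming the storage class allows volume expansion, ECK propagates the change to the underlying volume claims:

      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 120Gi           # bumped from 60Gi; new size is illustrative
            storageClassName: es-gp3-sc  # must allow volume expansion for in-place resize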

Spot recovery time

Problem: We were using spot.io to manage our spot instances. On every spot replacement it went through a sequence of steps to create a new EC2 instance with the same data, which used to take around 10–15 minutes.

Solution: In our new setup with Karpenter and the node-termination-handler, pods automatically get scheduled on another node, reducing the spot replacement duration to 1–2 minutes.

Lag in Elasticsearch

Problem: Spikes in throughput during peak hours used to cause a lag in Elasticsearch.

Solution: We split our single Elasticsearch cluster into 4 separate clusters, one per team. This gives us separation of concerns; issues in one cluster do not impact the others.

Heavy load in Logstash

Problem: As the queue size grew under heavy load, our Logstash cluster used to become unresponsive.

This was because we used Logstash as a persistent queue between Filebeat and Elasticsearch.

Solution: We have now offloaded the queuing workload to Kafka, which is highly performant at moving large volumes of events from source to destination (a minimal pipeline sketch follows).
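A minimal sketch of what such a Logstash pipeline looks like, consuming from Kafka and writing to Elasticsearch. The broker address, topic, group id, Elasticsearch endpoint, and index pattern are illustrative, and authentication/TLS settings are omitted:

input {
  kafka {
    bootstrap_servers => "logging-kafka-kafka-bootstrap.logging:9092"   # hypothetical brokers
    topics            => ["app-logs"]                                   # hypothetical topic
    group_id          => "logstash-growth"
    codec             => "json"
  }
}

filter {
  # parsing / field transformations go here
}

output {
  elasticsearch {
    hosts => ["https://product-growth-es-http.product-elk:9200"]        # ECK service for the cluster above
    index => "app-logs-%{+YYYY.MM.dd}"
    # user/password and CA settings omitted for brevity
  }
}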

What other benefits did we get out of this migration?

Performance gain

Elasticsearch’s performance has improved now that the workload is distributed across different clusters with faster recovery.

In our old setup, the Elasticsearch cluster’s health used to turn red around 20–30 times in a single month, as shown in the snapshot below:

Cluster red occurrences in the old setup

This is now reduced to a rare occurrence, happening only about once a month.

Cluster red occurrences in the new setup

Cost Savings

  • We are saving ~$3,000/month with our new ELK setup through effective resource sharing in EKS and the use of Graviton-based instances wherever possible.
  • We were paying spot.io to manage our spot instances. Now that Karpenter provisions and manages them, we also save the licensing fee we used to pay spot.io.

Along with offering a generous 30-day log retention period without requiring additional disk space, we have implemented several other enhancements. Keep an eye out for our upcoming blog post dedicated to these improvements.

The team:

Ashu Kaushik is constantly seeking new challenges and opportunities to push the boundaries of what’s possible in the field of technology.

Jatin Gupta is a talented tech engineer with expertise in developing and implementing complex infrastructure solutions.

Karan Bansal is an engineering leader with a passion for all things tech. He excels at collaborating with cross-functional teams to deliver scalable solutions that drive business success.

Kushal Singh has a proven track record of delivering top-quality software solutions that exceed expectations and is always eager to take on the next big thing in tech.

Sounds like fun?
If you enjoyed this blog post, please clap 👏(as many times as you like) and follow us (@UC Blogger). Help us build a community by sharing on your favorite social networks (Twitter, LinkedIn, Facebook, etc).

You can read more about us in our publications —
https://medium.com/uc-design
https://medium.com/uc-engineering
https://medium.com/uc-culture

If you are interested in finding out about opportunities, visit us at http://careers.urbancompany.com
