Auto scaling in Kubernetes using Kafka and application metrics — part 1

Niklas Uhrberg
8 min read · May 31, 2022

This is a case study on auto scaling long-running jobs in Kubernetes using external metrics from Kafka and from the application itself. The setup uses the recommended Kubernetes mechanisms to get the job done and saves us a lot of work compared to implementing the scaling ourselves.

Requirements

First things first, this application has rather specific requirements that are somewhat at odds with those of more typical applications: it runs potentially long-running data analytics jobs that must not be shut down more often than necessary. Pertaining to the auto scaling side of things, the following summarizes the properties we want to achieve:

  1. A job must start as soon as possible.
  2. The cost should be as low as possible.
  3. No job should be shut down unless necessary.

These requirements are satisfied by auto scaling: we respond to varying load and keep the computing costs at a minimum by scaling down unused resources. Point 3 in practice means that we accept that jobs get shut down during a rolling update, but not as a consequence of scaling down. Let’s keep this in mind until later and tackle the basics first.

The metrics

As an orientation I’ll just mention that the application consists of a couple of microservices, and we will call the service running the analytics jobs the executor service. An instance of the executor service runs one job at a time, and the jobs are started in a master-worker fashion where the master distributes jobs whose descriptions are provided as Kafka messages.

Now we have the information needed to articulate the metrics needed to implement the criteria for scaling up vs down:

  1. Kafka consumer lag
  2. Idle executors metric

The consumer lag metric simply means that if we have jobs queued up in the Kafka topic, we want to scale up. (Distributing a job to an idle executor is a very quick operation, so we can assume that if there are idle executors there will be no Kafka consumer lag.) Since the jobs are potentially long running we really want to scale up even if there is just one single job waiting to be started; the executors busy running other jobs could be busy for hours. To be more exact, the consumer lag is the difference between the latest offset and the consumer group offset per Kafka topic and partition.

The idle executors metric:

(total executors - idle executors + 1) / total executors

serves two purposes. Firstly, it renders a metric value below 1 if there is more than one idle executor. This is desirable due to requirement 2: we don’t want to pay for idle Kubernetes pods just sitting around. (Note that the Kafka consumer lag metric alone does not suffice since even if its value is zero, all executors may be busy and should absolutely not be scaled down.)

Secondly, due to requirement 1 there must always be one idle executor, since otherwise it takes longer than necessary to start a job. In other words, if there are zero idle executors a new executor instance will be started to keep our execution environment one step ahead. Of course, the metric can be altered to achieve other properties; this is just a study of one particular case.
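
As a quick illustration with assumed numbers, and assuming the HPA target value for this metric is 1 (so that exactly one idle executor is the equilibrium):

total = 4, idle = 0: (4 - 0 + 1) / 4 = 1.25, above target, scale up
total = 4, idle = 1: (4 - 1 + 1) / 4 = 1.00, at target, no change
total = 4, idle = 3: (4 - 3 + 1) / 4 = 0.50, below target, scale down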

With this as a background it’s time to zoom in on the K8S auto scaling apparatus with the HorizontalPodAutoscaler as the centerpiece.

Kubernetes auto scaling

There is a great deal to be said about auto scaling in Kubernetes, but I will confine this text to the case at hand. The scaling is achieved by creating a feedback loop in which the HorizontalPodAutoscaler (HPA) uses the Kubernetes metrics APIs (in our case the external metrics API) to obtain the data it needs and then manipulates the replicas attribute of the deployment. Yes, the executor service is just an ordinary Kubernetes deployment, making it very straightforward to deploy and upgrade.

The diagram shows more or less the complete picture of the components involved.

The External Metrics API is Kubernetes’ own take on a metrics API for metrics that can originate from anywhere. If there are metrics in Prometheus that we want to use, Prometheus Adapter is a service that bridges the two by providing an implementation of the K8S external metrics API backed by Prometheus.

The rest of the diagram, on the right side, is just the paths from Prometheus Adapter to the sources of information: Kafka and our own application, where the distributor service happens to be the exporter of the total executors and idle executors metrics.

Between Prometheus and each source of metrics there is a K8S resource of kind ServiceMonitor that we must configure. Its function is to tell Prometheus which endpoints to scrape so that the data becomes available in Prometheus. In turn, Prometheus will pick up all serviceMonitors in its namespace, and a serviceMonitor may select targets in multiple namespaces.

Let’s reiterate this for the Kafka metrics just to go through the steps one by one:

  1. Kafka makes the underlying data (topic offsets and consumer group offsets) obtainable.
  2. Kafka Exporter collects data from Kafka and exposes this via an HTTP endpoint.
  3. A serviceMonitor in K8S defines the application from which Prometheus will scrape the metrics periodically. The serviceMonitor is a CRD (custom resource definition) and is best understood as a way to provision Prometheus to know what endpoints to scrape metrics from.
  4. Prometheus pulls metrics periodically from Kafka Exporter using the information defined by the serviceMonitor.
  5. Shifting the perspective and starting from the HPA: it monitors the external metrics API.
  6. When the HPA requests information from the K8S external metrics API, Prometheus Adapter in turn fetches the underlying metrics from Prometheus.
  7. The HPA scales the deployment up or down based on a change in the metrics.
  8. The metrics will change as a consequence of the scaling operation and the feedback loop is closed.

HPA basics

The HPA has a number of interesting properties; let’s briefly cover a few of them, with a sample manifest showing how to configure them below. (See the K8S documentation for comprehensive details.)

  1. There can be multiple metrics in use, which we have seen is necessary to accomplish our goal. The HPA computes a desired replica count for each metric and uses the largest for the scaling decision.
  2. The scaling increment can be specified in different ways (e.g. absolute value or percentage). Here we have chosen an absolute value of at most 3 pods per 15 seconds.
  3. The behavior can differ between scaling up and scaling down, which is not utilized in this configuration except for the stabilization window attribute in the scaleDown section. This property reduces flapping, i.e. undesired scaling when the metrics fluctuate a lot.

  4. Not shown in this configuration but worth pointing out is the possibility of using multiple policies for scaling up or down, in which case the largest value will be used.

  5. Lastly, the target replica count is calculated as

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

which in our case means that the Kafka consumer lag will have a disproportionately large influence on the desired replica count. This is mitigated by the policy of adding at most 3 new pods per 15 seconds; we will keep scaling up as long as there is consumer lag.
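
To make this concrete with assumed numbers: with 2 current replicas, a consumer lag of 10 and a target value of 1, the formula yields ceil[2 * (10 / 1)] = 20 desired replicas. The scale-up policy caps the change at 3 pods per 15 seconds, so the HPA instead steps up gradually for as long as the lag persists.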

HPA manifest
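
A minimal sketch of such a manifest, assuming the autoscaling/v2 API, a deployment named executor, an illustrative maximum of 10 replicas, and a hypothetical executor_scaling_index metric representing the idle executors expression above (all names, bounds and target values must be adapted to your environment):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: executor-hpa                       # assumed name
  namespace: your-application-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: executor                         # assumed deployment name
  minReplicas: 1
  maxReplicas: 10                          # illustrative upper bound
  metrics:
    # Scale up whenever jobs are queued on the topic.
    - type: External
      external:
        metric:
          name: kafka_consumergroup_lag
          selector:
            matchLabels:
              consumergroup: your-kafka-consumergroup
              topic: your-kafka-topic
        target:
          type: Value
          value: "1"
    # Hypothetical application metric implementing (total - idle + 1) / total.
    - type: External
      external:
        metric:
          name: executor_scaling_index     # hypothetical metric name
        target:
          type: Value
          value: "1"
  behavior:
    scaleUp:
      policies:
        - type: Pods
          value: 3                         # at most 3 new pods ...
          periodSeconds: 15                # ... per 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300      # assumed window; reduces flapping
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60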

Installation

Now let’s get our hands dirty bringing this to life on Kubernetes. Note that if you only care about the Kafka consumer lag metric (perhaps you don’t have an application metric yet or don’t need one), simply leave out the parts dealing with the application metric.

Assuming Kafka, Kafka Exporter and Prometheus are already installed we now need Prometheus Adapter, the Service Monitors and the HPA together with the roles needed to use the K8S metrics. In summary the steps are:

  1. Install Prometheus Adapter
  2. Install serviceMonitors
  3. Install authorization manifest for the HPA
  4. Install the HPA

As a side note, be aware that the setup the author used was installed with Prometheus Operator, and there may be other ways to achieve the same results with other setups. The procedure described here has been verified to work with Prometheus Operator.

Install Prometheus Adapter using the command and values.yaml below. Change the following according to your environment:

  • The namespace should preferably be the same as the one where Prometheus is running.
  • The value of prometheus.url can be obtained with the command `kubectl -n your-namespace get svc`; look for the Prometheus service.

In case you don’t have the prometheus-community Helm repo registered, also use the helm repo add command:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm -n your-prometheus-namespace install -f values.yaml prometheus-adapter prometheus-community/prometheus-adapter
Prometheus Adapter values
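
A minimal sketch of what values.yaml can contain, assuming a Prometheus service named prometheus-operated on port 9090 and showing only an external rule for the Kafka consumer lag; a similar rule would be added for the application metric, for which executor_scaling_index is a hypothetical placeholder name:

prometheus:
  # Assumed service name and port; use the service found with kubectl get svc.
  url: http://prometheus-operated.your-prometheus-namespace.svc
  port: 9090

rules:
  external:
    # Expose the Kafka consumer lag from Kafka Exporter as an external metric.
    - seriesQuery: 'kafka_consumergroup_lag{topic!=""}'
      resources:
        overrides:
          namespace: { resource: "namespace" }
      name:
        matches: "kafka_consumergroup_lag"
        as: "kafka_consumergroup_lag"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (consumergroup, topic)'
    # A similar rule would expose the application metric, e.g. a hypothetical
    # executor_scaling_index series exported by the distributor service.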

To study the rules configuration:

kubectl -n your-prometheus-namespace get configmaps prometheus-adapter -o yaml

Now we add the serviceMonitors in the same namespace. Note that the label metadata.labels.release is crucial, at least if you are using Prometheus Operator; without the correct value it will not work. Also note that you can have multiple namespaces in the namespaceSelector property.

kubectl -n your-prometheus-namespace apply -f serviceMonitor.yaml

Below is an example application serviceMonitor manifest, followed by an example serviceMonitor for the Kafka metrics.

Sample application serviceMonitor manifest
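
A minimal sketch of such a manifest, assuming the distributor service carries the label app: distributor, exposes its metrics on a service port named metrics, and that the Prometheus Operator release label value is prometheus (all of these names must be adapted to your environment):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: distributor-metrics                # assumed name
  namespace: your-prometheus-namespace
  labels:
    release: prometheus                    # must match your Prometheus Operator release
spec:
  selector:
    matchLabels:
      app: distributor                     # assumed label on the distributor service
  namespaceSelector:
    matchNames:
      - your-application-namespace
  endpoints:
    - port: metrics                        # assumed name of the service port
      path: /metrics
      interval: 15s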

ServiceMonitor for Kafka Metrics:
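
A corresponding sketch for Kafka Exporter, again with assumed names, labels and port:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-exporter                     # assumed name
  namespace: your-prometheus-namespace
  labels:
    release: prometheus                    # must match your Prometheus Operator release
spec:
  selector:
    matchLabels:
      app: kafka-exporter                  # assumed label on the Kafka Exporter service
  namespaceSelector:
    matchNames:
      - your-kafka-namespace
  endpoints:
    - port: metrics                        # assumed name of the service port
      interval: 15s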

At this point you can verify the installation thus far by checking that there is content delivered from the K8S external metrics server:

kubectl get --raw /apis/external.metrics.k8s.io/v1beta1

This will render a response containing the external metrics; make sure you see your metrics, whether you specified only the Kafka consumer lag or both.

To see the actual value of the Kafka consumer lag metric:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/your-application-namespace/kafka_consumergroup_lag?labelSelector=consumergroup%3Dyour-kafka-consumergroup%2Ctopic%3Dyour-kafka-topic"

Now only the two HPA related manifests remain. First, let’s install the ClusterRole and corresponding ClusterRoleBinding to give the HPA access to the metrics by applying the manifest:
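
A minimal sketch of such a manifest, where the role and binding names are placeholders and the horizontal-pod-autoscaler service account in kube-system is an assumption that may differ between distributions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-metrics-reader            # placeholder name
rules:
  - apiGroups: ["external.metrics.k8s.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hpa-external-metrics-reader        # placeholder name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-metrics-reader
subjects:
  - kind: ServiceAccount
    name: horizontal-pod-autoscaler        # assumed HPA controller service account
    namespace: kube-system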

And finally apply the HPA manifest above (right before the installation heading). Remember to replace the consumer group and topic properties with the relevant values for your setting.

After a while, you can inspect the status of the HPA using

kubectl -n your-application-namespace describe hpa your-hpa-name

This gives plenty of information about the current metrics values, scaling history and errors along the way.

Observability

Observability is key to understanding and trusting the scaling behavior, and Grafana is naturally the obvious candidate for studying the results. If you use Prometheus Operator to install Prometheus you already have Grafana installed. It is a pleasure to study the behavior of the auto scaling together with the metrics of interest. For instance, if you add a metric for the latency of processing a Kafka command you get concrete, measurable feedback on the requirement that a job must start as quickly as possible. Here is an example showing a Grafana dashboard with total executors, idle executors and latency (defined as the duration from the time the Kafka message is published to the topic until it is processed by the application). We see that the stabilization window does its work to achieve a pretty smooth scaling cycle, although the idle executors metric is a bit volatile.

Long running processes

Now let’s return to the topic of auto scaling long-running processes, which clearly poses a problem in our setup thus far. The problem to solve is that with the current setup, scaling down will shut down executor pods no matter whether they are busy running a data analytics job or not. The contract between Kubernetes and the pods is that a pod gets a configurable grace period (terminationGracePeriodSeconds) to clean up resources and finish its current business, e.g. serving a web request, but this cannot take hours. This problem turns out to be separate from the content covered in this text, and is therefore covered in part 2 of this article.

References

https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/horizontal-pod-autoscaler-v1/
