Launching Worker Pod Autoscaler — Solving specific problems with worker scaling in Kubernetes

Alok Kumar Singh · Practo Engineering · Dec 12, 2019

When you find a doctor, book an appointment or consult a doctor online on Practo, a lot of work is done asynchronously using job queue systems (a.k.a. workers). This is a critical part of our architecture, and it plays an important role in making our systems scalable and reliable.

We use AWS SQS as the message queue and process about 800 million jobs per month. These workers run as Kubernetes Deployments, with every worker running as a separate pod. This lets us deploy, scale, monitor and troubleshoot each worker process independently.

This article focuses only on the “scaling part” of these worker deployments.

Implementation 1 — Using Custom Metrics and HPA

When we started with Kubernetes, we used HPA with its custom metrics support and Prometheus to scale the workers. This is also the solution the community generally recommends for scaling worker pods.

Horizontal Pod Autoscaler (HPA) is a controller that scales a Kubernetes Deployment, typically based on the CPU metric, and it also supports scaling based on any custom metric. The only requirement is that the user makes the custom metric available for consumption through the Kubernetes custom metrics API.

We exported the SQS CloudWatch metrics to Prometheus by adapting the Prometheus CloudWatch Exporter for this use case. Here is the code for it — https://github.com/practo/cloudwatch_exporter

This was our first implementation of scaling our background workloads. We ran this for about a year or so. Based on the feedback from the various service teams and our experience, the following were the problems with this implementation:

  1. Slow scale-ups: A lot of services required scaling to be near real-time: if there is a surge of messages in the queue, the workers should scale instantly. That was not happening because CloudWatch metrics for SQS are only available at 5-minute intervals. Solving this required a combination of long polling, the SQS metrics API and CloudWatch metrics.
  2. Scaling down was causing issues: HPA does not allow scaling up and down based on different metrics, but some use cases need exactly that. For example, we were scaling up on the SQS metric ApproximateNumberOfMessagesVisible, but we could not scale down on the same metric because, for high-throughput workers consuming the queue very fast, it was always zero.
  3. Non-idle workers were scaling down: Since the scaling was target-based, when the queue length metric dropped, HPA started scaling down workers that were still processing jobs.
  4. Costs were not optimized: Since metrics were polled from CloudWatch at fixed intervals rather than only when required, we incurred a significant AWS cost towards CloudWatch. This can be optimized by fetching only the relevant metrics, and only when we need them.
  5. Setup was not simple: Using custom metrics was not as straightforward as running a few kubectl commands; it required considerable effort from the user to make the whole setup work. Also, HPA required us to decouple the metrics from the scaler, which meant we could not derive the scaling metrics from the currently running pods already known to the scaler.

So to solve the above problems, we wrote Worker Pod Autoscaler.

Implementation 2 — Worker Pod Autoscaler

Scaling workloads based on queue length is a fairly common use case, so we decided to build a generic, performant and easy-to-use open-source autoscaler for workers using Kubernetes CRDs. The primary goals of this project were:

  1. Superfast scale-ups: Near real-time scaling of workers. As soon as a job arrives in the queue, the workers should scale up if needed.
  2. Solve the scale-down problems for the workers: The solution should not scale down workers that are processing a job. The scale-down metric should not be restricted to the same metric as scale-up; it should be based on what the queue provider offers.
  3. Built-in Queue Exporters: Kubernetes HPA with custom metrics requires organisational effort to export the custom metrics and store them so they are accessible to the Kubernetes API. That makes sense for Kubernetes, which has to support many use cases, but the current HPA implementation is restrictive for queue-based scaling. Built-in exporters let the user focus on the application and free them from maintaining metric exporters. The project should ship generic metric exporters (pollers) with every queue provider integration.
  4. Support for scaling up from and down to zero pods to optimize costs.
  5. Ease of Use: The whole setup of worker autoscaling should start working by running a few kubectl commands.
  6. Generic Platform Independent Autoscaler: The autoscaler should extend to support any message queuing service and should work in any cloud or on-premise Kubernetes environment.
  7. Open Source: The solution should be completely open-sourced for a bigger community to use, contribute and make it better.

Now, let’s dig into how Worker Pod Autoscaler aka WPA works!

Worker Pod Autoscaler

When the WPA controller pod starts, it installs the WPA custom resource definition (CRD) in Kubernetes if it is not already present, and then starts the following goroutines:

  • WPA Goroutine: The WPA controller goroutine is the control loop that scales each deployment to the required number of replicas based on the queue specified for that deployment.
  • Poller Goroutine(s): A poller manager starts and stops a poller goroutine for each WPA resource. Each poller fetches exactly the metrics it needs from the queue provider, keeping API calls to a minimum, and the sync intervals can be configured per use case. This concurrency is one of the reasons for near real-time scaling; a rough sketch of the pattern follows below.

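To make the poller manager concrete, here is a minimal Go sketch of the pattern just described: one goroutine per WPA resource, started when the resource appears and stopped when it is deleted. This is illustrative only, not the actual WPA code; the type names, the queue URI and the use of a plain ticker are assumptions made for the example.

package main

import (
	"fmt"
	"sync"
	"time"
)

// pollerManager starts one polling goroutine per WPA resource and stops it
// when the resource is deleted. Illustrative sketch only, not WPA source.
type pollerManager struct {
	mu      sync.Mutex
	stopChs map[string]chan struct{} // keyed by WPA resource name
}

func newPollerManager() *pollerManager {
	return &pollerManager{stopChs: make(map[string]chan struct{})}
}

// Start launches a poller goroutine for a WPA resource if one is not already running.
func (m *pollerManager) Start(name, queueURI string, interval time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.stopChs[name]; ok {
		return // a poller for this resource is already running
	}
	stop := make(chan struct{})
	m.stopChs[name] = stop

	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				// A real poller would call the queue provider (e.g. SQS)
				// here, fetching only the metrics it needs.
				fmt.Printf("polling %s for %s\n", queueURI, name)
			}
		}
	}()
}

// Stop terminates the poller goroutine when a WPA resource is deleted.
func (m *pollerManager) Stop(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if stop, ok := m.stopChs[name]; ok {
		close(stop)
		delete(m.stopChs, name)
	}
}

func main() {
	m := newPollerManager()
	m.Start("example-wpa", "https://sqs.ap-south-1.amazonaws.com/123456789/example-queue", time.Second)
	time.Sleep(3 * time.Second) // let it poll a few times
	m.Stop("example-wpa")
}
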
The algorithm for deciding the desired replicas can be read here.
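
For intuition, here is a simplified sketch of what a queue-length based replica calculation looks like. The function and parameter names (queueLength, targetMessagesPerWorker, minReplicas, maxReplicas) are illustrative; the actual algorithm linked above also deals with idle workers, in-flight messages and scaling down to zero safely.

package main

import (
	"fmt"
	"math"
)

// desiredReplicas is a simplified, illustrative calculation of how many
// worker pods a queue needs. Names are hypothetical; the real WPA algorithm
// (linked above) handles more cases than this.
func desiredReplicas(queueLength, targetMessagesPerWorker, minReplicas, maxReplicas int32) int32 {
	if targetMessagesPerWorker <= 0 {
		return minReplicas
	}

	// Enough workers so that each handles roughly
	// targetMessagesPerWorker messages from the backlog.
	desired := int32(math.Ceil(float64(queueLength) / float64(targetMessagesPerWorker)))

	// Clamp to the configured bounds; a minReplicas of 0 is what
	// allows scaling all the way down to zero pods.
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// 450 messages in the queue, a target of 150 messages per worker and
	// bounds of 0..30 replicas give 3 desired workers.
	fmt.Println(desiredReplicas(450, 150, 0, 30))
}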

Let us walk through WPA with a simple example.

Install WPA

Worker Pod Autoscaler can be installed easily using hack/install.sh. Please follow this for detailed information.

Example

Create a deployment and a WPA resource to manage its scaling.

kubectl create -f artifacts/example-deployment.yaml
kubectl create -f artifacts/example-wpa.yaml

WPA will scale example-deployment between 0 and 30 replicas: it scales up when there are more than 150 messages in the queue and scales the workers down to 0 when there are no jobs left.

That’s it! Two kubectl commands and your worker deployment starts scaling!

We have been running WPA in production for the last 4 months and have received great feedback from our developers at Practo. It is now at release v0.2.2.

Unused workers were stopped, reducing memory usage from 2 GB to 100 MB for one of the services.
CloudWatch costs came down by 50% because only the relevant metrics are fetched, and more intelligently.

Future Work

SQS is the only queue provider supported at present, but WPA can easily be extended to other queue providers: implement this interface and write a simple poller like this one to add support for a new provider. Pull requests are welcome :)
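
For a rough idea of the work involved, the sketch below shows the shape such a queue provider integration could take. The interface name, method names and the fake provider here are hypothetical; refer to the linked interface in the repository for the real contract.

package main

import "fmt"

// QueueProvider sketches what a queue provider integration has to supply.
// Illustrative only; the real Go interface in the WPA repository may differ
// in names and method set.
type QueueProvider interface {
	// Visible messages waiting in the queue, used for scale-up decisions.
	VisibleMessages(queueURI string) (int32, error)
	// Messages currently being processed, useful for scale-down decisions.
	InFlightMessages(queueURI string) (int32, error)
}

// fakeProvider is a stand-in showing how a new provider (say Beanstalkd or
// RabbitMQ) would plug in: implement the interface and add a poller for it.
type fakeProvider struct{}

func (fakeProvider) VisibleMessages(queueURI string) (int32, error)  { return 450, nil }
func (fakeProvider) InFlightMessages(queueURI string) (int32, error) { return 3, nil }

func main() {
	var p QueueProvider = fakeProvider{}
	visible, _ := p.VisibleMessages("example-queue")
	inFlight, _ := p.InFlightMessages("example-queue")
	fmt.Printf("visible=%d inFlight=%d\n", visible, inFlight)
}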

Thanks to the Kubernetes team for CRDs and the sample controller. Please feel free to open issues on GitHub if you have any questions or concerns. We hope WPA is of use to you!
