Kubernetes autoscaling based on custom metrics without using a host port

How to set up horizontal pod autoscaling based on application-provided custom metrics on minikube

NOTE: This blog post describes the old (and now deprecated) version of Kubernetes Horizontal Pod Autoscaling.

In Kubernetes, setting up horizontal pod autoscaling based on CPU utilization is pretty straightforward. Making the autoscaler use application-provided custom metrics, on the other hand, is much more involved, both because the documentation on the subject is poor and misleading and because autoscaling based on custom metrics is still an alpha feature.

After one failed attempt at setting up autoscaling of a small NodeJS test web app based on the queries-per-second (or requests-per-second) metric a few days ago, I gave it another try and managed to get the autoscaler working properly. I did this on a single-node minikube cluster, which adds its own set of problems, because currently, custom metrics can only be exposed through a port on the worker node itself (a so-called host port). Since multiple pods can’t share the same host port, this basically means you can’t run more than a single pod replica in the single-node cluster. Similarly, in a multi-node cluster, you can only run as many pod replicas as there are cluster nodes (at most one pod per node).

So, it may seem like you can’t really try out custom-metrics-based autoscaling on a single-node cluster. Luckily, that turns out not to be true. There is a way to set it up without having to use a host port. This blog post will show you how.

A quick introduction to horizontal pod autoscaling

Before I go into the details of exposing custom metrics and configuring autoscaling on top of them, let me just quickly describe how horizontal pod autoscaling works in Kubernetes.

One of the basic Kubernetes features is the ability to manually scale pods up or down horizontally simply by increasing or decreasing the desired replica count field on the Deployment, ReplicaSet (RS) or ReplicationController (RC). Automatic scaling is built on top of that. To make Kubernetes scale your pods automatically, all you need to do is create a HorizontalPodAutoscaler (HPA) object, just like you would any other Kubernetes object. In the case of CPU-utilization-based autoscaling, the autoscaler controller will then start observing the CPU usage of the pods and scale the Deployment/RS/RC so the average CPU utilization across all of the pods is kept close to the target CPU utilization configured in the HPA object.
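The scaling rule described above can be sketched in a few lines of Python. This is a simplified illustration and not the controller's actual code: the function name and parameters are made up for this example, at least one pod is assumed to be running, and the real autoscaler also applies tolerances and rescale delays that are left out here.

```python
import math


def desired_replicas(cpu_utilizations, target_utilization,
                     min_replicas=1, max_replicas=10):
    """Return the replica count that brings average CPU utilization near the target.

    cpu_utilizations: one percentage per running pod (hypothetical input,
    assumed to contain at least one value).
    """
    current = len(cpu_utilizations)
    average = sum(cpu_utilizations) / current
    # Scale the replica count in proportion to how far the average is from the target.
    wanted = math.ceil(current * average / target_utilization)
    # Never go outside the bounds configured on the HPA object.
    return max(min_replicas, min(max_replicas, wanted))
```

With three pods each at 90% and a 30% target, this sketch asks for nine replicas; with two lightly loaded pods it falls back to the configured minimum.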

The CPU usage (and other metrics) is collected by cAdvisor, which runs inside the kubelet. All those metrics are then aggregated by Heapster, which runs as a single pod (on one of the nodes) and collects metrics from all nodes in the cluster. The autoscaler controller gets the metrics from Heapster, so Heapster needs to be running for autoscaling to work.

The components involved in autoscaling

Enabling custom metrics collection in minikube

In minikube, Heapster is not enabled by default. It is, however, available as an add-on and can be enabled with the following command:

$ minikube addons enable heapster

While this enables the cluster-wide collection of pods’ CPU and memory usage, it doesn’t enable the collection of custom application-provided metrics. To enable it, the kubelet needs to be run with the --enable-custom-metrics option. When using minikube, this needs to be done when starting the cluster, like this:

$ minikube start --extra-config kubelet.EnableCustomMetrics=true

The cluster is now ready to collect custom application metrics. Now, we’ll see how we can deploy a simple NodeJS app that exposes a QPS metric (QPS is short for queries per second, or HTTP requests per second in our case). The returned metrics need to be in Prometheus format, which is pretty simple, as you’ll see below. The response to a request for metrics is just a plain text response with one metric per line. In the case of a single QPS metric, the response would look something like this:

# TYPE qps gauge
qps 15.42

In this example, the current qps is 15.42. Additional metrics would be represented by additional lines (for details, see the Prometheus exposition formats documentation).
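To make the format concrete, here is a minimal Python sketch that renders and parses such a response. It is only an illustration: the test app in this post is written in NodeJS, the function names are made up, and only plain gauges without labels or timestamps are handled.

```python
def render_metrics(metrics):
    """Render a dict of gauge values in the Prometheus text exposition format."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def parse_metrics(text):
    """Parse simple 'name value' lines, skipping comments and blank lines."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes no labels: the first token is the name, the second the value.
        name, value = line.split()[:2]
        result[name] = float(value)
    return result
```

An HTTP handler for the metrics endpoint would simply return the output of render_metrics as a plain text body.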

Configuring Kubernetes to collect the app’s QPS metric

Of course, we need to tell Kubernetes where to collect the metrics from. This is done by including a definition.json file in the /etc/custom-metrics/ directory inside the container. According to the docs, the file must be mounted into the container through a ConfigMap volume. It must contain the URL where the metrics are exposed:

{
  "endpoint": "http://localhost:9000/metrics"
}

And here’s the first caveat. Because cAdvisor runs in the host network namespace (not the one the pod runs in), localhost refers to the worker node, not to the pod. Because of this, the NodeJS process running in the container needs to be bound to a port on the node itself (so-called host port). For example, to bind a container port to host port 9000, you would add hostPort: 9000 to the pod’s spec:

spec:
  containers:
  - image: luksa/kubia:qps
    name: nodejs
    ports:
    - containerPort: 8080
      hostPort: 9000

Obviously, multiple pods running on the same node cannot use the same host port. This makes autoscaling on a single-node minikube cluster impossible, since only a single pod replica can run on the single node. And even on a proper multi-node cluster, you’d only be able to run as many pod replicas as there are nodes.

When using a host port, only a single replica can run on each node

So, can we work around this?

The alternative to using a host port

Since the main problem stems from the fact that only a single pod instance can bind to a specific host port, maybe we could try using a different port for each pod instance. This would be easy if we were creating the individual pods manually, but since our pods are all created from the same pod template (the host port number is specified in the pod template), there’s no way to make replicas use different host ports.

So, we can’t use an individual host port on each replica and we can’t use the same host port on multiple replicas. What other option is there?

How about not using a host port at all?

While the documentation says that localhost in the definition.json file refers to the node and we must consequently use a host port, that’s really not necessary. Instead of referring to localhost, we can simply refer to the pod’s IP address and the pod’s own port.

But the documentation also states that the definition.json file needs to be defined in a ConfigMap and mounted into the container through a ConfigMap volume. Since we want to point to the individual IP of each pod, it seems like we’d need a separate ConfigMap for each pod replica, which isn’t achievable. Even if it were, we’d still need to know the pod IPs up front when creating the ConfigMaps, but they aren’t known until after the pod is scheduled.

It appears as though we can’t use the individual pod IP approach. Or can we?

Not using a ConfigMap

As it turns out, the documentation is wrong. While it really is necessary for the /etc/custom-metrics/ directory to be mounted from a volume, it doesn’t need to be a ConfigMap volume. It can simply be an emptyDir or any other type of volume.

So, all we need to do is mount an emptyDir volume and write a definition.json file to it at pod start-up. We can do this in the main process in the container itself, or we can use an init container instead. The latter saves us from having to modify the original container image.

Using an EmptyDir volume instead of a ConfigMap volume

The init container JSON could look something like this:

{
  "name": "setup",
  "image": "busybox",
  "command": ["sh", "-c", "echo \"{\\\"endpoint\\\": \\\"http://$POD_IP:8080/metrics\\\"}\" > /etc/custom-metrics/definition.json"],
  "env": [{
    "name": "POD_IP",
    "valueFrom": {
      "fieldRef": {
        "fieldPath": "status.podIP"
      }
    }
  }],
  "volumeMounts": [{
    "name": "config",
    "mountPath": "/etc/custom-metrics"
  }]
}
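The nested escaping in the shell command above is easy to get wrong, so it can help to see what the init container is expected to produce. The following Python sketch builds the same file contents from a pod IP. It is an illustration only: the function names are made up, and port 8080 and the /metrics path are taken from the command above.

```python
import json
import os


def build_definition(pod_ip, port=8080, path="/metrics"):
    """Return the definition.json contents pointing cAdvisor at the pod itself."""
    return json.dumps({"endpoint": f"http://{pod_ip}:{port}{path}"})


def write_definition(directory, pod_ip):
    """Write definition.json into the given directory and return the file path."""
    target = os.path.join(directory, "definition.json")
    with open(target, "w") as f:
        f.write(build_definition(pod_ip))
    return target
```

If the main process wrote the file itself instead of relying on an init container, it would call something like write_definition("/etc/custom-metrics", os.environ["POD_IP"]) at start-up, assuming POD_IP is injected through the same fieldRef.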

Configuring the Horizontal Pod Autoscaler to use the custom metric

Creating the HorizontalPodAutoscaler object is pretty straightforward. Because autoscaling based on custom metrics is still in alpha, we configure it through an annotation on the HPA object.

The manifest for a HorizontalPodAutoscaler, which makes sure the pods are scaled based on the QPS metric, would look something like this:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: kubia
  annotations:
    alpha/target.custom-metrics.podautoscaler.kubernetes.io: '{"items":[{"name":"qps", "value": "20"}]}'
spec:
  maxReplicas: 5
  minReplicas: 1
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: kubia
  targetCPUUtilizationPercentage: 1000000

Note the alpha/target.custom-metrics.podautoscaler.kubernetes.io annotation. We’re setting the target QPS to 20, which means we’d like each of our pods to be handling 20 requests per second. So, if we initially have a single pod and it starts receiving 80 requests per second, we want the autoscaler to scale up the number of pods to 4, so that each pod only gets 20 requests per second.

Also note the targetCPUUtilizationPercentage, which needs to be set to a very high value so the autoscaler only performs autoscaling based on QPS and not also on the pods’ CPU usage. The autoscaler will determine the required number of replicas based on each metric individually and then scale to the greater number. By setting the target CPU utilization percentage to such a high number, the required number of replicas according to the CPU utilization metric will always be 1, which leads to the QPS metric being the determining factor.
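The "greater number wins" behavior can be illustrated with a small Python sketch. This is an approximation under stated assumptions and not the controller's real implementation: the function names are invented, each metric is given as a (current per-pod average, target) pair, and tolerances and rescale delays are ignored.

```python
import math


def replicas_for_metric(current_replicas, current_value, target_value):
    """Replica count that would bring one per-pod average metric back to its target."""
    return max(1, math.ceil(current_replicas * current_value / target_value))


def combined_replicas(current_replicas, metrics, min_replicas=1, max_replicas=5):
    """Evaluate every metric on its own and scale to the greatest result.

    metrics: list of (current per-pod average, target) pairs (hypothetical input).
    """
    wanted = max(replicas_for_metric(current_replicas, cur, tgt)
                 for cur, tgt in metrics)
    # Stay within the minReplicas/maxReplicas bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, wanted))
```

With one pod receiving 80 requests per second against a target of 20, the QPS metric asks for 4 replicas, while a CPU target of 1000000 always yields 1, so the combined result is 4.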


Trying it out yourself

If you’d like to try this out immediately, you can use my example code at github.com/luksa/kubia-qps. Make sure minikube is running with the appropriate options as described above.

First, create the deployment:

$ kubectl create -f https://raw.githubusercontent.com/luksa/kubia-qps/master/qps-deployment.yaml
deployment "kubia" created

Then, expose the deployment through a service:

$ kubectl expose deployment kubia --port=80 --target-port=8080
service "kubia" exposed

Now create the HorizontalPodAutoscaler object:

$ kubectl create -f https://raw.githubusercontent.com/luksa/kubia-qps/master/qps-autoscaler.yaml
horizontalpodautoscaler "kubia" created

And, finally, put some load on your pods:

$ kubectl run -it --rm --restart=Never loadgenerator --image=busybox -- sh -c "while true; do wget -O - -q http://kubia.default; done"
Waiting for pod default/loadgenerator to be running, status is Pending, pod ready: false
...

Try opening up another terminal and watching the HPA and the deployment objects:

$ watch kubectl describe hpa,deployment

Soon, you should see the value of the QPS metric on the HPA object increase (it is exposed through an annotation):

Annotations:
  alpha/status.custom-metrics.podautoscaler.kubernetes.io={"items":[{"name":"qps","value":"119"}]}
  alpha/target.custom-metrics.podautoscaler.kubernetes.io={"items":[{"name":"qps", "value": "20"}]}
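Both annotations carry a small JSON payload, so they are easy to inspect programmatically. The Python sketch below parses them and compares the observed value with the target. It is an illustration only: the function names are made up, and the values are assumed to be plain numbers (a Kubernetes quantity with a suffix would need extra handling).

```python
import json


def parse_custom_metrics(annotation_value):
    """Turn the annotation's JSON payload into a {metric name: float} dict."""
    payload = json.loads(annotation_value)
    return {item["name"]: float(item["value"]) for item in payload["items"]}


def usage_ratio(status_annotation, target_annotation, name):
    """How many times the observed value exceeds the target for one metric."""
    status = parse_custom_metrics(status_annotation)
    target = parse_custom_metrics(target_annotation)
    return status[name] / target[name]
```

A ratio well above 1, as with the values shown above, is the condition under which a scale-up should follow.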

At the same time, you’ll see the autoscaler scale up the number of replicas (look at the events of the HPA):

SuccessfulRescale    New size: 4; reason: Custom metric qps above target

And, of course, after the autoscaler increases the desired replica count on the deployment, additional pods will be spun up:

Replicas:      4 updated | 4 total | 4 available | 0 unavailable
...
Events:
ScalingReplicaSet Scaled up replica set kubia-990981907 to 4

If you then stop the load-generating pod (simply press Ctrl-C and the pod will be terminated and deleted), the autoscaler will eventually scale back down to a single replica (but this may take a few minutes, depending on when the last rescale was performed).

Troubleshooting

If you don’t see the pods scaled up, make sure cAdvisor is actually collecting the QPS metric. You can check this by opening up the following URL in your browser:

http://$(minikube ip):4194/docker/

and then clicking on the subcontainer that starts with “k8s_nodejs” (and includes “kubia” somewhere in the middle). At the bottom, you should see a chart for the QPS metric:

The Application Metrics section in the cAdvisor web console

Similarly, you can also check whether the metrics data makes it into Heapster. There is a Heapster service in the kube-system namespace, which you can use to get the metrics. You’ll find the QPS metric for a specific pod at:

http://heapster.kube-system.svc.cluster.local/api/v1/model/namespaces/default/pods/<pod name>/metrics/custom/qps
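When checking several pods, building this URL by hand gets tedious. A small Python helper along these lines could assemble it. The helper name is hypothetical, and the pod name in the usage note is only a placeholder.

```python
from urllib.parse import quote

HEAPSTER_BASE = "http://heapster.kube-system.svc.cluster.local/api/v1/model"


def custom_metric_url(pod_name, metric="qps", namespace="default"):
    """Build the Heapster model API URL for one pod's custom metric."""
    return (
        f"{HEAPSTER_BASE}/namespaces/{quote(namespace, safe='')}"
        f"/pods/{quote(pod_name, safe='')}/metrics/custom/{quote(metric, safe='')}"
    )
```

For example, custom_metric_url("kubia-example-pod") returns the URL for that placeholder pod in the default namespace. The service name only resolves from inside the cluster, so the request has to be made from a pod.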

Wrap up

So, as you can see, it is already possible to set up autoscaling based on custom metrics, even on minikube.

Hopefully, this post has shed enough light on the subject so you can now go and set up autoscaling based on your own custom application metric.

Thank you for reading.


About the Author

Marko Lukša is a software engineer at Red Hat, where he is currently part of the Cloud Enablement team, bringing JBoss middleware products to OpenShift, a PaaS built on top of Kubernetes. He is also the author of Kubernetes in Action (Manning Publications, due out in Summer 2017).