Traffic Based Horizontal Pod Autoscaler for GKE Clusters

Divyesh Arya
Google Cloud - Community
Dec 6, 2022
Horizontal Pod Autoscaler

HPA, or Horizontal Pod Autoscaler, automatically increases or decreases the number of Pods serving a workload.

When we talk about the Horizontal Pod Autoscaler, CPU and memory are the first metrics that come to mind. However, some applications are bound by capacity limits that are not reflected in their CPU or memory usage. For such applications, adding a traffic utilization metric, requests per second (RPS), to the HPA provides a more holistic way of autoscaling, because it is more closely aligned with actual application usage.

To expose such applications, we can use the GKE Gateway controller, which has a feature (currently in Preview) that natively feeds traffic signals from load balancers to the Kubernetes API server to autoscale Pods.

In this blog, we are going to demonstrate how to use the Gateway controller and the Horizontal Pod Autoscaler for traffic-based autoscaling of a sample hello-world application.

Let’s begin with the demonstration.

Prerequisites

  1. A GKE cluster running version 1.24 or later.
  2. The kubectl CLI installed and configured for your GKE cluster.
  3. The gcloud CLI installed and configured with your GCP project.

Please refer to GKE Gateway Controller requirements.
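As a quick sanity check before starting, you can verify these prerequisites from your shell. The cluster name gke-gateway-demo and zone europe-west3-c below match the demo; substitute your own values.

```shell
# Check the installed CLI versions
kubectl version --client
gcloud version

# Verify the cluster version is 1.24 or later
# (cluster name and zone are the demo's; adjust to yours)
gcloud container clusters describe gke-gateway-demo \
    --zone=europe-west3-c \
    --format='value(currentMasterVersion)'

# Confirm kubectl is pointed at the right cluster
kubectl config current-context
```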

Enable Gateway Controller API

Enable the Gateway API on the existing cluster gke-gateway-demo.

gcloud container clusters update gke-gateway-demo \
    --gateway-api=standard \
    --zone=europe-west3-c
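Once the update completes, you can confirm that the Gateway API is enabled and that the GKE GatewayClasses are installed. Output may vary while the feature is in Preview.

```shell
# Check the Gateway API channel configured on the cluster
gcloud container clusters describe gke-gateway-demo \
    --zone=europe-west3-c \
    --format='value(networkConfig.gatewayApiConfig.channel)'

# List the GatewayClasses provided by the GKE Gateway controller
kubectl get gatewayclass
```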

Deploy the application with ClusterIP service

Deploy the hello-world application on the GKE cluster and expose it internally with a ClusterIP Service.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: whereami
        image: us-docker.pkg.dev/google-samples/containers/gke/whereami:v1.2.11
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: hello-world
  annotations:
    networking.gke.io/max-rate-per-endpoint: "10"
spec:
  type: ClusterIP
  selector:
    app: hello-world
  ports:
  - port: 8080
    targetPort: 8080

The above YAML will create:

  • A Deployment with 2 replicas.
  • A ClusterIP Service with Service capacity set to 10 via max-rate-per-endpoint annotation.

The Service capacity is a critical element when using traffic-based autoscaling because it defines the maximum traffic a Service should receive, in requests per second per Pod. It is configured using the Service annotation networking.gke.io/max-rate-per-endpoint.
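To make the capacity math concrete, here is a small illustrative sketch using the demo values (10 RPS per Pod from the annotation, 2 replicas from the Deployment):

```shell
# Illustrative arithmetic only: total Service capacity =
# max-rate-per-endpoint (RPS per Pod) x number of ready Pods
MAX_RATE_PER_ENDPOINT=10   # from the Service annotation
REPLICAS=2                 # from the Deployment
TOTAL_CAPACITY=$((MAX_RATE_PER_ENDPOINT * REPLICAS))
echo "Total Service capacity: ${TOTAL_CAPACITY} RPS"
```

So with two replicas, the Service as a whole is considered at full capacity at 20 RPS; adding replicas raises that ceiling proportionally.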

Create a Gateway Resource

Expose the application to the outside world using GKE Gateway.

kind: Gateway
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: hello-world
spec:
  gatewayClassName: gke-l7-gxlb
  listeners:
  - name: http
    protocol: HTTP
    port: 80

The above YAML will create a global external HTTP(S) load balancer for the GKE cluster:

  • gatewayClassName: gke-l7-gxlb: specifies the GatewayClass. gke-l7-gxlb corresponds to the global external HTTP(S) load balancer.
  • port: 80: specifies that the Gateway listens only on port 80 for HTTP traffic.

Traffic-based autoscaling is supported by global external and internal load balancers for single GKE clusters only. For more information on GatewayClass capabilities, please refer to the official documentation.

Create an HTTPRoute

Create an HTTPRoute resource that defines protocol-specific rules for mapping traffic from a Gateway to Kubernetes backend Services.

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: hello-world
  labels:
    gateway: hello-world
spec:
  parentRefs:
  - name: hello-world
  rules:
  - backendRefs:
    - name: hello-world
      port: 8080

The above YAML will create routing rules in the following manner:

  • All Traffic to Gateway IP goes to the Service hello-world.

That’s a very simple HTTPRoute, but we can define more complex traffic routing based on host, path, or even headers.
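As an illustration of path-based routing, a rule could look like the sketch below. The /api prefix and the hello-world-api Service here are hypothetical, purely to show the shape of a match rule.

```yaml
# Hypothetical example: route /api traffic to a separate backend Service
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: hello-world-paths
spec:
  parentRefs:
  - name: hello-world
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: hello-world-api   # hypothetical Service
      port: 8080
  - backendRefs:              # default rule for all other traffic
    - name: hello-world
      port: 8080
```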

Test the application

Get the IP address of the HTTP Load Balancer created by Gateway and access the application.

export GATEWAY_IP_ADDRESS=`kubectl get gateway -o=jsonpath='{.items[?(@.metadata.name=="hello-world")].status.addresses[0].value}'`
curl http://$GATEWAY_IP_ADDRESS/

You should get a JSON response from the whereami application.

Now we are all set with the application and ready to deploy the HPA.

Deploy HPA

Deploy the traffic-based HPA.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hello-world
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello-world
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      describedObject:
        kind: Service
        name: hello-world
      metric:
        name: "autoscaling.googleapis.com|gclb-capacity-utilization"
      target:
        averageValue: 70
        type: AverageValue

The above YAML describes a HorizontalPodAutoscaler with the following properties:

  • scaleTargetRef.name: hello-world: references the hello-world Deployment as the resource scaled by the Horizontal Pod Autoscaler.
  • metric.name: "autoscaling.googleapis.com|gclb-capacity-utilization": the load balancer capacity utilization metric used for autoscaling.
  • averageValue: 70: the target average capacity utilization (70%) for the HPA.

See HPA in action

Now let’s deploy a traffic generator with 20 RPS to validate traffic-based autoscaling behaviour. Update the GATEWAY_IP_ADDRESS with your Gateway IP address.

kubectl run --context=gke-gateway-demo -i --tty --rm loadgen  \
--image=cyrilbkr/httperf \
--restart=Never \
-- /bin/sh -c 'httperf \
--server=GATEWAY_IP_ADDRESS \
--hog --uri="/zone" --port 80 --wsess=100000,1,1 --rate 20'
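While the load generator runs, you can watch the autoscaler react. These are standard kubectl commands; the HPA and Deployment names come from the manifests above.

```shell
# Watch the HPA's observed metric value and replica count change
kubectl get hpa hello-world --watch

# In another terminal, watch the Pods being added
kubectl get pods -l app=hello-world --watch
```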

The Deployment is scaled to ~4 replicas, so that each replica ideally receives about 5 RPS of traffic, i.e. roughly 50% utilization per Pod. This is below the 70% target utilization, so the replica count stabilizes.


Depending on traffic fluctuations, the number of autoscaled replicas might also fluctuate.

The autoscaler attempts to scale replicas to achieve the following equation:

replicas = ceiling[ current traffic / ( target utilization × max-rate-per-endpoint ) ]

where target utilization is the HPA's averageValue expressed as a fraction (70 → 0.70).
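Plugging the demo numbers into this equation (20 RPS of generated traffic, a 70% target, max-rate-per-endpoint of 10) gives a steady-state floor of 3 replicas; transient fluctuations explain why you may briefly observe 4. A sketch of the arithmetic:

```shell
# Illustrative arithmetic only:
# replicas = ceil(traffic / (target_util * max_rate))
TRAFFIC=20        # RPS from the load generator
TARGET_UTIL=0.70  # HPA target (averageValue: 70 => 70%)
MAX_RATE=10       # networking.gke.io/max-rate-per-endpoint
REPLICAS=$(awk -v t="$TRAFFIC" -v u="$TARGET_UTIL" -v m="$MAX_RATE" \
  'BEGIN { r = t / (u * m); printf "%d\n", (r == int(r)) ? r : int(r) + 1 }')
echo "Target replicas: ${REPLICAS}"
```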

Conclusion

We have seen in practice how to do traffic-based autoscaling of workloads in a GKE cluster with native Gateway capabilities. We could achieve the same with a service mesh such as Istio together with Prometheus and the Prometheus adapter, but with GKE Gateway this comes built in, without the need for any service mesh or adapter.

Keep in mind that traffic-based autoscaling with the GKE Gateway controller is still in Preview, so you can expect some changes as it moves to General Availability.

Please refer to the official documentation to learn more about GKE Gateway. Hope you find it useful. Happy Learning.
