Scaling workloads based on GPU utilization in GKE

Ronald Moesbergen
Published in Deepdesk · Sep 22, 2021

With GPU-accelerated workloads becoming more common, scaling based on CPU utilization alone is no longer optimal: a pod can be saturating its GPU while its CPU sits almost idle. It would be better to scale on actual GPU usage instead. Here’s how to do just that on Google Cloud / Google Kubernetes Engine (GKE).

Step 1: Setting up the Google Cloud metrics adapter
First, we’ll need to install a component into Kubernetes that exposes Google Cloud Monitoring metrics (formerly known as Stackdriver) to Kubernetes through the ‘external metrics API.’ Google has created the ‘Custom Metrics Stackdriver Adapter’ for this. Installing it can be as simple as:

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

I strongly recommend reviewing all the resources it creates, though. Or even better: deploy it with your favorite Infrastructure-as-Code tool. Mine is Terraform, as you’ll see below.
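
For example, the adapter manifest itself can also be applied from Terraform. Here’s a minimal sketch using the third-party gavinbunney/kubectl and hashicorp/http providers (my choice for this sketch, not something the adapter requires; Helm or the kubernetes_manifest resource would work just as well):

# Fetch the adapter manifest. Pinning a copy of the file in your own repo is
# arguably safer than reading 'master' on every plan.
data "http" "stackdriver_adapter" {
  url = "https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml"
}
# Split the multi-document YAML into individual manifests.
data "kubectl_file_documents" "stackdriver_adapter" {
  # On http provider versions before 3.0 this attribute is called 'body'.
  content = data.http.stackdriver_adapter.response_body
}
# Apply each document to the cluster.
resource "kubectl_manifest" "stackdriver_adapter" {
  for_each  = data.kubectl_file_documents.stackdriver_adapter.manifests
  yaml_body = each.value
}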

NOTE: by default, the adapter registers itself for both the ‘custom.metrics.k8s.io’ API and the ‘external.metrics.k8s.io’ API. The former can cause problems if there are other metrics adapters in use. In that case, remove the APIService with ‘name: v1beta1.custom.metrics.k8s.io’ from the manifest. We only need the external metrics API for GPU-based scaling (or any other standard GCP metric).
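
To see what actually got registered on your cluster, you can list the metrics APIServices and query the external metrics API directly (plain kubectl, no extra tooling assumed):

$ kubectl get apiservices | grep metrics.k8s.io
$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"

If the adapter is healthy, the second command returns a JSON document from the external metrics API rather than an error.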

After installation, we need to give the service account that the adapter uses permission to read Google Cloud metrics. I recommend using “Workload Identity” for this, since it’s the most secure and a relatively easy way to do it. Here’s how, using Terraform:

# Create a Google service account for the adapter to use
resource "google_service_account" "custom-metrics-adapter" {
  account_id = "custom-metrics-adapter"
}

# Give the account the 'monitoring.viewer' role to allow it
# to read Google Cloud metrics
resource "google_project_iam_member" "cma-monitoring-reader" {
  member = "serviceAccount:${google_service_account.custom-metrics-adapter.email}"
  role   = "roles/monitoring.viewer"
}

# Get the current project id from the Terraform provider configuration
data "google_client_config" "gcp" {}

# Enable Workload Identity for the service account, binding it to the
# Kubernetes SA used by the stackdriver adapter. Note I'm using the
# 'monitoring' namespace
resource "google_service_account_iam_member" "cma-wlid" {
  role               = "roles/iam.workloadIdentityUser"
  service_account_id = google_service_account.custom-metrics-adapter.id
  member             = "serviceAccount:${data.google_client_config.gcp.project}.svc.id.goog[monitoring/custom-metrics-stackdriver-adapter]"
}

All we need to do now is annotate the existing Kubernetes service account with the name of the Google service account:

$ kubectl annotate serviceaccounts custom-metrics-stackdriver-adapter "iam.gke.io/gcp-service-account"="custom-metrics-adapter@<projectid>.iam.gserviceaccount.com" -n monitoring
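
If you’d rather not have this one imperative step, recent versions of the hashicorp/kubernetes provider can manage the annotation declaratively as well; a sketch, assuming that provider is already configured against your cluster:

# Patch the Workload Identity annotation onto the adapter's Kubernetes SA
resource "kubernetes_annotations" "cma-wlid-annotation" {
  api_version = "v1"
  kind        = "ServiceAccount"
  metadata {
    name      = "custom-metrics-stackdriver-adapter"
    namespace = "monitoring"
  }
  annotations = {
    "iam.gke.io/gcp-service-account" = google_service_account.custom-metrics-adapter.email
  }
}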

Step 2: Using the external metric in a HorizontalPodAutoscaler
Now that the metrics are available in Kubernetes, we can use them to scale our workloads with a HorizontalPodAutoscaler:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload
spec:
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: kubernetes.io|container|accelerator|duty_cycle
        selector:
          matchLabels:
            resource.labels.container_name: gpu-container-name
      target:
        type: AverageValue
        averageValue: 80
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: "gpu-workload"

Some things to note:
1. The example uses the “autoscaling/v2beta2” API, which allows us to scale on more than one metric.
2. The first metric is the ‘External’ metric “accelerator/duty_cycle”: the percentage of time the GPU was actively processing. It is reported per container, which is what the selector filters on.
3. The second metric is the usual CPU-based metric, so the deployment also scales out if a workload ever becomes CPU-bound.
That’s it! You should now see your HorizontalPodAutoscaler scale based on both GPU duty cycle and CPU:

$ kubectl get hpa
NAME           REFERENCE                 TARGETS                MIN   MAX
gpu-workload   Deployment/gpu-workload   26/80 (avg), 47%/80%   1     10
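
If TARGETS shows <unknown> instead of numbers, describing the autoscaler usually points at the cause (missing IAM permissions, a mistyped metric name or selector, or simply no samples yet):

$ kubectl describe hpa gpu-workload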
