Google Kubernetes Engine (GKE) Scalability Options and Sizing Optimization

Henry Cheng
Published in google-cloud-apac
4 min read · Apr 29, 2020

This article provides an overview of the scaling options offered by Google Kubernetes Engine (GKE) and the situations in which each can be used to handle scaling needs. We then go through some quick tips to rightsize GKE clusters and minimize over-provisioning. This article assumes the reader already has some basic knowledge of and experience with GKE.

Scaling Strategy

General Decision Tree

1. Determine what your traffic looks like. Are there times when CPU and memory utilization spike?

2. If yes, consider enabling Horizontal Pod Autoscaler and Cluster Autoscaler.

3. If no, then depending on your pod’s resource requests, you should:

  • Enable Vertical Pod Autoscaler and/or
  • Increase the number of nodes in the node pool, and/or
  • Select a more powerful machine type for the node pool

4. If you have not done any performance benchmarking in order to configure the pod’s resource requests, consider using Vertical Pod Autoscaler in recommendation mode for GKE to determine the optimal values for you.

5. You could also consider using Node Auto-provisioning.

Horizontal Pod Autoscaler

The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment, or replica set based on the observed CPU utilization. The feature, however, does not apply to objects that aren’t meant to be scalable, e.g. DaemonSets.

GKE also supports autoscaling based on custom metrics exported to Stackdriver by the pods. To learn how to autoscale workloads using metrics available in Stackdriver, see Autoscaling deployments with External Metrics.
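As a concrete sketch, a minimal HPA manifest that scales on average CPU utilization might look like the following. The Deployment name `web`, the replica bounds, and the 60% threshold are all illustrative placeholders:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # assumes a Deployment called "web" exists
  minReplicas: 2           # never scale below 2 pods
  maxReplicas: 10          # cap scale-out at 10 pods
  targetCPUUtilizationPercentage: 60
```

The HPA adds pods when average CPU across the pods exceeds the target and removes them when utilization falls, staying within the min/max bounds.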

Cluster Autoscaler

Cluster Autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, GKE automatically adds nodes to a node pool when there is not enough capacity for pending pods; conversely, when a node pool is under-utilized, GKE removes the extra nodes.
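Enabling autoscaling on an existing node pool is a single gcloud command; the cluster, node pool, and zone names below are placeholders for your own environment:

```shell
# Enable autoscaling on an existing node pool (names/zone are examples)
gcloud container clusters update my-cluster \
  --enable-autoscaling \
  --node-pool default-pool \
  --min-nodes 1 --max-nodes 5 \
  --zone us-central1-a
```

The min/max bounds cap how far GKE may shrink or grow the pool regardless of demand.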

Vertical Pod Autoscaler

Vertical Pod Autoscaler frees you from having to think about what values to specify for a container’s CPU and memory requests. The autoscaler can recommend values for CPU and memory requests, or it can automatically update the values.

Vertical pod autoscaling provides the following benefits:

  • Cluster nodes are used efficiently, because pods use exactly what they need
  • Pods are scheduled onto nodes that have the appropriate resources available (e.g. GPU)
  • You don’t have to run time-consuming benchmarking tasks to determine the correct values for CPU and memory requests
  • Maintenance time is reduced, because the autoscaler can adjust CPU and memory requests over time without any action on your part

Vertical Pod Autoscaler in recommendation mode

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: some-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: some-deployment
  updatePolicy:
    updateMode: "Off" # Recommendation-only mode
```
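Once a VPA object is running in recommendation mode, its suggested CPU and memory values can be inspected with kubectl (assuming a VPA object named `some-vpa`):

```shell
# Inspect the VPA's recommendations (object name is an example)
kubectl describe vpa some-vpa
```

The Status section of the output lists a recommended target, lower bound, and upper bound for each container, which you can then copy into your pod manifests.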

Node Auto-provisioning

Node auto-provisioning is a mechanism of cluster autoscaler, which automatically manages the list of node pools on the user’s behalf. Without node auto-provisioning, the cluster autoscaler would only add/remove new nodes from a set of existing node pools. With node auto-provisioning, new node pools can be created and deleted automatically.
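Node auto-provisioning can be turned on for an existing cluster with gcloud. The resource ceilings below are illustrative; they bound the total CPU and memory that auto-created node pools may consume:

```shell
# Enable node auto-provisioning with illustrative resource limits
gcloud container clusters update my-cluster \
  --enable-autoprovisioning \
  --min-cpu 1 --max-cpu 32 \
  --min-memory 1 --max-memory 128
```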

Cluster Sizing Optimization

The quickest way to optimize a GKE cluster's utilization and avoid over-provisioning is to enable Cluster Autoscaling. This allows GKE to automatically resize the cluster based on the demands of the workloads. Nodes are added or deleted based on the resource requests of all the pods, so it is important to configure these values accurately in the pod's manifest. The values are usually determined through performance benchmarking of the applications. However, it is also possible to enable Vertical Pod Autoscaler and let GKE determine the optimal request and limit values automatically.
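As a reminder, resource requests are set per container in the pod template; the values below are purely illustrative and should ultimately come from benchmarking or VPA recommendations:

```yaml
# Container fragment of a Deployment's pod template (values are examples only)
containers:
- name: app
  image: gcr.io/my-project/app:latest  # placeholder image
  resources:
    requests:
      cpu: "250m"      # what the scheduler and Cluster Autoscaler use for bin-packing
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"
```

Because Cluster Autoscaler reacts to requests rather than actual usage, requests that are set too high directly translate into over-provisioned nodes.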

It is also useful to enable GKE usage metering, so that users can track the resource requests and actual resource usage of workloads over a period of time. This helps determine optimal resource request values and node pool configurations.
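Usage metering is enabled per cluster by pointing it at a BigQuery dataset; the cluster and dataset names below are placeholders:

```shell
# Enable GKE usage metering into a BigQuery dataset (names are examples)
gcloud container clusters update my-cluster \
  --resource-usage-bigquery-dataset cluster_usage \
  --enable-resource-consumption-metering
```

Resource request data and (optionally) consumption data then land in BigQuery, where they can be compared to spot over-requested workloads.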

By default, Cluster Autoscaler runs with the “balanced” profile, in which GKE prioritizes availability over down-scaling; this is the more conservative option. Alternatively, users can specify the “optimize-utilization” profile, which causes GKE to scale down nodes more aggressively. This profile is ideal for batch workloads that are not sensitive to start-up latency. For serving workloads, it is best to stick with the “balanced” profile. See Autoscaling Profiles.
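Switching the autoscaling profile is a single gcloud command; at the time of writing this feature was exposed through the beta command group, and the cluster name is a placeholder:

```shell
# Switch the cluster to the more aggressive down-scaling profile
gcloud beta container clusters update my-cluster \
  --autoscaling-profile optimize-utilization
```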

Cluster Multi-tenancy

Another approach to optimizing cluster utilization and costs is a multi-tenant cluster, in which a very large cluster is provisioned and shared by multiple teams/projects (tenants). See GKE Cluster multi-tenancy for best practices. While this approach may help optimize resources by reducing the chance of over-provisioning, it does come with additional overhead, such as the effort to manage namespaces, access controls, network policies, etc.

Committed Use Discounts

From a cost perspective, committed use discounts are ideal for workloads with predictable resource needs. When you purchase a committed use contract from Google Cloud, you gain access to compute resources (vCPUs, memory, GPUs, and local SSDs) at a discounted price. The discount is up to 57% for most machine types and GPUs, and can go up to 70% for memory-optimized machine types. See VM instances pricing for committed use pricing of different machine types.

Note that even without committed use discounts, long running GKE clusters can still benefit from sustained use discounts automatically.
