Democratizing AI: How GKE Makes Machine Learning Accessible

Abdellfetah SGHIOUAR
Google Cloud - Community
5 min read · Dec 21, 2023

Image of a robot with GKE written on its chest, generated from Vertex AI Studio using Imagen 2

Generative AI has kept the GKE product team busy over the last year. This article collects a curated list of new GKE features that are especially useful for machine learning, artificial intelligence, and large language model workloads, along with some open-source and community projects that run well on GKE.

This article is largely based on content originally authored by Nathan Beach with the help of Marcus Johansson.

GPUs

Graphics Processing Units (GPUs) are a common type of hardware accelerator used for resource-intensive tasks such as machine learning (ML) training and inference and large-scale data processing. In GKE Autopilot and Standard, you can attach GPU hardware to nodes in your clusters and then allocate GPU resources to containerised workloads running on those nodes.
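
To make this concrete, here is a minimal sketch of requesting a GPU for a container using the official kubernetes Python client. It assumes your kubeconfig already points at a GKE cluster with GPU nodes (for example via gcloud container clusters get-credentials); the pod name, image, and namespace are illustrative.

    # Minimal sketch: request one NVIDIA GPU for a container on GKE.
    # Assumes `pip install kubernetes`; all names here are illustrative.
    from kubernetes import client, config

    config.load_kube_config()  # credentials from your local kubeconfig

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-demo"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="cuda-check",
                    image="nvidia/cuda:12.2.0-base-ubuntu22.04",
                    command=["nvidia-smi"],
                    # Kubernetes schedules the pod onto a node with a free GPU
                    # and gives the container exclusive use of it.
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"}
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)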

  • A3 VM, powered by NVIDIA H100 GPUs, is generally available. The A3 VM is optimised for GPU supercomputing and offers 3x faster training and 10x greater networking bandwidth compared to the prior generation. A3 can also operate at scale, enabling users to scale models to tens of thousands of NVIDIA H100 GPUs.
  • G2 VM with NVIDIA L4 GPUs offers great inference performance per dollar. The G2 VM became generally available earlier this year, and we recently announced impressive MLPerf results for it, including up to 1.8x better performance per dollar than a comparable public cloud inference offering.
  • GPU slicing on GKE: When using GPUs with GKE, Kubernetes allocates one full GPU per container even if the container needs only a fraction of it for its workload, which can lead to wasted resources and cost overruns. To improve GPU utilisation, multi-instance GPUs let you partition a single NVIDIA A100 GPU into up to seven slices, each of which can be independently allocated to a container on the node (see the sketch just after this list).
  • GPU dashboard available on the GKE cluster details page: When viewing a specific GKE cluster’s details in the Cloud Console, the Observability tab now includes a dashboard for GPU metrics. This provides visibility into the utilisation of GPU resources, including utilisation by GPU model and by Kubernetes node.
  • Autopilot now supports L4 GPUs in addition to its existing support for NVIDIA T4, A100, and A100 80GB GPUs.
  • Automatic GPU driver installation is available in GKE 1.27.2-gke.1200 and later, which enables you to install NVIDIA GPU drivers on nodes without manually applying a DaemonSet.
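
The GPU slicing item above corresponds to a small change in the pod spec. Here is a hedged sketch, assuming a node pool whose A100 GPUs were created with a 1g.5gb partition size; the node label below is the one GKE applies to such nodes, and the rest of the names are illustrative.

    # Minimal sketch: schedule a pod onto one slice of a multi-instance A100.
    from kubernetes import client, config

    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-slice-demo"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            # Steer the pod to nodes whose A100s are split into 1g.5gb slices.
            node_selector={"cloud.google.com/gke-gpu-partition-size": "1g.5gb"},
            containers=[
                client.V1Container(
                    name="cuda-check",
                    image="nvidia/cuda:12.2.0-base-ubuntu22.04",
                    command=["nvidia-smi"],
                    # "1" here is one slice, not a whole GPU, so up to seven
                    # such pods can share a single physical A100.
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"}
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)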

TPUs

Tensor Processing Units (TPUs) are Google’s custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. Whereas GPUs are general-purpose processors that support many different applications and software, TPUs are optimised to handle the massive matrix operations used in neural networks at high speed. GKE supports adding TPUs to nodes in the cluster to train machine learning models.
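
As a quick illustration of the kind of work TPUs accelerate, here is a minimal sketch using the JAX library from inside a pod scheduled on a GKE TPU node; it assumes a TPU-enabled JAX install (the jax[tpu] wheel).

    # Minimal sketch: run a large matrix multiply on a TPU with JAX.
    import jax
    import jax.numpy as jnp

    print(jax.devices())  # on a TPU node this lists TPU devices

    # jit compiles the function with XLA, which maps the matmul onto the
    # TPU's matrix units, exactly the operation TPUs are optimised for.
    @jax.jit
    def matmul(a, b):
        return jnp.dot(a, b)

    key = jax.random.PRNGKey(0)
    a = jax.random.normal(key, (4096, 4096))
    b = jax.random.normal(key, (4096, 4096))

    print(matmul(a, b).shape)  # (4096, 4096)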

Orchestration and resource management

Ray on GKE

Ray.io is an open-source framework for easily scaling Python applications across multiple nodes in a cluster. Ray provides a simple API for building distributed, parallelised applications, and it is especially popular for deep learning workloads.
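
Here is a minimal sketch of Ray’s task API, assuming only pip install ray; on GKE you would typically point it at a Ray cluster deployed on the cluster (for example with the KubeRay operator) rather than running locally.

    # Minimal sketch: fan a Python function out across Ray workers.
    import ray

    ray.init()  # connects to a configured cluster, or starts a local one

    @ray.remote
    def square(x):
        return x * x

    # Each call becomes a task that Ray schedules across the cluster.
    futures = [square.remote(i) for i in range(10)]
    print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]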

Recently published resources and tutorials

Visit g.co/cloud/gke-aiml for helpful resources about running AI workloads on GKE.
