AI on GKE — how to find the right size?
Recently, large language models (LLMs) have emerged as a powerful tool for a wide range of natural language processing tasks. Simply put, GenAI is everywhere, and we are trying it on so many different tasks! Look into your fridge: GenAI? Look at your GKE cluster: can you find GenAI workloads there?
How do you deploy a model on your own infrastructure? Which platform should you choose? And how do you find the right infrastructure sizing? Wouldn't it be good to see the numbers?
In this post we will walk step by step through deploying and benchmarking the Gemma model on several GKE infrastructure configurations, comparing inference time on a given set of prompts.
Model
Open Foundation Models (OFMs) are popular in the AI community because they are accessible, transparent and reproducible. They allow for task- and domain-specific fine-tuning, and can be deployed anywhere. A nice repository of OFM models can be found on Hugging Face. For the purpose of this post I have chosen Gemma, a lightweight, state-of-the-art OFM from Google, built from the same research used to create the Gemini models. What is worth mentioning here is that the framework I am describing allows deploying any OFM from the Hugging Face repository.
Platform
Where to deploy an LLM? When using Google Cloud, if you need a unified, managed AI platform, then Vertex AI is an option. On the other hand, if you want the more granular control, scalability and portability of Kubernetes, then Google Kubernetes Engine (GKE) can be an option. GKE is a fully managed Kubernetes service, with infrastructure management taken care of: a high-availability setup, scaling, repair and upgrades. For AI workloads it offers specialized compute resources (TPUs, GPUs). Additionally, with GCS FUSE it allows you to access data stored in GCS buckets as a filesystem.
Let's start with deployment
To make both infrastructure and model deployment easy, we will use an open-source repository we have contributed to, available on GitHub.
Let's clone this repository:
git clone https://github.com/GoogleCloudPlatform/ai-on-gke
This repository is a quite comprehensive set of assets related to AI/ML workloads on GKE. For now we will focus on the benchmarks folder. There we can find a set of Terraform modules that set up the whole environment: from the cluster, service accounts and the right permissions, to the inference server and model download, and finally the benchmarking framework.
The benchmarks directory consists of several deployment stages:
/benchmarks
|
|- benchmark
|- inference-server
|- infra
     |
     |- stage-1
     |- stage-2
     |- orchestration
The architecture of this module is, well, modular, which requires running the stages one after another and passing parameters between them. Such a modular construction allows for interchangeability of components (e.g. using a different inference server).
The stages, in the order they should be run, are:
- infra / stage-1 — consists of Terraform modules setting up the project, the GKE cluster and the GKE node pools according to the user's specification
- infra / stage-2 — creates all remaining GCP objects, like the needed GCS buckets, service accounts and permissions
- inference-server / — the inference server setup
- benchmark / dataset / — the set of prompts needed for benchmarking
- benchmark / tools / — the benchmarking tool that should be used
For the purpose of this article, we will create a Text Generation Inference (TGI) server and use Locust as the benchmarking tool.
To create that, we need to run the stages in the following order:
1. infra / stage-1 — the cluster and node pools
As input parameters for this very first stage, we need to specify the cluster details (like the project name and VPC details), as well as the hardware configuration of the cluster. For the hardware configuration, let's start with something like this:
project_id = "<PROJECT_NAME>"
cluster_name = "<CLUSTER_NAME>"
region = "us-central1"
gke_location = "us-central1-a"
prefix = "ai-gke-0"
vpc_create = {
  enable_cloud_nat = true
}
cluster_options = {
  enable_gcs_fuse_csi_driver            = true
  enable_gcp_filestore_csi_driver       = true
  enable_gce_persistent_disk_csi_driver = true
}
# this allows using kubectl from the Cloud Shell, but is somewhat of a security tradeoff
# benchmarking clusters are rather short-lived test clusters, so we can go this way this time
enable_private_endpoint = false
nodepools = {
  nodepool-cpu = {
    machine_type = "n2-standard-2",
  },
  nodepool-gpu = {
    machine_type = "g2-standard-24",
    guest_accelerator = {
      type  = "nvidia-l4",
      count = 2,
      gpu_driver = {
        version = "LATEST"
      }
    }
  }
}
The cluster configuration created in stage 1 follows Google best practices, as it uses the GCP Fast Fabric modules under the hood, as well as the best practices from the GKE Jumpstart examples.
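Each stage is a standalone Terraform module. A minimal sketch of applying stage 1, assuming you run it from the cloned repository and put the variables above into a terraform.tfvars file (the exact sample variable file name in the repository may differ):
cd ai-on-gke/benchmarks/infra/stage-1
terraform init
terraform apply
# the outputs are needed as inputs for the next stages, e.g.:
terraform output -json | jq '."fleet_host".value'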
2. infra / stage-2 — the remaining GCP resources
For this stage, we need to specify which GCP resources will be created, as well as pass some resource IDs created in the previous stage.
# can be obtained from stage-1 by running:
# terraform output -json | jq '."fleet_host".value'
credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/<PROJECT_NUMBER>/locations/global/gkeMemberships/<CLUSTER_NAME>"
}
#terraform output -json | jq '."project_id".value'
project_id = "<PROJECT_NAME>"
bucket_name = "ai-gke-benchmark-fuse"
bucket_location = "EU"
output_bucket_name = "ai-gke-benchmark-results"
output_bucket_location = "EU"
google_service_account = "benchmark-sa"
kubernetes_service_account = "benchmark-ksa"
benchmark_runner_google_service_account = "locust-runner-sa"
# optional - if added, the Secret Manager secret will be created
# and needs to be filled manually with the user's HuggingFace token
# this allows downloading models from HuggingFace that require user credentials
secret_name = "hugging_face_secret"
secret_location = "europe-central2"
At this point, if you decided to create a secret for the Hugging Face token, now is the moment to add this token manually as a new version of the created secret.
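A minimal sketch of adding the token as a new secret version with gcloud, assuming the secret name from the configuration above (replace <HF_TOKEN> with your Hugging Face access token):
echo -n "<HF_TOKEN>" | gcloud secrets versions add hugging_face_secret \
  --project=<PROJECT_NAME> --data-file=-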
Note: to use the Gemma model it is required to pass user credentials to Secret Manager, as well as to visit https://huggingface.co/google/gemma-7b-it and accept the license.
3. inference-server / text-generation-inference — consists of all the setup required for the Text Generation Inference (TGI) server from Hugging Face
Similarly to the previous stage, we need to pass some parameters from the outputs of the infra stages. For model_id we specify the Gemma model from Hugging Face; in the same way, you can deploy any other HF model of your choice.
# can be obtained from stage-1 by running:
# terraform output -json | jq '."fleet_host".value'
credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/<PROJECT_NUMBER>/locations/global/gkeMemberships/<CLUSTER_NAME>"
}
#terraform output -json | jq '."project_id".value'
project_id = "<PROJECT_NAME>"
# as specified in infra / stage-2
namespace = "benchmark"
ksa = "benchmark-ksa"
# model id from HF
model_id = "google/gemma-2b-it"
# if specified and filled previously
hugging_face_secret = "projects/<PROJECT_NAME>/secrets/hugging_face_secret"
hugging_face_secret_version = 1
Great, now we have TGI and the model set up and ready to test.
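Before testing, it may be worth checking that the TGI deployment is up and running in the benchmark namespace created in stage 2:
kubectl get pods -n benchmark
# the TGI pod should reach the Running state; downloading the model on first start can take a while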
Let’s test it!
We can now test the inference server with a sample prompt:
kubectl run --image=nginx --command -n benchmark test -- curl tgi/generate -X POST -d '{"inputs":"Who are you?","parameters":{"max_new_tokens":10}}' -H 'Content-Type: application/json'
kubectl logs test -n benchmark
Note: to use kubectl on the cluster, you need to add your endpoint (e.g. the Cloud Shell IP address) to the control plane authorized networks.
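If kubectl is not configured yet, one way to set it up could look like this (a sketch assuming direct access to the cluster endpoint; <YOUR_IP> is your public IP, e.g. the Cloud Shell egress IP):
gcloud container clusters get-credentials <CLUSTER_NAME> --zone us-central1-a --project <PROJECT_NAME>
gcloud container clusters update <CLUSTER_NAME> --zone us-central1-a \
  --enable-master-authorized-networks \
  --master-authorized-networks <YOUR_IP>/32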
Benchmark
The next steps set up the benchmarking tools. For the purpose of this post, we will set up Locust with some additional orchestration.
Let's prepare the testing dataset
The prompts for benchmarking are downloaded from a given GCS path prior to starting the Locust tasks. Prompts are read in line by line, so each prompt should be stored on its own line, e.g.:
Tell me who are you?\n
What about this prompt?\n
Example prompt datasets are available in the /benchmarks/dataset folder, with Python scripts and instructions on how to make the dataset available for consumption by the benchmark:
benchmark / dataset / ShareGPT_v3_unfiltered_cleaned_split/
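If you prefer to use your own prompts, a sketch of uploading them to the benchmark bucket created in stage 2 could look like this (the bucket name matches bucket_name from the stage-2 configuration; the object path is just an assumption, follow the instructions in the dataset folder for the expected layout):
# prompts.txt contains one prompt per line, as described above
gsutil cp prompts.txt gs://ai-gke-benchmark-fuse/prompts.txt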
Benchmarking framework
The Terraform scripts that set up the benchmarking framework are available at:
- benchmark / tools / locust-load-inference — this module creates the Locust infrastructure, which allows both load testing with Locust and metrics scraping, from both Locust and GCP Cloud Monitoring.
The sample environment created by all of the steps run so far can be illustrated as follows:
Try benchmarking
To run benchmarking, get the runner endpoint IP address:
kubectl get service -n $LOCUST_NAMESPACE locust-runner-api
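Assuming the runner service is exposed via a LoadBalancer with an external IP, you can capture the address into a variable like this:
RUNNER_ENDPOINT_IP=$(kubectl get service -n $LOCUST_NAMESPACE locust-runner-api \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $RUNNER_ENDPOINT_IP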
Using the IP, run this curl command to start the test:
curl -XGET http://$RUNNER_ENDPOINT_IP:8000/run
Optionally, you can pass several parameters defining the load and the test duration.
The results file will appear in the GCS bucket specified as output_bucket in the input variables. Metrics and Locust statistics will also be visible in the Cloud Monitoring dashboard.
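For example, assuming the output bucket name from the stage-2 configuration above (ai-gke-benchmark-results), you can list the results with gsutil:
gsutil ls gs://ai-gke-benchmark-results/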
Have fun experimenting!
And one last note: when you are done, you can destroy everything by running terraform destroy for each stage, just in the reverse order.
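A rough sketch of the teardown, assuming the directory layout described earlier:
cd ai-on-gke/benchmarks
(cd benchmark/tools/locust-load-inference && terraform destroy)
(cd inference-server/text-generation-inference && terraform destroy)
(cd infra/stage-2 && terraform destroy)
(cd infra/stage-1 && terraform destroy)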
The views expressed are those of the author and don’t necessarily reflect those of Google.