Serving Mistral 7B on L4 GPUs running on GKE

Javier Cañadillas
Google Cloud - Community
8 min read · Dec 1, 2023

This post shows how to serve the Mistral 7B model on L4 GPUs running on Google Kubernetes Engine (GKE). It will help you understand the AI/ML-ready features of GKE and how to use them to serve large language models.

Mistral 7B is a 7-billion-parameter language model that was introduced in September 2023. Describing how Mistral 7B works is beyond the scope of this post; you can read more about it in the Mistral 7B paper. Suffice it to say that it was chosen because it is a state-of-the-art model that is open source, easy to use and efficient, and it outperforms other language models like Llama 2 13B on all evaluated benchmarks.

GKE is a fully managed service that allows you to run containerized workloads on Google Cloud. It’s a great choice for running large language models and AI/ML workloads because it is easy to set up, it’s secure, and it comes with AI/ML batteries included: GKE installs the latest NVIDIA GPU drivers for you in GPU-enabled node pools and gives you autoscaling and partitioning capabilities for GPUs out of the box, so you can easily scale your workloads to the size you need while keeping costs under control. You can learn more about GKE’s AI/ML-ready features in this Cloud OnAir covering Open AI/ML Platforms on GKE.

For convenience, in this post you’ll be deploying the model using the text-generation-inference container images provided by the Hugging Face AI community.

Pre-requisites

To follow the steps in this tutorial, you will need:

  • Access to a Google Cloud project with L4 GPUs available and enough quota in the region you select.
  • A computer terminal with kubectl and the Google Cloud SDK installed. From the GCP project console you’ll be working with, you may want to use the included Cloud Shell, as it already has the required tools installed.

Setting up the required infrastructure

From your console, select the Google Cloud region and project, checking that L4 GPUs are available in the region you end up selecting. The one used in this tutorial is europe-west1, where at the time of writing there was availability for L4 GPUs:
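If you want to double-check where L4 GPUs are offered before settling on a region, you can list the accelerator types from the command line (an optional check; the zone names in the output tell you the regions):

# List the zones (and hence regions) where NVIDIA L4 accelerators are offered
gcloud compute accelerator-types list --filter="name=nvidia-l4" --format="value(zone)" | sort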

export PROJECT_ID=<your-project-id>
export REGION=europe-west1
export ZONE_1=${REGION}-b # You may want to change the zone letter based on the region you selected above
export ZONE_2=${REGION}-c # You may want to change the zone letter based on the region you selected above
export CLUSTER_NAME=llm-serving-cluster
gcloud config set project "$PROJECT_ID"
gcloud config set compute/region "$REGION"
gcloud config set compute/zone "$ZONE_1"

Then, enable the required APIs to create a GKE cluster:

gcloud services enable compute.googleapis.com container.googleapis.com

As you will be using the default Compute Engine service account to create the cluster, you need to grant it the permissions required to store metrics and logs in Cloud Monitoring, which you will be using later on:

PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')
GCE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
for role in monitoring.metricWriter stackdriver.resourceMetadata.writer; do
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:${GCE_SA} --role=roles/${role}
done
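If you want to confirm the roles were granted, you can list the bindings for that service account (an optional check, not required for the rest of the tutorial):

# Show the roles currently bound to the default Compute Engine service account
gcloud projects get-iam-policy $PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:${GCE_SA}" \
--format="value(bindings.role)"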

Now, create a GKE cluster with a minimal default node pool, as you will be adding a node pool with L4 GPUs later on:

gcloud container clusters create $CLUSTER_NAME \
--location "$REGION" \
--workload-pool "${PROJECT_ID}.svc.id.goog" \
--enable-image-streaming --enable-shielded-nodes \
--shielded-secure-boot --shielded-integrity-monitoring \
--enable-ip-alias \
--node-locations="$ZONE_1" \
--addons GcsFuseCsiDriver \
--no-enable-master-authorized-networks \
--machine-type n2d-standard-4 \
--enable-autoscaling --num-nodes 1 --min-nodes 1 --max-nodes 5 \
--ephemeral-storage-local-ssd=count=2

GKE automatically installs the required NVIDIA GPU drivers on the nodes for you. Note that, due to recent changes in the way that process works in GKE, using the --enable-private-nodes option to create private cluster nodes won’t work without additional NAT gateway configuration. That’s why you’re not using that option here.
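If you did want private nodes, you would first need a Cloud NAT gateway in the cluster’s VPC so the nodes can still pull the GPU driver installer and container images. A minimal sketch, assuming the default network (the router and NAT names below are arbitrary examples):

# Create a Cloud Router and a NAT gateway so private nodes keep outbound internet access
gcloud compute routers create llm-nat-router --network default --region $REGION
gcloud compute routers nats create llm-nat-config --router llm-nat-router --region $REGION \
--auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges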

Create an additional node pool with regular (non-spot) VMs with 2 L4 GPUs each:

gcloud container node-pools create g2-standard-24 --cluster $CLUSTER_NAME \
--accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
--machine-type g2-standard-24 \
--ephemeral-storage-local-ssd=count=2 \
--enable-autoscaling --enable-image-streaming \
--num-nodes=1 --min-nodes=0 --max-nodes=2 \
--shielded-secure-boot \
--shielded-integrity-monitoring \
--node-locations $ZONE_1,$ZONE_2 --region $REGION

Note how easy enabling GPUs in GKE is: just adding the --accelerator option automatically bootstraps the nodes with the necessary drivers and configuration so your workloads can start using the GPUs attached to the cluster nodes.

After a few minutes, check that the node pool was created correctly:

gcloud container node-pools list --region $REGION --cluster $CLUSTER_NAME

Also, check that the corresponding nodes in the g2-standard-24 node pool have the GPUs available:

kubectl get nodes -o json | jq -r '.items[] | {name:.metadata.name, gpus:.status.capacity."nvidia.com/gpu"}'

You should get one with 2 GPUs available corresponding to the node pool you just created:

{
"name": "gke-llm-serving-cluster-g2-standard-24-1-4e2b2f3d-2q2q",
"gpus": "2"
}

Deploying the model

You’re now ready to deploy the model. As mentioned before, you’ll be using Hugging Face’s Text Generation Inference container images for this, deploying the corresponding container into your own cluster.

Copy the following YAML content into a file named mistral-7b.yaml on your local machine:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      containers:
      - name: mistral-7b
        image: ghcr.io/huggingface/text-generation-inference:1.1.1
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - name: server-port
          containerPort: 8080
        env:
        - name: MODEL_ID
          value: mistralai/Mistral-7B-Instruct-v0.1
        - name: NUM_SHARD
          value: "1"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /data
          name: data
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: data
        hostPath:
          path: /mnt/stateful_partition/kube-ephemeral-ssd/mistral-data
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b-service
  namespace: default
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: mistral-7b

The full set of environment variables that you can inject into the container is documented in Text Generation Inference’s README file. The most relevant one here is QUANTIZE, the model quantization being selected, as it affects both the memory footprint and the inference speed of the model.
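For example, if you wanted to try 8-bit quantization instead of 4-bit NF4, you could change the QUANTIZE value in the manifest before applying it, or flip it later on the running deployment (a possible variation, assuming that value is supported by the TGI version you deploy):

# Change the quantization on an already-deployed model (triggers a rolling restart)
kubectl set env deployment/mistral-7b QUANTIZE=bitsandbytes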

Declaring GPU usage in the pod spec automatically triggers GKE’s ExtendedResourceToleration admission controller, which adds the corresponding toleration to the pod so it can be scheduled on the node pool with the GPUs. This works despite the automatic taint that GKE configures on the GPU nodes, which prevents other pods from being scheduled on them so you don’t waste your precious GPU resources by mistake.
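If you’re curious, you can inspect the taint that GKE placed on the GPU nodes (an optional check; the exact taint key and value may vary with the GKE version):

# Show the taints GKE automatically applied to the GPU node pool
kubectl get nodes -l cloud.google.com/gke-nodepool=g2-standard-24 -o json | jq -r '.items[].spec.taints'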

Now, create the Kubernetes objects:

kubectl apply -f mistral-7b.yaml

Check all the objects you’ve just created:

kubectl get all

Check that the pod has been correctly scheduled in one of the nodes in the g2-standard-24 node pool that has the GPUs available:

POD_NAME=$(kubectl get pods -l app=mistral-7b -o json | jq -r '.items[0].metadata.name')
kubectl get pods $POD_NAME -o json | jq -r '.spec.nodeName'

The resulting output should contain the name g2-standard-24 that you used to identify the node pool with the GPUs.

Finally, make sure the text generation inference app is running correctly:

kubectl logs -l app=mistral-7b
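The first start takes a while, since the container has to download the model weights and load them onto the GPU. If you prefer to block until the deployment reports ready instead of polling the logs, you can use kubectl wait (optional):

# Block until the deployment becomes Available (the model download can take several minutes)
kubectl wait --for=condition=Available deployment/mistral-7b --timeout=600s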

Testing the model

To test the model, port forward the port where the inference server is running to your local machine:

kubectl port-forward deployment/mistral-7b 8080:8080

Then, in another terminal, test the model by sending a request to the inference server:

curl -s 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' --data-binary @- <<EOF | jq -r '.generated_text'
{
  "inputs": "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\nHow to deploy a container on K8s?[/INST]",
  "parameters": {"max_new_tokens": 400}
}
EOF

Wait for the answer, which should look something like this:

To deploy a container on Kubernetes (K8s), you can follow these general steps:
1. Create a Kubernetes deployment manifest file: This file describes the desired state of your application and the container(s) that should be deployed. It includes information such as the container image, port mapping, and resource requests.
2. Create a Kubernetes service manifest file: This file defines a logical set of pods (containers) and exposes them as a network service. It includes information such as the service type (e.g. LoadBalancer, ClusterIP) and port mapping.
3. Apply the deployment and service manifest files to your K8s cluster using the kubectl command-line tool.
4. Verify that the container(s) have been successfully deployed by checking the status of the deployment and service objects in the K8s dashboard or using the kubectl command-line tool.
It’s important to note that the specific steps for deploying a container on K8s may vary depending on your specific use case and the tools and technologies you are using. It’s also a good idea to familiarize yourself with the Kubernetes documentation and best practices for deploying applications on the platform.

You can go to your browser, open the GCP console’s Kubernetes workloads overview, select the mistral-7b deployment, and look in the Overview tab for the GPU usage (memory and duty cycle) of the pod.

Overview of resource usage of the Mistral 7B deployment
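You can also get a quick view of the pod’s CPU and memory consumption from the terminal (this relies on the metrics server that GKE enables by default; GPU metrics are only surfaced through the console and Cloud Monitoring):

kubectl top pod -l app=mistral-7b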

Open another tab and try the Swagger endpoint that’s exposed by the inference server (copy and paste the URL generated by this command into your browser):

LB_IP=$(kubectl get service mistral-7b-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "http://${LB_IP}/docs"
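Before opening the docs page, you can sanity-check the service from the terminal; TGI exposes an /info endpoint that describes the model being served (assuming the load balancer has finished provisioning):

curl -s "http://${LB_IP}/info" | jq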

Once the UI is loaded, select the /generate endpoint and click the “Try it out” button. Then, paste the following JSON request in the text field:

{
  "inputs": "[INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. What is a Kubernetes secret?[/INST]",
  "parameters": {
    "best_of": 1,
    "do_sample": true,
    "max_new_tokens": 400,
    "repetition_penalty": 1.03,
    "return_full_text": false,
    "temperature": 0.5,
    "top_k": 10,
    "top_n_tokens": 5,
    "top_p": 0.95,
    "truncate": null,
    "typical_p": 0.95,
    "watermark": true
  }
}
Swagger docs for the Text Generation Inference server

Then, click the blue “Execute” button right below the text field. You should get something like the following, where you can see the model’s performance data in the response headers:

Query response, with technical data in the headers

Cleaning up

Don’t forget to clean up the resources created in this article once you’ve finished experimenting with GKE and Mistral 7B, as keeping the cluster running for a long time can incur significant costs. To clean up, you just need to delete the GKE cluster:

gcloud container clusters delete $CLUSTER_NAME --region $REGION
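Alternatively, if you’d rather keep the cluster around for further experiments and only stop paying for the GPUs, you can delete just the GPU node pool (a cheaper option, though the remaining cluster still incurs some cost):

gcloud container node-pools delete g2-standard-24 --cluster $CLUSTER_NAME --region $REGION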

Conclusion

This post demonstrates how easy and straightforward it is to deploy AI/ML workloads on GKE, using Mistral 7B as an example. Staying close to the way operations work in the Kubernetes ecosystem makes deploying LLMs in production practical, bringing MLOps one step closer. Also, given the resources these models consume and the number of applications that will be using AI/ML features moving forward, having a framework that offers scalability and cost control features simplifies adoption.

For further information on GKE’s capabilities for orchestrating AI/ML workloads, check the resources included in the AI/ML orchestration on GKE documentation.
