Inferencing — Serve Llama 4 on A4 (B200 GPUs) using vLLM and GKE
The worlds of Artificial Intelligence and high-performance computing are not just colliding; they’re fusing. With state-of-the-art hardware like NVIDIA’s B200 GPUs and incredibly capable models like Meta’s Llama 4, the possibilities are expanding daily. In this guide, we’ll walk through how to deploy the Llama 4 Scout model on a Google Cloud A4 VM powered by eight B200 GPUs. We’ll use a GKE Autopilot cluster and vLLM for inference.
You can also jump straight to the tutorial in the Google Cloud documentation ✅ — Use vLLM on GKE to run inference with Llama 4
Let’s get started!
Here’s what we’ll accomplish:
- Select the Llama 4 model from Hugging Face.
- Deploy a GKE Autopilot cluster with a powerful A4 node pool.
- Use vLLM to serve our model efficiently.
- Configure a PodMonitoring resource to collect metrics from our vLLM server.
- Expose the model internally with a ClusterIP service.
- Run a quick inference test to see it all in action.
Prerequisites
You’ll need a Google Cloud project with billing enabled and a reservation for the A4 machine type to follow this guide.
Set up your environment
Select or create a project to use for your resources and billing
- Enable the following API:
gcloud services enable container.googleapis.com
- Grant roles to your user account. Run the following command once for each of the following IAM roles:
roles/container.admin
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
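For example, granting the roles/container.admin role from the list above looks like this (the email address is just a placeholder for your own user account):

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:your-account@example.com" \
  --role="roles/container.admin"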
Configure variables
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME
Select model from Hugging Face
- Sign in to your Hugging Face account: https://huggingface.co/login
- Navigate to the Meta Llama 4 model (Llama-4-Scout-17B-16E-Instruct).
- You'll need to accept the license agreement to get access to Meta's models.
- Next, create a token: click Your Profile > Settings > Access Tokens > + Create new token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard for later use.
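If you'd like to confirm the token is valid before wiring it into the cluster, you can call the Hugging Face whoami endpoint with it (an optional check; substitute the token you just copied):

curl -s -H "Authorization: Bearer HUGGING_FACE_TOKEN" https://huggingface.co/api/whoami-v2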
Create Cluster and Secret
1. Create a GKE cluster in Autopilot mode by running the following command:
gcloud container clusters create-auto $CLUSTER_NAME \
--project=$PROJECT_ID \
--region=$REGION \
--release-channel=rapid \
--network=$NETWORK \
--subnetwork=$SUBNETWORK
Creating the GKE cluster might take some time to complete.
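Optionally, you can poll the cluster status while you wait; once it reports RUNNING, you're ready to connect:

gcloud container clusters describe $CLUSTER_NAME \
  --region=$REGION \
  --format="value(status)"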
2. Connect to the cluster:
gcloud container clusters get-credentials $CLUSTER_NAME \
--location=$REGION
3. Configure a secret for Hugging Face:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HUGGING_FACE_TOKEN}
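You can optionally verify that the secret exists without printing the token itself:

kubectl describe secret hf-secret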
Deploy the workloads
Now you can deploy the workloads: a Deployment that runs pods with vLLM serving the model, a ClusterIP Service to expose the workload, and a PodMonitoring resource to collect metrics from the vLLM container.
1. Create a deployment manifest called llamadeploy.yaml with the following content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama4-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama4
  template:
    metadata:
      labels:
        app: llama4
        ai.gke.io/model: llama-4-scout-17b
        ai.gke.io/inference-server: vllm
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250722_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=8
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=4096
        - --max-num-seqs=4
        env:
        - name: MODEL_ID
          value: meta-llama/Llama-4-Scout-17B-16E-Instruct
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: RESERVATION_URL
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-llama-service
spec:
  selector:
    app: llama4
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: llama4-vllm-monitoring
spec:
  selector:
    matchLabels:
      app: llama4
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s
Note: Ensure you put in the name of your reservation in place of RESERVATION_URL for this to work.
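Before applying the manifest, you can optionally validate it client-side to catch YAML or schema mistakes early:

kubectl apply --dry-run=client -f llamadeploy.yaml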
2. Run the deployment
kubectl apply -f llamadeploy.yaml
3. You can monitor the deployment using various commands, for example:
kubectl get deployment                             # show all deployments
kubectl get pods                                   # show all pods
kubectl describe deployment ADD_DEPLOYMENT_NAME   # show the deployment details
kubectl describe pod ADD_POD_NAME                  # show the pod details
kubectl logs ADD_POD_NAME                          # show the pod logs
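Because the image is large and the model weights are downloaded at startup, the pod can take a while to become ready (note the 1800-second initial probe delays in the manifest). Two handy ways to wait and watch progress, using the llama4-vllm deployment name from the manifest above:

kubectl rollout status deployment/llama4-vllm   # blocks until the rollout completes
kubectl logs -f deployment/llama4-vllm          # stream the vLLM server logs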
Test Inference
You can make a call to the LLM with a simple test.
- Set up port forwarding to Llama 4 Scout:
kubectl port-forward service/llm-llama-service 8000:8000
- Open a new terminal window. You can then chat with your model by using curl:
curl http://127.0.0.1:8000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Describe a sailboat in one short sentence?"
      }
    ]
  }'
You should get a creative reply straight from your powerful new AI endpoint! From here, you could expose this service publicly with a Load Balancer and build a Streamlit app to interact with it.
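Because vLLM exposes an OpenAI-compatible API, a couple of other endpoints are useful for quick checks while the port-forward is running; these are optional extras beyond the tutorial steps:

curl http://127.0.0.1:8000/v1/models            # list the model(s) the server is serving
curl -s http://127.0.0.1:8000/metrics | head    # peek at the Prometheus metrics scraped by PodMonitoring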
Clean up
To clean up, do the following:
1. Delete the deployment and secret:
kubectl delete -f llamadeploy.yaml
kubectl delete secret hf-secret
2. Delete the cluster:
gcloud container clusters delete $CLUSTER_NAME \
--region=$REGION
Learn More
You can follow this tutorial in the Google Cloud documentation:
➡️ https://cloud.google.com/ai-hypercomputer/docs/tutorials/vllm-gke-llama4

