Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Inferencing — Serve Llama 4 on A4 (B200 GPUs) using vLLM and GKE

4 min read · Aug 13, 2025


The worlds of Artificial Intelligence and high-performance computing are not just colliding; they’re fusing. With state-of-the-art hardware like NVIDIA’s B200 GPUs and incredibly capable models like Meta’s Llama 4, the possibilities are expanding daily. In this guide, we’ll walk through how to deploy the Llama 4 Scout model on a Google Cloud A4 VM powered by eight B200 GPUs. We’ll use a GKE Autopilot cluster and vLLM for inference.

You can also jump right to the Google Cloud Documentation with the tutorial here ✅ — Use vLLM on GKE to run inference with Llama 4

Let’s get started!

Here’s what we’ll accomplish:

  • Select the Llama 4 model from Hugging Face.
  • Deploy a GKE Autopilot cluster with a powerful A4 node pool.
  • Use vLLM to serve our model efficiently.
  • Configure a PodMonitoring resource to collect metrics from our vLLM server.
  • Expose the model internally with a ClusterIP service.
  • Run a quick inference test to see it all in action.

Prerequisites

You’ll need a Google Cloud project with billing enabled and a reservation for the A4 machine type to follow this guide.


Set up your environment

Select or create a project to use for your resources and billing

  • Enable the following API
gcloud services enable container.googleapis.com
  • Grant roles to your user account. Run the following command once for each IAM role you need (at minimum, roles/container.admin):
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
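
For example, with roles/container.admin filled in it looks like this (the project ID and the account email are placeholders for your own values):

gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:you@example.com" \
--role="roles/container.admin"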

Configure variables

gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME
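
Before moving on, it's worth confirming that the A4 reservation you plan to reference actually exists in your project. An optional sanity check might look like this:

gcloud compute reservations list --project=$PROJECT_ID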

Select the model from Hugging Face

  1. Sign into your Hugging Face account https://huggingface.co/login
  2. Navigate to the Meta Llama 4 model (Llama-4-Scout-17B-16E-Instruct).
  3. You’ll need to accept the license agreement to get access to Meta’s models.
  4. Next, create a token: click Your Profile > Settings > Access Tokens > + Create new token.
  5. Specify a Name of your choice and a Role of at least Read.
  6. Select Generate a token.
  7. Copy the generated token to your clipboard for later use.
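
If you want to confirm the token works before handing it to the cluster, one quick check (a sketch that assumes the standard Hugging Face whoami endpoint) is:

curl -s -H "Authorization: Bearer $HUGGING_FACE_TOKEN" https://huggingface.co/api/whoami-v2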

Create Cluster and Secret

  1. To create a GKE cluster in Autopilot mode, run the following command:
gcloud container clusters create-auto $CLUSTER_NAME \
--project=$PROJECT_ID \
--region=$REGION \
--release-channel=rapid \
--network=$NETWORK \
--subnetwork=$SUBNETWORK

Creating the GKE cluster might take some time to complete.

2. Connect to the cluster

gcloud container clusters get-credentials $CLUSTER_NAME \
--location=$REGION
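
Optionally, confirm kubectl is pointed at the new cluster. Because Autopilot provisions nodes on demand, the node list may be empty until the workload is scheduled:

kubectl config current-context
kubectl get nodes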

3. Configure a secret for Hugging Face

kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HUGGING_FACE_TOKEN}
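
A quick way to verify the secret landed (this shows the key names and sizes, not the token itself):

kubectl get secret hf-secret
kubectl describe secret hf-secret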

Deploy stuff

Now you can deploy the pieces: a Deployment whose pod runs vLLM and serves the model, a ClusterIP Service to expose the workload, and a PodMonitoring definition to collect metrics from the vLLM container.

  1. Create a deployment manifest called llamadeploy.yaml with the following content.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama4-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama4
  template:
    metadata:
      labels:
        app: llama4
        ai.gke.io/model: llama-4-scout-17b
        ai.gke.io/inference-server: vllm
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250722_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=8
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=4096
        - --max-num-seqs=4
        env:
        - name: MODEL_ID
          value: meta-llama/Llama-4-Scout-17B-16E-Instruct
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: RESERVATION_URL
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-llama-service
spec:
  selector:
    app: llama4
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: llama4-vllm-monitoring
spec:
  selector:
    matchLabels:
      app: llama4
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s

P.S. Make sure to substitute the name of your reservation for RESERVATION_URL, or the pod won’t schedule.
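
One way to do that substitution (a sketch that assumes GNU sed and the RESERVATION_URL variable exported earlier; on macOS use sed -i ''):

sed -i "s|RESERVATION_URL|${RESERVATION_URL}|g" llamadeploy.yaml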

2. Run the deployment

kubectl apply -f llamadeploy.yaml

3. You can monitor the deployment using various commands, for example:

kubectl get deployment # show all deployments
kubectl get pods # show all pods
kubectl describe deployment ADD_DEPLOYMENT_NAME # show the deployment details
kubectl describe pod ADD_POD_NAME # show the pod details
kubectl logs ADD_POD_NAME # show the pod logs
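
Keep in mind that downloading the model weights and starting vLLM can take a while (the probes above allow up to 1800 seconds), so following the rollout and the container logs is usually the easiest way to watch progress:

kubectl rollout status deployment/llama4-vllm --timeout=60m
kubectl logs -f deployment/llama4-vllm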

Test Inference

You can make a call to the LLM with a simple test.

  1. Set up port forwarding to Llama 4 Scout:
kubectl port-forward service/llm-llama-service 8000:8000
  2. Open a new terminal window. You can then chat with your model by using curl:
curl http://127.0.0.1:8000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Describe a sailboat in one short sentence?"
      }
    ]
  }'
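
Through the same port-forward you can also list the models the server exposes and peek at the Prometheus metrics that the PodMonitoring resource scrapes (these are the standard vLLM OpenAI-compatible endpoints):

curl http://127.0.0.1:8000/v1/models
curl -s http://127.0.0.1:8000/metrics | head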

You should get a creative reply straight from your powerful new AI endpoint! From here, you could expose this service publicly with a Load Balancer and build a Streamlit app to interact with it.

To clean up, do the following:

  1. Delete the deployment and secret
kubectl delete -f llamadeploy.yaml
kubectl delete secret hf-secret

2. Delete the cluster

gcloud container clusters delete $CLUSTER_NAME \
--region=$REGION

Learn More

You can also follow this tutorial in the Google Cloud documentation
➡️ https://cloud.google.com/ai-hypercomputer/docs/tutorials/vllm-gke-llama4

Written by Ammett W

DevRel Cloud AI Infra/Networking @ Google | Founder of Start Cloud Now | CCIE#43659, CISSP, Inspiring people as I go along my journey. Learn, Do your best.
