Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

AI/Infra — Serve gpt-oss-120b on Google Cloud A4 (B200 GPUs) using vLLM and GKE

6 min read · Aug 20, 2025


Everyone loves GPT, and now we have open models like gpt-oss! Today, we're diving into a practical guide that shows you exactly how to deploy and run the gpt-oss-120b open-weight model on Google Kubernetes Engine (GKE) Autopilot using the vLLM framework. We'll use an A4 VM, which provides eight NVIDIA B200 GPUs on Google Cloud, for some serious inference action!

You can also just jump right to the Google Cloud Documentation and deploy with the tutorial here
Use vLLM on GKE to run inference with gpt-oss-120b

Let’s get started!

What We’ll Accomplish:

  • Access GPT-OSS-120b: We’ll get connected to the model via Hugging Face.
  • Prepare Your Environment: Get your Google Cloud resources primed and ready.
  • GKE Autopilot Cluster: Spin up a GKE cluster in Autopilot mode
  • Hugging Face Credentials: Securely store your Hugging Face credentials.
  • vLLM Deployment: Deploy the vLLM container to serve our model.
  • Interact and Test: Use curl to send requests to our deployed model.
  • Clean Up: Properly dismantle the resources when you’re done.

Before You Begin:

To embark on this journey, ensure you have the following in place:

  • Google Cloud Project: A Google Cloud project with billing enabled is essential.
  • A4 VM Reservation: You’ll need a reserved capacity for an A4 VM. If you don’t have one, consider exploring capacity reservation options with your Google Cloud representative.

Let’s Get Started!

Step 1: Setting Up Your Environment

First things first, let’s get our Google Cloud environment in order.

  1. Set up gcloud CLI: Ensure you have the Google Cloud SDK (gcloud) installed and configured. You can use your local environment or activate Cloud Shell directly in your browser for an online terminal experience.
  2. Select or Create a Google Cloud Project: Choose an existing project or create a new one. If you plan to clean up resources afterward, creating a new project is a good practice.
  3. Enable the Kubernetes Engine API: (you may need to enable others depending on your project setup)
gcloud services enable container.googleapis.com
  4. Grant IAM Roles: Assign the roles/container.admin role to your user account to manage GKE resources.
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role="roles/container.admin"

Remember to replace PROJECT_ID and USER_IDENTIFIER with your actual project ID and user email.
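
Before moving on, a couple of quick sanity checks (standard gcloud commands, not part of the original tutorial) can confirm the CLI is pointed at the right account and project:

# Confirm the active account and project
gcloud auth list
gcloud config list

# Confirm the Kubernetes Engine API is enabled
gcloud services list --enabled --filter="name:container.googleapis.com"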

Step 2: Configure Your Environment Variables

Let’s set up some crucial environment variables to streamline our commands.

gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL="RESERVATION_URL" # Replace with your actual reservation URL
export REGION="YOUR_REGION" # e.g., us-central1. Use the region where your reservation was made
export CLUSTER_NAME="YOUR_CLUSTER_NAME" # e.g., gpt-oss-cluster
export HUGGING_FACE_TOKEN="YOUR_HF_TOKEN" # Your Hugging Face access token
export NETWORK="default" # Or your custom network name
export SUBNETWORK="default" # Or your custom subnetwork name

Make sure to replace the placeholder values with your specific information.
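
If you're unsure of your reservation's exact name or zone, you can list the capacity reservations in your project first (a quick check, assuming your account can view Compute Engine reservations):

# List capacity reservations to find the value to use for RESERVATION_URL
gcloud compute reservations list --project=$PROJECT_ID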

Step 3: Access GPT-OSS via Hugging Face

  1. Sign in to Hugging Face: Visit huggingface.co and set up or log into your account.
  2. Create a Read Access Token: Navigate to your profile settings, then “Access Tokens,” and click “+ Create new token.” Give it a name (e.g., “gpt-oss-access”) and set the role to “Read.” Copy the generated token — you’ll need it soon!
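
Once the token is exported (see Step 2), an optional way to verify it works is to call the Hugging Face whoami endpoint; it should return your username if the token is valid:

# Quick token sanity check against the Hugging Face API
curl -s -H "Authorization: Bearer $HUGGING_FACE_TOKEN" https://huggingface.co/api/whoami-v2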

Step 4: Create Your GKE Cluster

Now, let’s provision our GKE Autopilot cluster.

gcloud container clusters create-auto $CLUSTER_NAME \
--project=$PROJECT_ID \
--region=$REGION \
--release-channel=rapid \
--network=$NETWORK \
--subnetwork=$SUBNETWORK

This command will create an Autopilot cluster. The process might take a few minutes. You can verify its creation in the Google Cloud Console under Kubernetes Engine.
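
You can also confirm it from the CLI once the command returns:

# The cluster should be listed with STATUS: RUNNING
gcloud container clusters list --region=$REGION
gcloud container clusters describe $CLUSTER_NAME --region=$REGION --format="value(status)"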

Step 5: Create a Kubernetes Secret for Hugging Face Credentials

To allow our deployment to access the model, we need to store your Hugging Face token as a Kubernetes secret.

  1. Configure kubectl: Connect to your newly created GKE cluster.
gcloud container clusters get-credentials $CLUSTER_NAME --region=$REGION
  2. Create the Secret:
kubectl create secret generic hf-secret \
--from-literal=hf_token=${HUGGING_FACE_TOKEN} \
--dry-run=client -o yaml | kubectl apply -f -
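
To confirm the secret landed in the cluster (optional check), list it; it should show a single data key, hf_token:

kubectl get secret hf-secret
kubectl describe secret hf-secret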

Step 6: Deploy the vLLM Container

We’ll define the deployment of the vLLM container using a YAML file. Create a file named vllm-gpt-oss-120b.yaml with the following content:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gpt-oss-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpt-oss
  template:
    metadata:
      labels:
        app: gpt-oss
        ai.gke.io/model: gpt-oss-120b
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: vllm-inference
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250822_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=2
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=8192
        - --max-num-seqs=4
        env:
        - name: MODEL_ID
          value: "openai/gpt-oss-120b"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1200
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1200
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: $RESERVATION_URL
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: oss-service
spec:
  selector:
    app: gpt-oss
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: vllm-gpt-oss-monitoring
spec:
  selector:
    matchLabels:
      app: gpt-oss
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s

Now, apply this manifest to your GKE cluster:

envsubst < vllm-gpt-oss-120b.yaml | kubectl apply -f -

The deployment process might take up to 20 minutes as the container downloads the gpt-oss-120b model from Hugging Face. You can monitor the deployment's progress:

kubectl wait \
--for=condition=Available \
--timeout=1200s deployment/vllm-gpt-oss-deployment

Other helpful commands

kubectl get pods
kubectl describe deployment ADD_DEPLOYMENT_NAME # shows the deployment details
kubectl describe pod ADD_POD_NAME # shows the pod details
kubectl logs ADD_POD_NAME # shows the pod logs

Step 7: Interact with the GPT-OSS Model

Let’s test our deployed model!

  1. Set up Port Forwarding:
kubectl port-forward service/oss-service 8000:8000

  2. Send a Request with curl: Open a new terminal window and run the following command:

curl http://127.0.0.1:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "model": "openai/gpt-oss-120b",
  "messages": [
    {
      "role": "user",
      "content": "Describe a mountain goat in one short sentence?"
    }
  ]
}'

The model will reason over the request, which may take a little time. Once it's finished, you should get a creative reply straight from the AI endpoint!
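
If you want to confirm which model ID the server registered, vLLM's OpenAI-compatible server also exposes a models endpoint. Through the same port-forward:

# Should list openai/gpt-oss-120b as the served model
curl http://127.0.0.1:8000/v1/models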

From here, you could expose this service publicly with a Load Balancer or gateway and build a Streamlit app or other app to interact with it.
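
As a rough sketch of that idea (not part of the tutorial, and note it exposes the endpoint to the internet without authentication), you could put an external LoadBalancer Service in front of the same pods, here named oss-service-external for illustration:

# Creates an internet-facing LoadBalancer for the existing deployment.
# For anything beyond a quick demo, prefer a Gateway/Ingress with auth.
kubectl expose deployment vllm-gpt-oss-deployment \
  --name=oss-service-external \
  --type=LoadBalancer \
  --port=8000 \
  --target-port=8000

# Wait for an EXTERNAL-IP to be assigned, then query it as before
kubectl get service oss-service-external --watch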

Step 8: Observe Model Performance

You can monitor the performance of your model by leveraging the vLLM dashboard integration within Cloud Monitoring. This provides insights into critical metrics like token throughput, request latency, and error rates. For detailed information on collecting metrics, consult the vLLM observability guidance in the Cloud Monitoring documentation.
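
For a quick look at the raw Prometheus metrics vLLM exports (the same endpoint the PodMonitoring resource scrapes), you can hit the /metrics path through the existing port-forward; the metric names carry a vllm: prefix:

# Peek at the first few vLLM metrics (throughput, queue depth, etc.)
curl -s http://127.0.0.1:8000/metrics | grep "^vllm:" | head -20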

Step 9: Clean Up

  1. Delete the Kubernetes resources:
kubectl delete -f vllm-gpt-oss-120b.yaml
kubectl delete secret hf-secret

  2. Delete the cluster:

gcloud container clusters delete $CLUSTER_NAME \
--region=$REGION
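
If you created a dedicated project for this walkthrough (as suggested in Step 1), you can optionally remove everything in one go by deleting the project instead. This is irreversible, so double-check the project ID first:

# Deletes the whole project and every resource in it (irreversible)
gcloud projects delete PROJECT_ID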

You’ve successfully deployed and interacted with GPT-OSS-120b on GKE! From here, you can explore:

Learn More

You can check out the gpt-oss-120b tutorial in the Google Cloud documentation here.

Other tutorials (Qwen3, Gemma & Llama 4):
Use vLLM on GKE to run inference with Qwen3
Use vLLM on GKE to run inference with Llama 4
Deploy and serve Gemma 3 27B inference with vLLM on GKE

To connect or ask a question, please reach out. I'll be in touch.


Written by Ammett W

DevRel Cloud AI Infra/Networking @ Google | Founder of Start Cloud Now | CCIE#43659, CISSP, Inspiring people as I go along my journey. Learn, Do your best.
