Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

AI Inferencing — Serve DeepSeek v3.1 Base on Google Cloud A4 (B200 GPUs) using vLLM and GKE

5 min read · Sep 30, 2025


DeepSeek changed the LLM game when it was first released. In this blog demo, you’ll run the new DeepSeek v3.1 Base model on state-of-the-art NVIDIA B200 GPUs, available through the Google Cloud A4 VM family.

We’ll use a GKE Autopilot cluster and vLLM for inference.

You can also jump right to the tutorial in the Google Cloud documentation here:
Use vLLM on GKE to run inference with DeepSeek-V3.1-Base

Let’s get started!

Here’s what we’ll accomplish:

  • Select the DeepSeek v3.1 Base model on Hugging Face.
  • Deploy a GKE Autopilot cluster with a powerful A4 node pool.
  • Use vLLM to serve our model efficiently.
  • Configure a PodMonitoring resource to get metrics from our vLLM server.
  • Expose the model internally with a ClusterIP service.
  • Run a quick inference test to see it all in action.

Prerequisites

You’ll need a Google Cloud project with billing enabled and a reservation for the A4 machine type to follow this guide. To get a future reservation for the A3 Ultra, A4, and A4X VM families, you may need to contact your TAM or sales team.
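If you already have a reservation and want to confirm it’s visible to your project, you can list reservations with gcloud (assuming the reservation was created in, or shared with, this project):

# List compute reservations visible to the project
gcloud compute reservations list --project=PROJECT_ID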


Set up your environment

Select or create a project to use for your resources and billing

  • Enable the following API:
gcloud services enable container.googleapis.com
  • Grant IAM roles to your user account. Run the following command once for each of the following roles: roles/container.admin
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
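For example, granting roles/container.admin to a user account would look like this (the email address below is just a placeholder):

# Hypothetical example; substitute your own project ID and account
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:you@example.com" --role="roles/container.admin"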

Configure variables

gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME
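Before moving on, it can help to echo the values back and catch any typos:

# Quick sanity check that the variables are set as expected
echo "Project: ${PROJECT_ID}"
echo "Region: ${REGION}"
echo "Cluster: ${CLUSTER_NAME}"
echo "Reservation: ${RESERVATION_URL}"
echo "Network/Subnet: ${NETWORK}/${SUBNETWORK}"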

Select model from Hugging Face

  1. Sign into your Hugging Face account https://huggingface.co/login
  2. Navigate to the DeepSeek model page (DeepSeek v3.1 Base).
  3. Review and accept the model’s license agreement, if prompted, to get access to the model.
  4. Next, create a token: click Your Profile > Settings > Access Tokens > + Create new token.
  5. Specify a Name of your choice and a Role of at least Read.
  6. Select Generate a token.
  7. Copy the generated token to your clipboard for later use.
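If you’d like to confirm the token works before wiring it into Kubernetes, one quick check (assuming the standard Hugging Face whoami endpoint) is:

# Should return your Hugging Face account details if the token is valid
curl -s -H "Authorization: Bearer ${HUGGING_FACE_TOKEN}" https://huggingface.co/api/whoami-v2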

Create Cluster and Secret

  1. To create a GKE cluster in Autopilot mode, run the following command:
gcloud container clusters create-auto $CLUSTER_NAME \
--project=$PROJECT_ID \
--region=$REGION \
--release-channel=rapid \
--network=$NETWORK \
--subnetwork=$SUBNETWORK

Creating the GKE cluster might take some time to complete.

2. Connect to the cluster

gcloud container clusters get-credentials $CLUSTER_NAME \
--location=$REGION
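A quick way to confirm kubectl is now pointed at the new cluster:

# Verify the active context and that the API server responds
kubectl config current-context
kubectl get namespaces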

3. Configure a secret for your Hugging Face token

kubectl create secret generic hf-secret \
--from-literal=hf_token=${HUGGING_FACE_TOKEN}
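You can confirm the secret landed (without printing the token itself) with:

kubectl get secret hf-secret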

Deploy stuff

Now you can deploy everything: a Deployment that runs vLLM and serves the model, a ClusterIP Service to expose the workload, and a PodMonitoring definition to collect metrics from the vLLM container.

  1. Create a deployment manifest called vllm-deepseek3-1-base.yaml with the following content.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek3-1-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek
  template:
    metadata:
      labels:
        app: deepseek
        ai.gke.io/model: deepseek-v3-1-base
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: vllm-inference
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250819_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "1000Gi"
            ephemeral-storage: "1Ti"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "1000Gi"
            ephemeral-storage: "1Ti"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=8
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=8192
        - --max-num-seqs=4
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-V3.1-Base
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: RESERVATION_URL
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-service
spec:
  selector:
    app: deepseek
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: deepseek-monitoring
spec:
  selector:
    matchLabels:
      app: deepseek
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s

P.S. Make sure you put the name of your reservation in place of RESERVATION_URL for this to work.
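One way to do that substitution in place (assuming GNU sed and that the RESERVATION_URL variable you exported earlier holds the value you want in the manifest) is:

# Replace the RESERVATION_URL placeholder in the manifest
sed -i "s|RESERVATION_URL|${RESERVATION_URL}|g" vllm-deepseek3-1-base.yaml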

2. Run the deployment

kubectl apply -f vllm-deepseek3-1-base.yaml

3. You can monitor the deployment using various commands, for example:

kubectl get deployment                           # show all deployments
kubectl get pods                                 # show all pods
kubectl describe deployment ADD_DEPLOYMENT_NAME  # show deployment details
kubectl describe pod ADD_POD_NAME                # show pod details
kubectl logs ADD_POD_NAME                        # show pod logs
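Because the image is large and the DeepSeek v3.1 Base weights take a while to download (note the 1800-second initial probe delays in the manifest), it’s handy to watch the rollout and stream logs while you wait:

# Wait for the Deployment to become available (this can take a long time)
kubectl rollout status deployment/deepseek3-1-deploy --timeout=60m

# Follow the vLLM container logs
kubectl logs -f -l app=deepseek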

Test Inference

You can make a call to the LLM with a simple test.

  1. Set up port forwarding to the DeepSeek service:
kubectl port-forward service/deepseek-service 8000:8000
  2. Open a new terminal window. You can then chat with your model by using curl:
curl http://127.0.0.1:8000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.1-Base",
    "messages": [
      {
        "role": "user",
        "content": "Describe how generative AI works in one short and easy to understand sentence"
      }
    ],
    "stream": false
  }'

This model should give you a creative reply straight from the AI endpoint once it’s finished thinking!
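Since DeepSeek v3.1 Base is a base model rather than a chat-tuned one, you can also try vLLM’s plain completions endpoint. A minimal example, assuming the port-forward from the previous step is still running:

curl http://127.0.0.1:8000/v1/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.1-Base",
    "prompt": "Generative AI works by",
    "max_tokens": 64
  }'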

From here, you could expose this service publicly with a Load Balancer or Gateway and build a Streamlit app to interact with it.
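As a rough sketch only (you’d want authentication and rate limiting in front of it before real use), exposing the same pods through an external load balancer could look something like this:

apiVersion: v1
kind: Service
metadata:
  name: deepseek-service-external   # hypothetical name for an externally exposed Service
spec:
  selector:
    app: deepseek
  type: LoadBalancer   # provisions an external load balancer; lock this down before exposing it
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000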

To clean up, do the following:

  1. Delete the deployment and secret
kubectl delete -f vllm-deepseek3-1-base.yaml
kubectl delete secret hf-secret

2. Delete the cluster

gcloud container clusters delete $CLUSTER_NAME \
--region=$REGION
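Once the delete completes, you can confirm no clusters are left running (and billing) with:

gcloud container clusters list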

Learn More

You can check out the tutorial in the Google Cloud documentation here:
Use vLLM on GKE to run inference with DeepSeek-V3.1-Base

Other tutorials on A4 VMs (Gemma, Llama 4, Qwen3, gpt-oss-120B):

Use vLLM on GKE to run inference with Qwen3
Use vLLM on GKE to run inference with Llama 4
Deploy and serve Gemma 3 27B inference with vLLM on GKE
Use vLLM on GKE to run inference with gpt-oss-120b

To connect or ask a question, please check me out on LinkedIn.

I’ll be in touch


Written by Ammett W

DevRel Cloud AI Infra/Networking @ Google | Founder of Start Cloud Now | CCIE#43659, CISSP, Inspiring people as I go along my journey. Learn, Do your best.
