AI Inferencing — Serve DeepSeek v3.1 Base on Google Cloud A4 (B200 GPUs) using vLLM and GKE
DeepSeek changed the LLM game when it was first released. In this blog demo you’ll run the new DeepSeek v3.1 Base model on state-of-the-art NVIDIA B200 GPUs, available in the Google Cloud A4 VM family.
We’ll use a GKE Autopilot cluster and vLLM for inference.
You can also jump straight to the tutorial in the Google Cloud documentation here:
✅ Use vLLM on GKE to run inference with DeepSeek-V3.1-Base
Let’s get started!
Here’s what we’ll accomplish:
- Select the DeepSeek v3.1 Base model on Hugging Face.
- Deploy a GKE Autopilot cluster with a powerful A4 node pool.
- Use vLLM to serve our model efficiently.
- Configure a PodMonitoring resource to collect metrics from our vLLM server.
- Expose the model internally with a ClusterIP service.
- Run a quick inference test to see it all in action.
Prerequisites
You’ll need a Google Cloud project with billing enabled and a reservation for the A4 machine type to follow this guide. To get a future reservation for the A3 Ultra, A4, and A4X VM families you may need to contact your TAM or Sales team.
Set up your environment
Select or create a project to use for your resources and billing
- Enable the following API
gcloud services enable container.googleapis.com
- Grant roles to your user account. Run the following command once for each of the following IAM roles:
roles/container.admin
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
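For example, with a hypothetical project ID and user (substitute your own values):
gcloud projects add-iam-policy-binding my-gcp-project \
--member="user:you@example.com" \
--role="roles/container.admin"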
Configure variables
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME
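As a purely illustrative example (these values are hypothetical; use your own project, reservation, region, and network names):
gcloud config set project my-gcp-project
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=my-a4-reservation
export REGION=us-central1
export CLUSTER_NAME=deepseek-a4-cluster
export HUGGING_FACE_TOKEN=hf_xxxxxxxxxxxxxxxx
export NETWORK=default
export SUBNETWORK=default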
Select model from Hugging Face
- Sign in to your Hugging Face account: https://huggingface.co/login
- Navigate to the DeepSeek model page (DeepSeek v3.1 Base).
- Review and accept the model’s license agreement if prompted to get access.
- Next, create a token: click Your Profile > Settings > Access Tokens > + Create new token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard for later use.
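Optionally, you can sanity-check the token before wiring it into the cluster. A minimal check against the Hugging Face whoami endpoint, assuming the token is exported as HUGGING_FACE_TOKEN:
curl -s -H "Authorization: Bearer ${HUGGING_FACE_TOKEN}" https://huggingface.co/api/whoami-v2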
Create Cluster and Secret
1. Create a GKE cluster in Autopilot mode by running the following command:
gcloud container clusters create-auto $CLUSTER_NAME \
--project=$PROJECT_ID \
--region=$REGION \
--release-channel=rapid \
--network=$NETWORK \
--subnetwork=$SUBNETWORK
Creating the GKE cluster might take some time to complete.
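If you want to check on progress from another terminal, one option (a small sketch using standard gcloud commands) is to poll the cluster status until it reports RUNNING:
gcloud container clusters describe $CLUSTER_NAME \
--region=$REGION \
--format="value(status)"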
2. Connect to the cluster:
gcloud container clusters get-credentials $CLUSTER_NAME \
--location=$REGION
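Once connected, a quick sanity check that kubectl is pointed at the new cluster:
kubectl config current-context # should show your new cluster
kubectl get nodes # Autopilot provisions the B200 nodes only after the workload is deployed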
3. Configure a secret for Hugging Face:
kubectl create secret generic hf-secret \
--from-literal=hf_token=${HUGGING_FACE_TOKEN}
Deploy stuff
Now you can deploy everything: a Deployment that runs vLLM and serves the model, a ClusterIP Service to expose the workload, and a PodMonitoring definition to collect metrics from the vLLM container.
1. Create a deployment manifest called vllm-deepseek3-1-base.yaml with the following content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek3-1-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek
  template:
    metadata:
      labels:
        app: deepseek
        ai.gke.io/model: deepseek-v3-1-base
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: vllm-inference
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250819_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "1000Gi"
            ephemeral-storage: "1Ti"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "1000Gi"
            ephemeral-storage: "1Ti"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=8
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=8192
        - --max-num-seqs=4
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-V3.1-Base
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: RESERVATION_URL
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-service
spec:
  selector:
    app: deepseek
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: deepseek-monitoring
spec:
  selector:
    matchLabels:
      app: deepseek
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s
P.S. Ensure you put in the name of your reservation in place of RESERVATION_URL for this to work.
2. Run the deployment
kubectl apply -f vllm-deepseek3-1-base.yaml
3. You can monitor the deployment using various commands, for example:
kubectl get deployment # shows all deployments
kubectl get pods # shows all pods
kubectl describe deployment ADD_DEPLOYMENT_NAME # shows you the deployment details
kubectl describe pod ADD_POD_NAME # shows you the details of a pod
kubectl logs ADD_POD_NAME # shows you the pod logs
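Once the vLLM server is up (the readiness probe allows up to 30 minutes for the model to download and load), you can also spot-check the Prometheus metrics endpoint that the PodMonitoring resource scrapes. A quick sketch using port forwarding:
kubectl port-forward service/deepseek-service 8000:8000
# in a second terminal:
curl -s http://127.0.0.1:8000/metrics | head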
Test Inference
You can make a call to the LLM with a simple test.
1. Set up port forwarding to the DeepSeek service:
kubectl port-forward service/deepseek-service 8000:8000
2. Open a new terminal window. You can then chat with your model by using curl:
curl http://127.0.0.1:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V3.1-Base",
"messages": [
{
"role": "user",
"content": "Describe how generative AI works in one short and easy to understand sentence"
}
],
"stream":false
}'
This model should give you a creative reply straight from the AI endpoint once it’s finished thinking!
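The vLLM OpenAI-compatible server also supports streaming, so you can watch tokens arrive as they are generated. The same request with "stream": true (the -N flag stops curl from buffering the output):
curl -N http://127.0.0.1:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V3.1-Base",
"messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
"stream": true
}'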
From here, you could expose this service publicly with a load balancer or Gateway and build a Streamlit app to interact with it.
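As a rough sketch of the load balancer route (note: this exposes the endpoint publicly with no authentication, so treat it as a demo-only example), you could switch the existing Service to type LoadBalancer and wait for an external IP:
kubectl patch service deepseek-service -p '{"spec": {"type": "LoadBalancer"}}'
kubectl get service deepseek-service --watch # wait for an EXTERNAL-IP to appear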
To clean up, do the following:
1. Delete the deployment and secret:
kubectl delete -f vllm-deepseek3-1-base.yaml
kubectl delete secret hf-secret
2. Delete the cluster:
gcloud container clusters delete $CLUSTER_NAME \
--region=$REGION
Learn More
You can check out the tutorials in the Google Cloud documentation here:
✅ Use vLLM on GKE to run inference with DeepSeek-V3.1-Base
Other tutorials on A4 VMs (Gemma, Llama 4, Qwen3, gpt-oss-120b):
✅ Use vLLM on GKE to run inference with Qwen3
✅ Use vLLM on GKE to run inference with Llama 4
✅ Deploy and serve Gemma 3 27B inference with vLLM on GKE
✅ Use vLLM on GKE to run inference with gpt-oss-120b
To connect or ask a question please check me out on LinkedIn.
I’ll be in touch

