
Scale-to-Zero LLM Inference with vLLM, Cloud Run and Cloud Storage FUSE


TL;DR: Want to run LLMs like DeepSeek R1 in production without breaking the bank? This article shows you how to use vLLM on Google Cloud Run for scale-to-zero inference, meaning you only pay for GPUs while the model is actively serving requests. We’ll also cover how to easily deploy and serve models from Hugging Face by leveraging Cloud Storage FUSE, making internal LLM tools (like a private ChatGPT) and R&D projects much easier to set up.

The landscape of Large Language Models (LLMs) is rapidly evolving. Recent advancements, such as DeepSeek R1, have ignited discussions about the capabilities of open-source models and their competitiveness with leading proprietary alternatives. Concurrently, we’re witnessing a surge in individuals and organizations exploring LLMs, from enthusiasts experimenting on personal hardware to businesses seeking private, secure AI solutions tailored to their data.

While tools like Ollama simplify the process of testing various LLMs from platforms like Hugging Face, deploying these models with scaling in mind demands a more robust and optimized solution.

This article introduces vLLM, a powerful open-source library designed for production-grade serving and inference of Large Language Models (LLMs). It demonstrates how to use vLLM in a serverless environment like Google Cloud Run to achieve scale-to-zero functionality. Additionally, we’ll examine the use of Cloud Storage FUSE, which allows seamless integration of models from Hugging Face — in this case, a distilled version of DeepSeek R1 — directly into vLLM for serving, simply by mounting the model weights.

This deployment strategy enables organizations to optimize GPU resource allocation by dynamically scaling GPU instances based on real-time demand. Consequently, computational resources, and their associated costs, are only incurred during periods of active model inference, eliminating expenses during idle periods. Such a solution is particularly useful for private AI use cases like private ChatGPT alternatives for employees, and for research and development projects where computational needs may be intermittent or unpredictable.

For those seeking a production-ready, always-on LLM deployment with the flexibility to scale resources as needed, explore Vertex AI Endpoints (some useful resources in the references section), or wait for the next articles where I will talk about Ray and vLLM 😀.

🥜 vLLM in a nutshell

vLLM is a powerful, free, and open-source tool designed to make LLMs run fast and efficiently in real-world applications. While both vLLM and Ollama aim to simplify working with LLMs, they serve different purposes and target different use cases. Ollama excels at providing a user-friendly interface for experimenting with and running various LLMs locally. It’s great for quickly trying out different models and exploring their capabilities. vLLM, on the other hand, is engineered for production-ready LLM serving and inference, as also noted in the Google Cloud documentation.

vLLM utilizes sophisticated techniques like PagedAttention, a key optimization borrowed from operating system memory management. This approach allows vLLM to efficiently handle long text sequences and numerous simultaneous requests by optimizing how the KV cache is stored and reused. By managing memory similarly to how operating systems handle virtual memory and paging, vLLM can significantly improve performance, especially when serving multiple users or processing complex prompts. The result is much higher throughput and lower latency compared to other LLM serving solutions like Ollama, and vLLM can also be deployed in cluster mode to distribute the load across multiple GPUs.

I will leave some useful links in the references section if you want to dive deeper into how vLLM works under the hood.

📐 Serverless architecture overview

Cloud Run, based on Knative, simplifies container deployment and management, offering autoscaling capabilities, including scaling down to zero instances when idle. Recent enhancements, such as GPU support and Cloud Storage volume mounts, have further solidified Cloud Run’s suitability for AI and machine learning deployments.

Then, why not leverage Cloud Run’s GPU capabilities to run vLLM? And what about plugging in model weights stored in a Cloud Storage bucket using GCS volume mounts?

Let’s take a look at the following diagram:

Architecture overview

vLLM provides a base Docker image that can be used to deploy it anywhere. This image includes an OpenAI-compatible server for interacting with the underlying model loaded at startup time. In this scenario, we will leverage Cloud Build and Artifact Registry to build the image and push it to a private Docker repository.

Separately, LLM models are downloaded from Hugging Face via another Cloud Build pipeline and then stored in a Cloud Storage bucket (See the references for information on securely accessing gated models using Secret Manager). These model weights are then mounted to the Cloud Run instance and loaded by vLLM during startup.

Cloud Run & GPUs

GPU support is in pre-GA and is available in specific GCP regions for AI inference workloads. At the time of writing, the only available GPU is the NVIDIA L4, equipped with 24 GB of dedicated VRAM, separate from the instance’s main memory. As a rough sanity check, an 8B-parameter model in bfloat16 takes about 16 GB for the weights alone (8B parameters × 2 bytes), so the distilled DeepSeek model used here fits on a single L4 with headroom left for the KV cache.

Cloud Run GCS volume mount (Cloud Storage FUSE)

Cloud Storage FUSE is an open-source tool backed by Google that seamlessly integrates Cloud Storage buckets into the local file system by leveraging FUSE and the Cloud Storage APIs.

As mentioned in the documentation, Cloud Storage FUSE is well-suited for machine learning workloads, including storing training data, checkpoints, and, importantly, model weights. Cloud Run’s support for GCS volume mounts leverages Cloud Storage FUSE under the hood, making it a natural fit for serving LLMs.

Another option I often see is to store the model weights inside the Docker image. The two approaches have pros and cons, which are well explained in this table:

From Google Cloud — GPU best practices

I preferred to use a Cloud Storage FUSE volume mount to keep the vLLM image model-agnostic. In the proposed configuration, changing models is straightforward: simply download the new weights, update the model name in the environment variable, and that’s it! Your new model is ready to go, reusing the same vLLM image.

If you instead store the weights in your Docker image, you have to rebuild the image every time you change models, but model loading will be faster.
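To make this concrete, here is roughly what a model swap looks like with the volume-mount approach, reusing the variables and the download pipeline defined in the deployment steps below (a sketch; <NEW_HF_MODEL_NAME> is just a placeholder for the Hugging Face model you want to switch to):

# Download the new model weights into the same bucket (see Steps 6-7)
gcloud builds submit \
  --region $REGION \
  --project $PROJECT_ID \
  --substitutions=_MODEL_NAME=<NEW_HF_MODEL_NAME>,_DESTINATION_BUCKET=${MODELS_BUCKET_NAME} \
  --config=cloudbuild.yaml

# Point the existing Cloud Run service at the new model; no image rebuild needed
gcloud beta run services update $RUN_NAME \
  --region $REGION \
  --project $PROJECT_ID \
  --update-env-vars=MODEL_NAME=<NEW_HF_MODEL_NAME>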

🥽 Let’s deploy!

Run the following steps on your favorite shell, like Cloud Shell.

Step 0: Init some environment variables

Specify the name of the service account you wish to create, the region, labels (so your FinOps team knows what you’re working on, right?), the Hugging Face model name (a distilled version of DeepSeek R1 in this instance), and the name of the bucket where the weights will be stored.

SERVICE_ACCOUNT=<SERVICE_ACCOUNT_NAME>
REGION=<GCP_REGION_OF_CHOICE>
LABELS=purpose=research,owner=danilotrombino
DOCKER_REPO_NAME=private-ai-docker-repo
LLM_MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
PROJECT_ID=<GCP_PROJECT_ID>
MODELS_BUCKET_NAME=<BUCKET_NAME>
RUN_NAME=<CLOUD_RUN_NAME>

Step 1: Create a Service Account for your Cloud Run service

gcloud iam service-accounts create $SERVICE_ACCOUNT \
--display-name "Cloud Run vLLM Model Serving SA" --project $PROJECT_ID

Step 2: Create a Docker repository on Artifact Registry

I suggest enabling vulnerability scanning; it’s a life-saver, trust me!

gcloud artifacts repositories create $DOCKER_REPO_NAME \
--repository-format docker \
--project $PROJECT_ID \
--location $REGION \
--labels=$LABELS \
--allow-vulnerability-scanning

Step 3: Write the Dockerfile for the vLLM server

Ensure you parameterize configurations like MODEL_NAME, GPU_MEMORY_UTILIZATION, and MAX_MODEL_LEN. These are the parameters you'll likely need to adjust when adapting to different models.

# Start from the official vLLM image, which ships an OpenAI-compatible API server
FROM vllm/vllm-openai:latest

# HF_HOME is overridden at deploy time (Step 9) to point at the mounted GCS volume.
# HF_HUB_OFFLINE=1 makes vLLM load weights from the local cache instead of
# reaching out to Hugging Face at startup.
ENV HF_HOME=$MODEL_DOWNLOAD_DIR
ENV HF_HUB_OFFLINE=1

# Shell-form ENTRYPOINT so the environment variables set on the Cloud Run service
# (PORT, MODEL_NAME, GPU_MEMORY_UTILIZATION, MAX_MODEL_LEN) are expanded at runtime
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \
    --port ${PORT:-8080} \
    --model $MODEL_NAME \
    --trust-remote-code \
    --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION:-0.95} \
    ${MAX_MODEL_LEN:+--max-model-len "$MAX_MODEL_LEN"}

Step 4: Build & Push the image to Artifact Registry

For convenience, I used the following command as a shortcut to build the image.

gcloud builds submit \
--region $REGION \
--project $PROJECT_ID \
--tag ${REGION}-docker.pkg.dev/${PROJECT_ID}/${DOCKER_REPO_NAME}/vllm
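
You can verify that the image has been pushed by listing the images in the repository:

gcloud artifacts docker images list ${REGION}-docker.pkg.dev/${PROJECT_ID}/${DOCKER_REPO_NAME}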

Step 5: Create a Google Cloud Storage bucket

gcloud storage buckets create gs://${MODELS_BUCKET_NAME} --location $REGION --project $PROJECT_ID
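
One detail worth double-checking: Cloud Run accesses the mounted bucket with the service’s identity, so depending on your project’s IAM setup you may need to grant the service account read access to the bucket, for example:

gcloud storage buckets add-iam-policy-binding gs://${MODELS_BUCKET_NAME} \
  --member="serviceAccount:${SERVICE_ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"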

Step 6: Prepare the Cloud Build pipeline to download HF models

There are several ways to download a model from Hugging Face; this is the simplest one that is still fast (thanks to hf_transfer). You can configure a parameterized pipeline so you only need to provide the model name (see the links below for information on working with gated models).

steps:
  # Download the model from Hugging Face into the build workspace
  - name: python
    id: download_hf_model
    entrypoint: 'bash'
    args:
      - -c
      - |
        pip install -U huggingface_hub[hf_transfer]
        export HF_HOME=/workspace/models-cache
        export HF_HUB_ENABLE_HF_TRANSFER=1
        huggingface-cli download ${_MODEL_NAME}
  # Copy the downloaded cache (including the hub/ directory) to the destination bucket
  - name: 'gcr.io/cloud-builders/gcloud'
    id: move_model_to_gcs
    args:
      - storage
      - cp
      - --recursive
      - /workspace/models-cache/*
      - gs://${_DESTINATION_BUCKET}

Step 7: Run the pipeline

gcloud builds submit \
--region $REGION \
--project $PROJECT_ID \
--substitutions=_MODEL_NAME=${LLM_MODEL},_DESTINATION_BUCKET=${MODELS_BUCKET_NAME} \
--config=cloudbuild.yaml
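
Once the build completes, you can quickly check that the weights landed in the bucket (the exact layout under hub/ depends on the model you downloaded):

gcloud storage ls --recursive gs://${MODELS_BUCKET_NAME} | head -n 20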

Step 8: Be sure you have GPU quotas for Cloud Run

Before deploying your Cloud Run service, check your Cloud Run GPU quotas.

Cloud Run GPU quotas

Step 9: Deploy the Cloud Run service

Now it’s time to deploy! Don’t forget to mount the GCS volume and set the minimum replicas to 0 to save costs when the model isn’t in use.

gcloud beta run deploy $RUN_NAME \
--project $PROJECT_ID \
--image ${REGION}-docker.pkg.dev/${PROJECT_ID}/${DOCKER_REPO_NAME}/vllm \
--execution-environment gen2 \
--cpu 8 \
--memory 32Gi \
--gpu 1 --gpu-type=nvidia-l4 \
--region $REGION \
--service-account $SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com \
--no-allow-unauthenticated \
--concurrency 20 \
--min-instances 0 \
--max-instances 3 \
--no-cpu-throttling \
--add-volume=name=vllm_mount,type=cloud-storage,bucket=$MODELS_BUCKET_NAME \
--add-volume-mount volume=vllm_mount,mount-path=/mnt/hf_cache \
--set-env-vars=HF_HOME=/mnt/hf_cache,MODEL_NAME=$LLM_MODEL \
--labels=$LABELS \
--timeout=60

If you want to optimize model loading from the GCS volume, and your network configuration supports it, here is a cool trick (thanks Wietse Venema):

Make sure you have Private Google Access enabled so the subnet can reach the Cloud Storage APIs privately (here’s the link with the instructions, which can vary depending on your network configuration).

Then add these three flags to the deploy command to enable Direct VPC egress:

--network=YOUR_VPC \
--subnet=YOUR_SUBNET \
--vpc-egress=all-traffic \

This network optimization can improve throughput by up to 10x 🤯.

Step 10: Check that vLLM is loading the model from the GCS volume

Check the Cloud Run logs: is the model loaded correctly?
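
If you prefer the shell over the console, something like the following (a sketch using a Cloud Logging filter) dumps the most recent logs of the service:

gcloud logging read \
  "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"${RUN_NAME}\"" \
  --project $PROJECT_ID --limit 50 --format='value(textPayload)'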

Note: Model loading can take some time, resulting in a slow cold start if your minimum replica count is zero. Therefore, depending on your requirements, it might be more beneficial to set the minimum replicas to 1 and then scale down to zero automatically during weekends or off-peak hours.

Cloud Run logs show that the model is being loaded from GCS volume mount

Step 11: Connect to the Cloud Run service

You can create a tunnel between your machine and the Cloud Run service by running the following command:

gcloud run services proxy $RUN_NAME --region $REGION --project $PROJECT_ID
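
Alternatively, since the service was deployed with --no-allow-unauthenticated, you can skip the proxy and call the service URL directly with an identity token (assuming your account has the run.invoker role on the service). A quick sketch that lists the models vLLM is serving:

SERVICE_URL=$(gcloud run services describe $RUN_NAME --region $REGION --project $PROJECT_ID --format='value(status.url)')

curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" ${SERVICE_URL}/v1/models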

Step 12: Ask something to the model

Now that you have vLLM exposed to your machine, you can send OpenAI-compatible HTTP requests to test that everything is working as expected:

curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"prompt": "Google Cloud Run is a",
"max_tokens": 128,
"temperature": 0.90
}'

If you see a result like the following, then congratulations! Everything is working fine 👏

vLLM response
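
The vLLM server also exposes the OpenAI chat endpoint, so if you prefer a chat-style interaction (relying on the chat template that ships with the distilled DeepSeek model), you can try something like:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
      {"role": "user", "content": "Explain Cloud Run scale-to-zero in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'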

Conclusions

Cloud Run does not require pre-provisioning infrastructure to accommodate anticipated peak usage. Indeed, you pay only for the resources you use, including the GPU!

Cloud Run GPU pricing

Given vLLM’s high throughput, this means that a single active GPU can serve numerous concurrent users, significantly reducing the per-user cost of your private AI deployment.

Richard He (highly recommend his channel — link below) illustrates how this approach delivers a ChatGPT-like experience at a fraction of the cost of a ChatGPT subscription, making this solution highly competitive, particularly for organizations prioritizing data privacy and wishing to avoid transmitting sensitive information over the internet.

What do you think? I’d love to hear your thoughts, and please don’t hesitate to reach out if you have any questions at all!

🔗 Useful links & References

