Cloud Run GPU: Make your LLMs serverless
Managed services, and especially serverless services, have transformed the way developers design, build, and optimize applications. Developers can now focus on code and business value, while cloud providers manage the infrastructure: resource availability, patching, scaling, and backups.
My favorite serverless service on Google Cloud is Cloud Run. It allows you to run a container at scale with just a few commands. It comes in two flavors:
- Services, for serving real-time requests for websites or APIs
- Jobs, for batch processing and long-running operations
Excitingly, other flavors are on the horizon! Stay tuned!
The strength of serverless services lies in auto-provisioning. I never have to worry about server creation or updates. However, until recently, I was limited to CPU-only processing, with no support for extra hardware.
But this summer, things changed!
Cloud Run services now support GPUs!
The Rise of Open-Source LLMs
With the growing popularity of generative AI, open-source LLMs are becoming more efficient and widespread. A big part of their appeal is that they can be hosted on-premises, without relying on external third parties like Google or OpenAI.
There are real-world use cases where open-source LLMs are essential:
- When companies want to prevent data exfiltration to third parties and maintain full control over their genAI components.
- When companies want to fine-tune LLMs for their specific needs, and then host them themselves.
- When companies want to eliminate third-party limitations, quotas, and costs.
The Challenge of Scaling LLMs
However, with the increasing popularity of LLMs, GPUs have become a scarce and expensive resource, as they are essential for optimizing response latency.
Consequently, when you acquire a GPU on a cloud platform, the strategy is often to keep it running continuously, even when idle, to ensure its availability. But this approach has drawbacks.
Firstly, it’s financially costly to run a GPU 24/7, especially if you overprovision to handle peak activity.
Secondly, it’s environmentally unsustainable, as you’re consuming power and cooling resources unnecessarily, preventing other users from using them more efficiently.
Cloud Run GPU: A Game Changer
Cloud Run GPU addresses these challenges. You no longer need to overprovision or keep machines with GPUs running constantly. It’s an on-demand service that automatically scales from 0 to multiple instances (currently 7 in private preview, with more to come).
Thanks to this design, you only use and pay for GPUs when needed, scaling up when required and scaling down to 0 when idle. It’s all managed by Cloud Run, and I’m a big fan!
The power of GPU for LLMs
To demonstrate this, I deployed Google's latest small and efficient open-source LLM: Gemma 2 (the 2B version).
I used Ollama to run Gemma on Cloud Run with and without GPUs. Ollama conveniently adapts the runtime to the available hardware and comes with all the latest drivers pre-installed, saving me a lot of hassle!
Container Content
For this experiment, I kept things simple to validate the value of GPUs on Cloud Run.
Here is the main.py Python code serving the Cloud Run service. The webservice gets the prompt from the JSON body (in the prompt JSON key) and sends the LLM response back:
import os

from flask import Flask, request
from llama_index.llms.ollama import Ollama

# LlamaIndex client pointing to the local Ollama server running in the same container
llm = Ollama(model="gemma2:2b")

app = Flask(__name__)

@app.route('/', methods=['POST'])
def call_function():
    # Read the prompt from the "prompt" key of the JSON body
    body = request.get_json(force=True)
    prompt = body['prompt']
    # Ask Gemma 2 for a completion and return it as plain text
    response = llm.complete(prompt)
    return f"{response}"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
This code statically uses the gemma2:2b version but can be made dynamic with environment variables if needed.
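For instance, a minimal sketch reading a hypothetical MODEL_NAME environment variable (not part of the original code) could look like this:

import os
from llama_index.llms.ollama import Ollama

# MODEL_NAME is a hypothetical environment variable; gemma2:2b stays the default
llm = Ollama(model=os.environ.get("MODEL_NAME", "gemma2:2b"))

The value could then be set at deploy time with the --set-env-vars flag of gcloud run deploy.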
This code is packaged in a container. I started from the official Ollama image to get the default configuration and the pre-installed NVIDIA drivers.
Then, I pulled the gemma2:2b model and installed Python to run the code.
FROM ollama/ollama

# Start Ollama in the background and bake the gemma2:2b model into the image
RUN bash -c "ollama serve &" && sleep 4 && ollama pull gemma2:2b

# Install Python and the webservice dependencies
RUN apt-get update && apt-get install -y python3 pip
WORKDIR /app
RUN pip install --no-cache-dir flask ollama llama_index llama-index-llms-ollama

ENV PORT 8080
COPY . .

# At runtime, start Ollama in the background again, then serve the Flask app
ENTRYPOINT bash -c "ollama serve &" && sleep 4 && python3 main.py
Note that Ollama runs as a webservice, so I have to start it in the background to interact with it: once at build time to pull the model, and again in the entrypoint at runtime.
The deployment
With the code complete, you need to build and deploy the container. I personally use Cloud Build:
gcloud builds submit --tag gcr.io/<project-id>/ollama/gemma2 .
Next, deploy the container using gcloud CLI version 488 or later (for version 488, the GPU options are only available in the beta CLI).
You can deploy with or without a GPU; Ollama’s code adapts to the available hardware. Remember, it’s pay-as-you-use, so multiple services won’t cost more if they’re not used.
# CPU only
gcloud run deploy ollama-gemma2 --image gcr.io/<project-id>/ollama/gemma2 \
--platform managed --region us-central1 --allow-unauthenticated \
--memory 16Gi --cpu 4 --timeout 600s --execution-environment=gen2
# With GPU
gcloud beta run deploy ollama-gemma2-gpu --image gcr.io/<project-id>/ollama/gemma2 \
--platform managed --region us-central1 --allow-unauthenticated \
--memory 16Gi --cpu 4 --timeout 600s --execution-environment=gen2 \
--gpu=1 --no-cpu-throttling --gpu-type=nvidia-l4 --max-instances=1
Deployment specifics:
- Ollama requires at least 4GB of memory, but Cloud Run’s GPU option needs 16GB. I set both to 16GB for performance comparison.
- To keep responses under 30 seconds (Ollama’s default timeout), I set 4 CPUs, which matters especially for CPU-only processing.
- The max instance parameter, required for GPU, must be 7 or lower (a preview limitation).
- Only us-central1 currently supports GPUs (a preview limitation).
- Only gen2 works with GPU and Ollama (gen1 is sandboxed, limiting CPU feature use).
- For testing, I allowed unauthenticated connections to my Cloud Run service. Do not do the same in production; see the authenticated call example below.
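If you keep authentication enabled instead, the service can still be called with an identity token. Here is a minimal sketch, assuming the caller has the Cloud Run Invoker role on the GPU service deployed above:

# Retrieve the URL of the deployed service
SERVICE_URL=$(gcloud run services describe ollama-gemma2-gpu \
  --region us-central1 --format 'value(status.url)')

# Call it with an identity token instead of allowing unauthenticated access
curl -X POST -d '{"prompt":"tell me 1"}' \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  "$SERVICE_URL"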
Performance tests
I used a simple prompt with few tokens to avoid hitting Ollama’s 30-second timeout.
# CPU only
curl -X POST -d '{"prompt":"tell me 1"}' \
https://ollama-gemma2-<project hash>-uc.a.run.app
# With GPU
curl -X POST -d '{"prompt":"tell me 1"}' \
https://ollama-gemma2-gpu-<project hash>-uc.a.run.app
Responses vary, but that’s not the focus here. Feel free to experiment with different prompts.
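To compare the two services more systematically, here is a minimal sketch that times the calls with Python and the requests library, keeping the <project hash> placeholders from the curl commands above:

import time

import requests

# Placeholder URLs: replace <project hash> with your own Cloud Run service hashes
SERVICES = {
    "cpu": "https://ollama-gemma2-<project hash>-uc.a.run.app",
    "gpu": "https://ollama-gemma2-gpu-<project hash>-uc.a.run.app",
}

def time_request(url: str, prompt: str) -> float:
    """Send one prompt and return the end-to-end latency in seconds."""
    start = time.perf_counter()
    requests.post(url, json={"prompt": prompt}, timeout=600)
    return time.perf_counter() - start

for name, url in SERVICES.items():
    # The first call includes model loading; later calls show cached behavior
    latencies = [time_request(url, "tell me 1") for _ in range(3)]
    print(name, [f"{latency:.1f}s" for latency in latencies])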
Here are the performance results:
- CPU only:
  - First request (load model): 25s average
  - First prompt (without cache): 12.5s average
  - Cached prompt: <1s average
- With GPU:
  - First request (load model): 4.5s average
  - First prompt (without cache): 1.3s average
  - Cached prompt: <1s average
Cost of extra hardware
Adding an extra component on Cloud Run incurs additional costs. Currently, the price for GPUs in tier-1 regions is $0.000233 per second and per GPU (this is subject to change).
The main cost driver isn’t the GPU itself, but the --no-cpu-throttling parameter required during deployment. This parameter significantly affects Cloud Run costs.
Normally, Cloud Run bills for CPU and memory only while a request is being processed (rounded up to the nearest 100ms). Outside of request processing, resources are throttled and not billed.
However, --no-cpu-throttling keeps resources active even when no requests are being processed, right up until the instance is deleted (after 15 minutes of inactivity). This means you’re charged for resources even when they’re not actively handling requests.
More details are available in the Cloud Run pricing documentation.
The Cloud Run pricing table leads us to this comparison:
- CPU only: (CPU ($0.000024) + memory ($0.0000025)) * request duration ≈ $0.00033125 per request
- With GPU: (CPU ($0.000018) + memory ($0.0000020) + GPU ($0.000233)) * request duration ≈ $0.0003289 per request
While --no-cpu-throttling comes with 25% cheaper CPU and 20% cheaper memory, the overall cost per request remains similar: the GPU is 10 times more expensive than the CPU, but it also processes requests 10 times faster.
The overhead comes from the 15 minutes of idle time before the instance shuts down, which costs about $0.23 for an instance with one GPU. However, GPUs offer significantly lower latency (about 1s on average, compared to roughly 10s with CPU).
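To make these figures concrete, here is a small sketch of the arithmetic, assuming (as in the comparison above) that the per-second unit prices are applied to the measured first-prompt durations and to the 15-minute idle window:

# Per-second unit prices in tier-1 regions (subject to change)
CPU_PRICE, CPU_PRICE_THROTTLED = 0.000018, 0.000024
MEM_PRICE, MEM_PRICE_THROTTLED = 0.0000020, 0.0000025
GPU_PRICE = 0.000233

# Measured average first-prompt durations, in seconds
cpu_only = (CPU_PRICE_THROTTLED + MEM_PRICE_THROTTLED) * 12.5
with_gpu = (CPU_PRICE + MEM_PRICE + GPU_PRICE) * 1.3

# 15 minutes of idle time before the GPU instance shuts down
idle_overhead = (CPU_PRICE + MEM_PRICE + GPU_PRICE) * 15 * 60

print(f"CPU only:  ${cpu_only:.8f} per request")              # ~$0.00033125
print(f"With GPU:  ${with_gpu:.8f} per request")              # ~$0.00032890
print(f"Idle overhead: ${idle_overhead:.2f} per scale-down")  # ~$0.23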
Think about the user experience: chatting with a bot that answers in 1s vs. one that takes 10s.
GPU instances also allow roughly 10x more requests to be processed per instance.
Leverage the power of GPU
GPU support in Cloud Run is a game-changer for unpredictable workloads requiring GPUs, such as serving open-source LLMs. Cloud Run automatically scales the number of instances to match workload demands.
What GPU-powered workload will you run on Cloud Run?