Supercharging Cloud Run with GPU Power: A New Era for AI Workloads

Rahul Kumar Singh
Google Cloud - Community
7 min read · Aug 21, 2024

Google Cloud Platform (GCP) is poised to revolutionize cloud computing by introducing GPU functionality to Cloud Run. This significant update represents a paradigm shift for developers and enterprises seeking to leverage the power of machine learning and artificial intelligence (AI) in their applications. The integration of NVIDIA L4 GPUs into Cloud Run opens up unprecedented possibilities for deploying resource-intensive workloads, such as Large Language Models (LLMs), making them more accessible and efficient than ever before.

In this comprehensive guide, we’ll explore:

1. The new GPU feature in Cloud Run

2. The advantages of using NVIDIA L4 GPUs

3. How to deploy an AI-powered backend service

4. Creating a frontend application to utilize this new capability

[Image] Source: Google Cloud

NVIDIA L4 GPUs: Powering the Future of Cloud Run

The introduction of NVIDIA L4 GPUs to Cloud Run marks a significant advancement for developers working with AI and machine learning models. These GPUs are specifically engineered for inference workloads, offering a substantial boost in performance while maintaining energy efficiency.

Key Features of NVIDIA L4 GPUs:

1. Performance: NVIDIA L4 GPUs deliver enhanced performance for AI and machine learning workloads, particularly in inference tasks. This makes them ideal for running large language models and other complex neural networks in real-time.

2. Energy Efficiency: Designed with sustainability in mind, L4 GPUs consume less power, reducing operational costs and aligning with eco-friendly computing practices.

3. Versatility: These GPUs are optimized for a wide range of workloads, including video processing, AI inferencing, and rendering tasks. This versatility ensures that developers can easily deploy a broad spectrum of applications on Cloud Run.

Libraries

By default, all of the NVIDIA L4 driver libraries are mounted. If you want to mount only a subset of the drivers, you can use the NVIDIA environment variable NVIDIA_DRIVER_CAPABILITIES.
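As a minimal sketch (the service name gemma-backend is a placeholder), this variable can be set like any other Cloud Run environment variable:

# NVIDIA_DRIVER_CAPABILITIES takes a comma-separated list such as "compute,utility";
# when unset, all driver libraries are mounted.
gcloud run services update gemma-backend \
  --region us-central1 \
  --update-env-vars NVIDIA_DRIVER_CAPABILITIES=compute,utility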

Cloud Run GPU Integration: Fully Managed and On-Demand

GPU support on Cloud Run is fully managed, eliminating the need for additional drivers or libraries. This feature offers on-demand availability without requiring reservations, similar to on-demand CPU and memory allocation in Cloud Run. Key benefits include:

  • Instances of a Cloud Run service configured to use GPU can scale down to zero when not in use, optimizing cost efficiency (see the scaling sketch after this list).
  • Cloud Run instances with an attached L4 GPU can start in approximately 5 seconds, allowing the container process to utilize the GPU almost immediately.
  • The ability to host both the inference engine and the service frontend that accesses it within a single Cloud Run service deployment, which can be optimal for both pre-built inference engines and those trained elsewhere.
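As a minimal sketch of how these scaling behaviors are controlled, the minimum and maximum instance counts can be set on an existing service; the service name gemma-backend is a placeholder for the backend deployed later in this post:

# Assumption: gemma-backend is an existing GPU-enabled Cloud Run service in us-central1.
# --min-instances 0 lets the service scale to zero when idle;
# --max-instances caps how far it scales out under load.
gcloud run services update gemma-backend \
  --region us-central1 \
  --min-instances 0 \
  --max-instances 2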

Supported Regions and Configuration

At the time of writing, GPU support is available in the following regions:

  • us-central1 (Iowa) (Low CO2)
  • europe-west4 (at GA)
  • asia-southeast1 (at GA)

Additional regions are planned for support post-general availability (GA).

Pricing Considerations

When utilizing GPU functionality in Cloud Run, consider the following pricing factors:

  • There are no per-request fees.
  • CPU must be configured as “always allocated” to use the GPU feature.
  • Minimum instances are charged at the full rate, even when idle.
  • GPU is billed for the entire duration of the instance lifecycle.

Configuration Notes:

  • One GPU can be configured per Cloud Run instance.
  • For services using sidecar containers, only one container in the service deployment has access to the GPU.
  • A minimum of 4 CPU cores and 16 GiB of memory is required, with 8 CPU cores and 32 GiB of memory recommended for optimal performance.

Required IAM Roles

To configure and deploy Cloud Run services with GPU, you need the following IAM roles (a sketch of the corresponding grant commands follows the list):

  • Cloud Run Developer (roles/run.developer)
  • Service Account User (roles/iam.serviceAccountUser)
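As a sketch, both roles can be granted with gcloud; $PROJECT_ID and the member address are placeholders for your own values:

# Placeholders: substitute your own project ID and user or service account address.
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="user:dev@example.com" \
  --role="roles/run.developer"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="user:dev@example.com" \
  --role="roles/iam.serviceAccountUser"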

Deploying a Backend Service on Cloud Run with Ollama’s Gemma 2 Model

The introduction of GPU functionality in Cloud Run simplifies the deployment of AI models. Here’s a step-by-step guide to deploying a backend Cloud Run service that utilizes the Ollama Gemma 2 model with an NVIDIA L4 GPU:

1. Set Up Cloud Run with GPU:

  • Configure your Cloud Run service to use the NVIDIA L4 GPU through the Google Cloud Console.
  • Select the appropriate GPU option during the service creation process.
The GPU option must be selected, and you will then be asked to enter the number of GPUs.

2. Deploy Ollama’s Gemma 2 Model:

  • Create a Flask application that interfaces with the Ollama client to generate responses using the Gemma 2 model.
  • Use a Dockerfile that includes both Ollama and the necessary Python dependencies.
# app.py
from flask import Flask, request, jsonify
from flask_cors import CORS
import ollama

app = Flask(__name__)
CORS(app)

# Initialize Ollama client (defaults to the local server started by "ollama serve")
client = ollama.Client()

@app.route('/generate', methods=['POST'])
def generate_text():
    data = request.json
    prompt = data.get('prompt', '')

    try:
        # Generate a response using the Gemma 2 2B model
        response = client.generate(model='gemma2:2b', prompt=prompt)
        return jsonify({'response': response['response']})
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
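Once the Flask app is running, the /generate endpoint can be exercised with a simple POST request; this is a local test, and for the deployed service the Cloud Run URL (plus an auth header) would replace localhost:8080:

# Local smoke test for the /generate endpoint defined in app.py above.
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain Cloud Run GPUs in one sentence."}'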

# Dockerfile
FROM ollama/ollama:latest
# The base image is Ubuntu-based, so install Python and pip with apt-get
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
# Install Flask, flask-cors, and the Ollama Python client
RUN pip3 install Flask flask-cors ollama
# Pre-pull the Gemma 2 2B weights so the model ships inside the image
RUN ollama serve & sleep 5 && ollama pull gemma2:2b
# Copy the application code
COPY app.py /app/app.py
# Set the working directory
WORKDIR /app
# The base image's entrypoint is /bin/ollama; clear it so CMD runs as a shell command
ENTRYPOINT []
# Run the Ollama server in the background and start the Flask app
CMD ollama serve & python3 app.py

3. Deploy and Scale:

  • Deploy your backend service to Cloud Run, ensuring that GPU is enabled for the service.
  • Cloud Run will automatically scale your application based on demand, handling varying levels of requests efficiently while maintaining responsiveness and performance.
# Option 1: deploy directly from source
gcloud run deploy gemma-backend \
  --source . \
  --region <your-region> \
  --platform managed \
  --cpu 4 \
  --memory 16Gi \
  --no-cpu-throttling \
  --gpu 1 \
  --gpu-type nvidia-l4

# Option 2: deploy a pre-built image from Artifact Registry
gcloud alpha run deploy gemma-backend \
  --image=us-central1-docker.pkg.dev/$PROJECT_ID/$AR_CODELAB_REPO/$OLLAMA_IMAGE_NAME:latest \
  --service-account=$SA_OLLAMA_ADDRESS \
  --region=us-central1 \
  --cpu=4 \
  --memory=16Gi \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --no-allow-unauthenticated \
  --execution-environment=gen2 \
  --max-instances=2 \
  --timeout=600
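Because the service is deployed with --no-allow-unauthenticated, requests need an identity token. A quick smoke test might look like the following, with <backend_service_url> standing in for the URL that Cloud Run prints after deployment:

# <backend_service_url> is the URL reported by the deploy command above.
curl -X POST https://<backend_service_url>/generate \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, Gemma!"}'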

Creating a Frontend Service with Node.js and Ollama-js

To complement your backend service, deploy a frontend service written in Node.js that interacts with your AI model:

1. Build the Frontend with Node.js:

  • Create a Node.js application using Express to serve as the user interface for interacting with your AI model.
  • Implement a simple HTML interface for users to input prompts and view generated responses.

2. Integrate Ollama-js:

  • Utilize an HTTP client such as axios (as in the example below) or the ollama-js library to facilitate communication between the frontend and backend services.
  • Implement an API endpoint in your Node.js application to handle requests to the Gemma 2 model.
// app.js
const express = require('express');
const axios = require('axios');

const app = express();
const port = process.env.PORT || 3000;

// Backend service URL from environment variable
const BACKEND_URL = process.env.BACKEND_URL || 'http://localhost:8080';

app.use(express.json());
app.use(express.static('public'));

app.post('/generate', async (req, res) => {
  try {
    const { prompt } = req.body;
    // Make a request to the backend service
    const response = await axios.post(`${BACKEND_URL}/generate`, { prompt });
    res.json(response.data);
  } catch (error) {
    console.error('Error calling backend:', error.message);
    res.status(500).json({ error: 'Failed to generate response' });
  }
});

app.listen(port, () => {
  console.log(`Frontend server listening at http://localhost:${port}`);
  console.log(`Connecting to backend at ${BACKEND_URL}`);
});
# Dockerfile
FROM node:14
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
EXPOSE 3000
CMD ["npm", "start"]
// package.json
{
  "name": "gemma-2-frontend",
  "version": "1.0.0",
  "description": "Frontend for Gemma 2 demo",
  "main": "app.js",
  "scripts": {
    "start": "node app.js"
  },
  "dependencies": {
    "express": "^4.17.1",
    "axios": "^0.21.1"
  }
}
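To produce the image referenced in the deploy command below, the frontend can be built and pushed with Cloud Build; the variables mirror the placeholders used for the backend:

# Assumes the Artifact Registry repository referenced by $AR_CODELAB_REPO already exists.
gcloud builds submit \
  --tag us-central1-docker.pkg.dev/$PROJECT_ID/$AR_CODELAB_REPO/$FRONTEND_IMAGE_NAME:latest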

3. Deploy on Cloud Run:

  • Deploy your Node.js frontend service on Cloud Run.
  • Configure the frontend service to communicate with your backend service, ensuring secure and efficient communication between the two.
gcloud alpha run deploy gemma-frontend \
  --image=us-central1-docker.pkg.dev/$PROJECT_ID/$AR_CODELAB_REPO/$FRONTEND_IMAGE_NAME:latest \
  --service-account=$SA_FRONTEND_ADDRESS \
  --no-allow-unauthenticated \
  --region=us-central1 \
  --execution-environment=gen2 \
  --timeout=300 \
  --set-env-vars=BACKEND_URL=https://<backend_service_url>
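The <backend_service_url> placeholder is the HTTPS URL of the backend service deployed earlier; one way to look it up is:

# Prints the URL that Cloud Run assigned to the backend service.
gcloud run services describe gemma-backend \
  --region us-central1 \
  --format 'value(status.url)'

Note that because the backend does not allow unauthenticated access, the frontend's service account also needs the Cloud Run Invoker role (roles/run.invoker) on the backend, and its requests must carry an identity token.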

Benefits of Using Cloud Run with GPU for Large Language Models

Leveraging GPU functionality in Cloud Run for deploying large language models (LLMs) offers several distinct advantages:

1. Cost-Efficiency:

  • NVIDIA L4 GPUs significantly reduce the cost of running inference workloads.
  • The energy efficiency of these GPUs translates to lower operational expenses without compromising performance.

2. Scalability:

  • Cloud Run’s serverless nature allows automatic scaling based on demand.
  • Ensure AI-powered applications remain responsive even during peak usage times.

3. Simplicity:

  • Simplify the deployment and management of complex AI models.
  • Reduce the time and complexity required to bring applications to market.

4. Flexibility:

  • Support a wide range of workloads, from simple machine learning models to advanced AI applications.
  • Provide developers with the freedom to innovate without constraints.

Conclusion

The introduction of GPU functionality in Cloud Run, powered by NVIDIA L4 GPUs, is set to revolutionize the deployment and scaling of AI workloads. This advancement offers developers and enterprises the tools and scalability needed to succeed in the rapidly evolving landscape of AI and machine learning applications.

By integrating powerful models like Ollama’s Gemma 2 with a Node.js frontend, developers can create robust, efficient, and scalable applications that push the boundaries of what’s possible in cloud computing. As AI continues to evolve, Cloud Run with GPU functionality stands at the forefront of this exciting journey, enabling the creation of innovative solutions that can transform industries and user experiences.

The combination of low cold start latency, consistent serving performance under varying loads, and the ability to scale to zero during periods of inactivity makes Cloud Run GPUs an ideal choice for organizations looking to deploy cutting-edge AI applications. As we move forward, the potential for real-time inference, generative AI, and other advanced applications on Cloud Run is boundless, promising a future where sophisticated AI capabilities are more accessible and performant than ever before.

Rahul Kumar Singh
Google Cloud - Community

Staff @ SADA | Building Secure and Reliable solutions for the world | Football Freak