FastAPI on GKE with GPU: A Step-by-Step Guide

Saverio Mazza
5 min read · Feb 5, 2024


Google Cloud Run does not natively support GPUs, so to run a FastAPI application that needs GPU acceleration you would instead deploy it to Google Kubernetes Engine (GKE), which does support GPUs. Here’s a general outline of how you could do this:

Set up Google Kubernetes Engine (GKE)

  • Create a GKE cluster in your Google Cloud project. Make sure to configure the cluster with GPU support. This involves selecting the appropriate machine types and enabling the necessary APIs.
  • Install the NVIDIA GPU device drivers on the cluster’s nodes; GKE does not pre-install them, but Google provides a driver-installer DaemonSet you can apply (a sketch of both steps follows this list).
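
As a rough sketch, the cluster and a GPU node pool can be created with gcloud, and the drivers installed with Google’s driver-installer DaemonSet. The cluster name, zone, machine type, and GPU type below (my-gpu-cluster, us-central1-a, n1-standard-4, nvidia-tesla-t4) are placeholders; substitute values that fit your project and quota.

# Create a small cluster (CPU-only default node pool)
gcloud container clusters create my-gpu-cluster \
    --zone us-central1-a \
    --machine-type n1-standard-4 \
    --num-nodes 1

# Add a node pool with one NVIDIA T4 GPU per node
gcloud container node-pools create gpu-pool \
    --cluster my-gpu-cluster \
    --zone us-central1-a \
    --machine-type n1-standard-4 \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --num-nodes 1

# Point kubectl at the new cluster, then install the NVIDIA drivers
# on Container-Optimized OS nodes via Google's DaemonSet
gcloud container clusters get-credentials my-gpu-cluster --zone us-central1-a
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml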

Containerize your FastAPI application

Create a Docker container for your FastAPI application. Ensure that your Dockerfile includes all the necessary dependencies, including those for GPU usage.

# Use an official Python runtime with NVIDIA CUDA support
FROM nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04

WORKDIR /app

# Copy your application files
COPY . /app

# Install Python and FastAPI dependencies
RUN apt-get update && apt-get install -y python3-pip && \
    pip3 install --no-cache-dir -r requirements.txt

# Use the exec form so uvicorn receives shutdown signals directly
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and push your Docker image to a container registry, such as Google Container Registry (GCR).
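
For example, assuming a project ID of PROJECT_ID and an image name of fastapi-gpu (both placeholders), the image can be built and pushed either locally with Docker or via Cloud Build:

# Option A: build locally and push (run `gcloud auth configure-docker` once first)
docker build -t gcr.io/PROJECT_ID/fastapi-gpu:v1 .
docker push gcr.io/PROJECT_ID/fastapi-gpu:v1

# Option B: let Cloud Build build and push the image for you
gcloud builds submit --tag gcr.io/PROJECT_ID/fastapi-gpu:v1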

Deploy to GKE

  • Write a Kubernetes Deployment manifest that specifies your Docker image and the GPU resources it needs; this means setting the appropriate GPU resource limits (a sample manifest follows this list).
  • Deploy your application to the GKE cluster using the kubectl command-line tool.
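
A minimal Deployment manifest might look like the sketch below. The names and image path are the placeholders used earlier; the key detail is the nvidia.com/gpu resource limit, which is what schedules the pod onto a GPU node.

# Apply a minimal GPU-backed Deployment (names and image are placeholders)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fastapi-gpu
  template:
    metadata:
      labels:
        app: fastapi-gpu
    spec:
      containers:
      - name: fastapi-gpu
        image: gcr.io/PROJECT_ID/fastapi-gpu:v1
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1   # request one GPU for the pod
EOF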

Expose your FastAPI application

  • Expose your application to the internet by creating a Kubernetes Service of type LoadBalancer. This provides an external IP address for reaching your FastAPI application (a sample Service manifest follows this list).
  • Alternatively, you could use an Ingress controller for more advanced routing and load balancing features.
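
A LoadBalancer Service for the Deployment above could look like this sketch; exposing port 80 in front of the container’s port 8000 is just a convention, and EXTERNAL_IP stands in for the address GKE assigns.

# Expose the Deployment behind an external load balancer
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: fastapi-gpu
spec:
  type: LoadBalancer
  selector:
    app: fastapi-gpu
  ports:
  - port: 80
    targetPort: 8000
EOF

# Wait for EXTERNAL-IP to be assigned, then hit the FastAPI docs page
kubectl get service fastapi-gpu --watch
curl http://EXTERNAL_IP/docs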

Monitor and manage

  • Utilize Google Cloud’s monitoring and logging tools to keep track of your application’s performance and troubleshoot any issues.
  • Manage scaling and updates as needed, based on your application’s usage and requirements (a basic example follows this list).
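
As a starting point, you could pair basic log inspection with a simple CPU-based HorizontalPodAutoscaler. The deployment name fastapi-gpu is the placeholder from earlier; note that new replicas will only schedule if the GPU node pool has spare capacity or can autoscale.

# Tail recent application logs and check rollout health
kubectl logs -l app=fastapi-gpu --tail=50
kubectl rollout status deployment/fastapi-gpu

# Scale between 1 and 4 replicas based on average CPU utilization
kubectl autoscale deployment fastapi-gpu --min=1 --max=4 --cpu-percent=70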

This approach trades the fully managed, serverless model of Google Cloud Run for the GPU capabilities of GKE, while still keeping much of Google Cloud’s scalability and management tooling. Keep in mind that running Kubernetes requires some expertise, so you may need to familiarize yourself with Kubernetes concepts and GKE specifics.

When setting up a Google Kubernetes Engine (GKE) cluster with GPU support, it is important to understand the distinction between serverless and traditional managed services. GKE is a managed Kubernetes service, but it is not serverless in the way Google Cloud Run or Google Cloud Functions are: it manages Kubernetes clusters that run on Compute Engine instances.

In the context of GKE with GPU support:

  1. Compute Engine Instances: GKE uses Compute Engine instances for its nodes. When you set up a GKE cluster, especially with GPUs, you are essentially configuring these instances to suit your requirements (such as choosing the right machine type and GPU type).
  2. Auto-Scaling: GKE offers auto-scaling, which automatically adjusts the number of nodes in your cluster based on workload. This means that while the instances do not automatically shut off when not in use, they can scale down to a minimum number of nodes you specify, helping to manage costs and resources efficiently.
  3. Managed but not Serverless: GKE provides a managed Kubernetes environment, handling tasks like cluster creation, scaling, and upgrades. However, unlike serverless platforms where you don’t manage the underlying servers at all, with GKE, you still have some level of control and responsibility over the cluster configuration and scaling.
  4. GPU Utilization: For GPU-intensive tasks, serverless options are limited and might not provide the necessary control over hardware. GKE with GPU support offers a more tailored solution, as it allows for specific GPU configurations which are essential for certain types of workloads, like machine learning or data processing tasks.

Using GKE with GPU support means managing a Kubernetes cluster with nodes that have GPU capabilities. It offers more control and customization compared to serverless options, but also requires managing aspects like scaling and instance types. It’s not serverless in the traditional sense, but it does offer managed services to ease Kubernetes and GPU utilization.

To achieve a serverless-like experience with GPU support, where you don’t pay for resources when they’re not in use, you can explore a few options. However, true serverless computing with GPUs is complex and not always directly available, given how GPU hardware has to be provisioned. Here are some approaches:

  1. GKE Autopilot Mode: Google Kubernetes Engine (GKE) Autopilot is a hands-off approach to managing Kubernetes clusters. It automatically manages and scales the underlying infrastructure. While it is not purely serverless, it simplifies much of the overhead involved in cluster management. You can set up an Autopilot cluster with GPU-enabled nodes, but be aware that the cost model still involves paying for the resources provisioned.
  2. Use Preemptible VMs in GKE: Preemptible VMs are short-lived instances that can be used in GKE. They are much cheaper than regular instances but can be terminated at any time. You can configure a node pool in GKE with preemptible VMs that have GPUs (a sketch of such a node pool follows this list). This way, you can reduce costs significantly, though it’s not exactly “pay-for-what-you-use” like in serverless models.
  3. Custom Serverless Solution: Implement a custom solution where GPU workloads are scheduled on-demand on Compute Engine instances with GPUs. You can use Cloud Functions or App Engine to trigger these instances, run the necessary computations, and then shut them down. This requires a more complex setup and careful management to ensure you are only running (and paying for) these instances when needed.
  4. Third-Party Solutions: There might be third-party platforms or services that offer a more serverless-like experience with GPU support. These services could provide an API to run GPU workloads without managing the underlying infrastructure.
  5. Regular Monitoring and Autoscaling: Implement rigorous monitoring and autoscaling strategies to scale down to zero or minimum instances when GPU resources are not in use. This requires a carefully configured environment that can automatically scale up and down based on workload demands.
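
To illustrate options 2 and 5, a preemptible GPU node pool that the cluster autoscaler can shrink to zero nodes when idle might be created like this; the cluster name, zone, and GPU type are the same placeholders as before.

# Preemptible GPU node pool that can autoscale down to zero nodes
gcloud container node-pools create gpu-spot-pool \
    --cluster my-gpu-cluster \
    --zone us-central1-a \
    --machine-type n1-standard-4 \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --preemptible \
    --enable-autoscaling --min-nodes 0 --max-nodes 2 \
    --num-nodes 0

When no pods request nvidia.com/gpu, the autoscaler removes the GPU nodes and you stop paying for them; a new GPU pod triggers a scale-up, at the cost of a cold-start delay of a few minutes.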

Achieving a serverless model with GPU support involves trade-offs and often requires a more hands-on approach to manage and optimize costs. It’s about finding the right balance between cost, performance, and ease of management for your specific GPU workloads.
