Deploy stable diffusion on GPU instance using FastAPI

Vishnu Subramanian · Published in Jarvislabs.ai · Oct 31, 2022

In this blog, let’s explore how we can deploy a Stable Diffusion model on a GPU and expose it as an API. To do this, we will use:

  • A GPU instance from Jarvislabs
  • FastAPI for developing the API
  • Gunicorn to run the FastAPI application

The demo code is available in the GitHub repo.


GPU instance

I will be using Jarvislabs GPU instances, which come with

  • CUDA, NVIDIA drivers, and PyTorch preinstalled.
  • Your favorite IDEs, like JupyterLab and VS Code, for building applications.
  • The ability to expose an API using FastAPI in a few clicks.

For this blog, let's use a PyTorch-powered instance.

Most of the steps below also apply to local machines or other cloud platforms.

Setting up environment

Let's install the additional software required for FastAPI, Stable Diffusion, and diffusers (the Hugging Face library).

Run the command below from a terminal.

pip install -r requirements.txt

This installs the following libraries:

fastapi==0.85.0
uvicorn==0.18.3
diffusers==0.6.0
gunicorn==20.1.0
boto3==1.24.90
transformers==4.23.1
ftfy==6.1.1

Expose the Stable Diffusion model as a REST API

Let's explore main.py from here.

It primarily contains code to:

Load the Stable Diffusion model
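Here is a minimal sketch of the loading code, assuming the CompVis/stable-diffusion-v1-4 checkpoint and its fp16 revision:

import torch
from diffusers import StableDiffusionPipeline

# Download the weights once at startup and move the model to the GPU.
# The fp16 revision gives smaller, faster-to-download float16 weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True,
)
pipe = pipe.to("cuda")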

The above code is responsible for downloading the model weights and pushing the model to the GPU. I am using float16 as it is faster to download and also runs faster on most modern GPUs. Ensure that you do not include this code inside your API handler, as it would create a new model on every request, which can fill your GPU memory in no time.

We want the model to be loaded once per application/worker.

In order to download the Stable Diffusion model, you need to accept the user agreement and be logged in. You can find/create the login token here.

huggingface-cli login

Expose the Stable Diffusion model as a REST API

We are creating a REST endpoint, genimage, which is responsible for accepting an input request containing prompt and guidance_scale. The model (pipe) uses the prompt text and guidance_scale values to generate the image.
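A minimal sketch of the endpoint, with GenImage as a pydantic request model (the default guidance_scale value and the upload_image helper, sketched further below, are assumptions):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenImage(BaseModel):
    prompt: str
    guidance_scale: float = 7.5  # default value is an assumption

@app.post("/genimage")
def genimage(req: GenImage):
    # pipe is the StableDiffusionPipeline loaded once per worker, above.
    image = pipe(req.prompt, guidance_scale=req.guidance_scale).images[0]
    # upload_image is a hypothetical helper that uploads the image
    # and returns its public URL (sketched in the next step).
    url = upload_image(image)
    return {"url": url}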

You can modify GenImage to include other model parameters, like:

  • height
  • width
  • seed
  • num_inference_steps

The generated image can be shared back to the client in multiple ways; in our case, we upload the image and return its URL as the response.
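For example, with boto3 (already in our requirements), a hypothetical upload_image helper could look like the sketch below; the bucket name and URL format are assumptions:

import io
import uuid
import boto3

s3 = boto3.client("s3")

def upload_image(image):
    # Serialize the PIL image to PNG bytes in memory.
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    buf.seek(0)
    # Upload to S3 under a unique key; the bucket name is an assumption.
    key = f"{uuid.uuid4()}.png"
    s3.upload_fileobj(buf, "my-images-bucket", key)
    return f"https://my-images-bucket.s3.amazonaws.com/{key}"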

Deploy the FastAPI application using Gunicorn

Once we are done with the code, we can use either uvicorn or Gunicorn to deploy the application.

Gunicorn lets us run the application with multiple workers and also queues incoming requests, so let's deploy our FastAPI app with Gunicorn.

gunicorn main:app --bind=0.0.0.0:6006 -w 4 -k uvicorn.workers.UvicornWorker --timeout 120

The port number 6006 could vary depending on where you are running the application.

-w 4: We are using 4 workers for demonstration; you may have to tweak this depending on available GPU memory, since each worker loads its own copy of the model and could take 8–10 GB for a float16 model.
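Once the server is up, you can test the endpoint with a small client script like the one below (assuming the requests library is installed; the prompt is just an example):

import requests

resp = requests.post(
    "http://localhost:6006/genimage",
    json={"prompt": "a photo of an astronaut riding a horse", "guidance_scale": 7.5},
)
print(resp.json())  # e.g. {"url": "https://..."}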

If you are trying this on Jarvislabs.ai, you can quickly get your API link by right-clicking on your running instance to use it in your application.

I hope you find the blog useful. In case you face any challenges, let me know in the comments.
