Deploy Stable Diffusion on a GPU instance using FastAPI
In this blog, let's explore how to deploy a Stable Diffusion model on a GPU and expose it as an API. To do this we will use:
- A GPU instance from Jarvislabs
- FastAPI for developing the API
- Gunicorn to run the FastAPI application
The demo code is available in the GitHub repo.
GPU instance
I will be using Jarvislabs GPU instances, which let you:
- Start with CUDA, NVIDIA drivers, and PyTorch preinstalled.
- Build applications using your favorite IDE, like JupyterLab or VS Code.
- Expose an API using FastAPI in a few clicks.
For this blog, let's use a PyTorch-powered instance.
Most of the steps below also apply to a local machine or other cloud platforms.
Setting up the environment
Let's install the additional libraries required for FastAPI, Stable Diffusion, and diffusers (a Hugging Face library).
Run the command below from a terminal.
pip install -r requirements.txt
This installs the following libraries:
fastapi==0.85.0
uvicorn==0.18.3
diffusers==0.6.0
gunicorn==20.1.0
boto3==1.24.90
transformers==4.23.1
ftfy==6.1.1
Expose the Stable Diffusion model as a REST API
Let's explore the main.py from here.
It primarily contains code to:
Load the Stable Diffusion model
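The loading code from main.py is not reproduced here, but a minimal sketch with the diffusers library could look like the following (the model id and the fp16 revision are assumptions based on the Stable Diffusion v1.4 release):

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the weights and move the model to the GPU.
# Loading in float16 halves the download size and speeds up
# inference on most modern GPUs.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # assumed model id
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
```

Note that this runs at module import time, so each worker loads the model exactly once.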
The above code downloads the model weights and moves the model to the GPU. I am using float16 as it is faster to download and also runs faster on most modern GPUs. Make sure you do not place this code inside your API handler; otherwise a new model is created on every request, which can fill your GPU memory in no time. We want the model to be loaded once per application/worker.
To download the Stable Diffusion model, you need to accept the user agreement and be logged in. You can find/create the token for login here:
huggingface-cli login
Expose the Stable Diffusion model as a REST API
We are creating a REST API genimage, which accepts an input request containing prompt and guidance_scale. The model (pipe) uses the prompt text and the guidance_scale value to generate the image.
You can modify the GenImage request model to include other model parameters such as:
- height
- width
- seed
- num_inference_steps
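One way to extend the request model with those fields (the defaults below are illustrative assumptions, not values from the post):

```python
from typing import Optional

from pydantic import BaseModel

class GenImage(BaseModel):
    prompt: str
    guidance_scale: float = 7.5
    height: int = 512
    width: int = 512
    seed: Optional[int] = None
    num_inference_steps: int = 50

def generate(req: GenImage, pipe):
    import torch  # imported lazily so the request model stays lightweight

    # A fixed seed makes generations reproducible across requests.
    generator = None
    if req.seed is not None:
        generator = torch.Generator("cuda").manual_seed(req.seed)
    return pipe(
        req.prompt,
        guidance_scale=req.guidance_scale,
        height=req.height,
        width=req.width,
        num_inference_steps=req.num_inference_steps,
        generator=generator,
    ).images[0]
```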
The generated image can be shared back to the client in multiple ways; in our case, we upload the image and return its URL in the response.
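For instance, the upload could use S3 via boto3 (which is already in the requirements); the bucket name, key scheme, and URL format below are assumptions for illustration:

```python
import io
import uuid

BUCKET = "my-generated-images"  # hypothetical bucket name

def object_url(bucket: str, key: str) -> str:
    # Public-style S3 URL; use presigned URLs instead for private buckets.
    return f"https://{bucket}.s3.amazonaws.com/{key}"

def upload_image(image) -> str:
    import boto3  # imported lazily so the helper stays optional

    # Serialize the PIL image to an in-memory PNG and upload it.
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    buf.seek(0)
    key = f"{uuid.uuid4().hex}.png"
    boto3.client("s3").upload_fileobj(
        buf, BUCKET, key, ExtraArgs={"ContentType": "image/png"}
    )
    return object_url(BUCKET, key)
```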
Deploy the FastAPI application using Gunicorn
Once we are done with the code, we can use either uvicorn or gunicorn to deploy the application.
Gunicorn lets us run the application with multiple workers and also queues incoming requests, so let's deploy our FastAPI app with Gunicorn:
gunicorn main:app --bind=0.0.0.0:6006 -w 4 -k uvicorn.workers.UvicornWorker --timeout 120
The port number 6006 may vary depending on where you are running the application.
We are using 4 workers (-w 4) for demonstration; you may have to tweak this depending on the GPU memory available, as each worker can take 8–10 GB for a float16 model.
If you are trying this on Jarvislabs.ai, you can quickly get your API link by right-clicking on your running instance and use it in your application.
I hope you found the blog useful. If you face any challenges, let me know in the comments.