Stable Video Diffusion with Replicate

Sumit Sahoo
7 min read · Dec 11, 2023


SVD conversion of Rancho Cucamonga wallpaper by @basicappleguy

The world of AI sure is fascinating. With each passing day, we come across something new that wasn’t possible before. Recently, Stability AI introduced a model that can turn an image into a video. The tech behind it is so fascinating that I immediately wanted to try it… only to fail. Why fail, you ask? Well, even my MacBook Pro’s M2 Max isn’t exactly powerful enough for it. AI models sure do have a love affair with Nvidia’s RTX-series GPUs and CUDA cores. This is where a Mac owner becomes quite envious of the power of an RTX 4090 🥹

Replicate to the rescue

An envious Mac owner meets Replicate. While scratching my head over how to run Stable Video Diffusion, I came across Replicate and suddenly it all seemed possible. If we are hosting everything on the cloud these days, why not rent an Nvidia GPU from a cloud provider and run the model on it? The idea is simple but complex to implement, and I’m glad the folks at Replicate did just that. You can opt for enterprise-grade GPUs like the T4, A40, and the famous A100. Below is the pricing for their usage.

Replicate GPU usage pricing (as of Dec 11, 2023)

You can interact with any model via the API. They have SDKs for Python, Node.js, Swift, and more. Do explore their documentation to learn more.
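
For a feel of how simple the API is, here is a minimal sketch in Python. It assumes the replicate package is installed and REPLICATE_API_TOKEN is set in the environment; the image file name is just an example, and the exact input schema should be checked on the model’s API page.

import replicate

# The client reads REPLICATE_API_TOKEN from the environment.
# Model version is the one used later in this article; "rocket.png" is illustrative.
output = replicate.run(
    "stability-ai/stable-video-diffusion:3f0457e4619daac51203dedb472816fd4af51f3149fa7a9e0b5ffcf1b8172438",
    input={"input_image": open("rocket.png", "rb")},
)
print(output)  # Link to the generated video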

Stable Diffusion

Stable Diffusion is a cutting-edge text-to-image model that leverages deep learning and diffusion techniques to transform textual descriptions into photorealistic images. This groundbreaking technology empowers users to create stunning visuals with just a few words, fostering artistic expression and opening doors to exciting applications across various fields.

Stable Video Diffusion works along the same lines, but the input here is an image and the output is a video. The generated videos are rather short (<= 4 sec), and the model may not achieve perfect photorealism. But even with these shortcomings, the outputs are sometimes jaw-dropping 🤯

A Python app to use Stable Video Diffusion

We will create a Python app that uses Stable Video Diffusion via the Replicate API. We really do not want to burn our own GPU, do we? 😆

Here are the key things we will use:

  1. VS Code
  2. Gradio for UI
  3. Poetry for dependency management
  4. Replicate Python API to use Stable Video Diffusion

So let’s get started.

Step 1: Define a utility class to communicate with Replicate

Let’s code along. As usual, we will start with the core logic, which takes an image path as input and, via the API, returns the URL of a video hosted at Replicate. We will take the Replicate Python API as a reference. Don’t worry, I will share the GitHub repository at the end of the article :)

from src.util.log_util import LogUtil
import replicate


class SVDUtil:
    def __init__(self):
        self.log = LogUtil()
        self.model = "stability-ai/stable-video-diffusion:3f0457e4619daac51203dedb472816fd4af51f3149fa7a9e0b5ffcf1b8172438"

        # Gradio UI can be customized to take the below parameters as input

        # Use svd to generate 14 frames or svd_xt for 25 frames
        self.video_length = "14_frames_with_svd"  # Possible values: 14_frames_with_svd, 25_frames_with_svd_xt
        # Frames per second
        self.frames_per_second = 6
        # Decide how to resize the input image
        self.sizing_strategy = "maintain_aspect_ratio"  # Possible values: maintain_aspect_ratio, crop_to_16_9, use_image_dimensions
        # Increase overall motion in the generated video
        self.motion_bucket_id = 127
        # Amount of noise to add to input image
        self.cond_aug = 0.02
        # Number of frames to decode at a time
        self.decoding_t = 7
        # Random seed. Leave blank to randomize the seed
        self.seed = 0

    # Generate video from image using SVD
    def generate_video_from_image(self, image_path):
        self.log.info("Generating video from image ...")
        try:
            with open(image_path, "rb") as input_image:
                output = replicate.run(
                    self.model,
                    input={
                        "cond_aug": self.cond_aug,
                        "decoding_t": self.decoding_t,
                        "input_image": input_image,
                        "video_length": self.video_length,
                        "sizing_strategy": self.sizing_strategy,
                        "motion_bucket_id": self.motion_bucket_id,
                        "frames_per_second": self.frames_per_second,
                    },
                )
            self.log.info(f"Generated video link: {output}")
            return (output, output)
        except Exception as e:
            self.log.error(f"Error: {e}")
            return (None, "Unable to generate video from image")

As you can see from the above code, we have defined the model and a few parameters needed by Stable Video Diffusion; the parameters are fairly self-explanatory.

Also, we have defined a method generate_video_from_image which uses the Replicate API to generate a video and sends back the URL to that video. The generated video link remains valid for only 1 hour, after which both the input and output are deleted. Prediction deletion details are documented here.

The returned output is a tuple containing the same URL twice: one for the Video component and one for the Textbox component that shows the URL.
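
Since the hosted link expires after an hour, you may want to persist the result locally. Below is a small sketch of how that could be done; the requests library, the helper name, and the output path are my own additions and are not part of the app.

import os
import requests

def save_video(url: str, path: str = "outputs/generated.mp4") -> None:
    # Download the generated video before the hosted link expires
    os.makedirs(os.path.dirname(path), exist_ok=True)
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)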

Step 2: Define Gradio UI

Now that we have a utility class, we can proceed with the UI. We need an image input and two outputs: one for the video and one to show the video URL.

with gr.Blocks(
    title="IMG2VID",
    theme=self.theme,
    # css=custom_css,
) as img_to_vid:
    gr.Image(
        "./images/logo.svg",
        height=80,
        width=400,
        interactive=False,
        container=False,
        show_download_button=False,
    )

    gr.Textbox(
        value="(SVD) Image-to-Video is a latent diffusion model trained to generate short video clips from an image conditioning. This model was trained to generate 14 frames at resolution 576x1024 given a context frame of the same size. The generated videos are rather short (<= 4sec), and the model may not achieve perfect photorealism.",
        show_label=False,
        interactive=False,
        container=False,
    )

    gr.Interface(
        fn=self.svd_util.generate_video_from_image,
        inputs=[
            gr.Image(
                label="Select Image",
                sources=["upload"],
                type="filepath",
                height=400,
            ),
        ],
        outputs=[
            gr.Video(
                label="Generated Video",
                autoplay=True,
                height=400,
            ),
            gr.Textbox(
                label="Video URL",
                info="URL will be valid for 1 hour only and content will be deleted after this",
                show_copy_button=True,
            ),
        ],
        examples=[
            "./images/example/example1.png",
            "./images/example/example2.png",
            "./images/example/example3.png",
        ],
        allow_flagging="never",
    )

img_to_vid.queue().launch(
    favicon_path="./images/favicon.ico",
    debug=False,
    show_api=False,
    server_name="0.0.0.0",
    server_port=8080,
    share=False,
    allowed_paths=["./images/", "./outputs/"],
)

Here we have used Gradio Blocks as the high-level container. For the main input and output, we have used Interface, which makes it easy to define input and output components.

Interface also takes a function that generates the data shown in the output components. The Video and Textbox components take the output of self.svd_util.generate_video_from_image and display it. This is the same utility function we defined in Step 1.

Step 3: Define the environment variable

Here you need to get your own Replicate API key. You can easily get one from your account page after signing up. You may have to enter credit card details to access the model and GPU.

Create a .env file at the root of the project with the below two variables:

GRADIO_SERVER_PORT=8080
REPLICATE_API_TOKEN=your_key_here

GRADIO_SERVER_PORT is optional as we have already defined the port in the launch config.
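
If you use python-dotenv to load these values (an assumption on my part; the Replicate client only needs the variable to be present in the environment), the loading can look like this at app startup:

from dotenv import load_dotenv

# Read REPLICATE_API_TOKEN (and GRADIO_SERVER_PORT) from the .env file
# into the process environment before the Replicate client is used.
load_dotenv()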

Step 4: Run the app

I have only provided the main code snippets but you can find the complete source at my GitHub repository: https://github.com/sumitsahoo/img-to-video-svd

Feel free to fork it and open a PR if you feel you can improve it further. Happy to get feedback 🙏🏻

Since we are using Poetry for dependency management, make sure to install the dependencies first.
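
If you have not done so already, install them from the project root with the standard Poetry command:

poetry install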

To run the app, use the below command:

poetry run python main.py

Alternatively, you can use the VS Code debug launch option; a launch.json is included in the repository. The app should be running locally at http://127.0.0.1:8080/

Once the app is running, it should look similar to below:

IMG2VID app running at localhost

I have included a few example images that you can use, or you can upload your own. Make sure to upload smaller images for better results. Based on sizing_strategy, the image will be resized and then processed. You can download the generated video by copying the video link or by clicking the download button at the top right of the Video component.

Here is an example output:

SVD conversion of Rocket image

Amazing, isn't it?

Replicate made all this possible with their API. It is not limited to Python; you can use Node.js or any other stack you prefer. As long as you can make a REST call, nothing is stopping you 🤓

Deployment

Now that we have created an app with Gradio, the next step is deploying it. It’s really simple. You can use the provided Dockerfile to build an image and deploy it to any serverless cloud service that accepts container images (for example, Cloud Run on Google Cloud). The build steps are provided in the README file in the GitHub repository. With a multi-stage build, I made sure the resulting image only has the modules that are essential, so the size is quite small, under 600 MB. If you are on Apple Silicon like me, make sure to build an amd64 image, as most cloud providers do not support arm64 images.
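
For reference, a cross-platform build from an Apple Silicon machine typically uses Docker buildx, roughly like the below; the image tag is only an example, and the exact build steps are in the repository’s README.

docker buildx build --platform linux/amd64 -t img-to-video-svd .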

arm64 Docker image

Conclusion

Stable Video Diffusion was just the beginning. Imagine what other models we can try now that we do not have GPU restrictions. Replicate did not pay me to write this, but it’s that good, and it is only going to get better in the future. With this, I am no longer jealous that I do not possess an RTX 4090 😂

Thanks.
