Comparing Stable Video Diffusion 1.0 and 1.1

Pedro Torruella
Published in OctoAI
Mar 1, 2024 · 4 min read

With things getting interesting in Generative AI, especially in the AI media generation space, I want to pause and look at Stable Video Diffusion (SVD), Stability AI’s image-to-video generation model. It takes a static image as input and generates an animation a few seconds long. It was released in November 2023, and we have already seen an update to version 1.1 earlier this month. According to the model card on Hugging Face, this update came in the form of a fine-tune aimed at increasing the consistency of the outputs. In this post I want to do an initial exploration and check what the differences are between the two versions.

Example input image used for the comparison: a cat wearing an astronaut suit, floating in space, with a red alien planet against a dark, starry sky in the background.

Now, I want to keep perspective and be honest with you: as of today, SVD has been surpassed by what other alternatives, like OpenAI’s Sora, have demonstrated. At least in my experience, there are challenges when it comes to generating good results with human faces or getting objects to move in the right direction.

However, there are three points to consider:

  • How much would a generation cost, in terms of $/frame, on closed-source alternatives?
  • How much control or customization will one have with closed-source alternatives?
  • How much quality does one actually need for a given application?

There are a few business tasks and use cases that could benefit from a cost-efficient image-to-video generation component without requiring a lot of quality, like creating e-commerce graphics for banners and social media. According to Stability AI’s pricing page, each SVD generation costs 20 of their credits, which as of today equates to $0.20. That doesn’t sound too bad for a few seconds of animated video from an image.
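To connect this back to the $/frame question above, here is a quick back-of-the-envelope calculation. The per-credit price and the 25-frames-per-clip figure are my assumptions (the latter is the SVD-XT checkpoint’s default output length; the base SVD checkpoint produces 14 frames):

# Back-of-the-envelope $/frame estimate.
CREDITS_PER_GENERATION = 20
USD_PER_CREDIT = 0.01          # assumption: $10 buys 1,000 credits on Stability AI's pricing page
FRAMES_PER_GENERATION = 25     # assumption: SVD-XT default; the base SVD checkpoint produces 14

cost_per_generation = CREDITS_PER_GENERATION * USD_PER_CREDIT   # $0.20
cost_per_frame = cost_per_generation / FRAMES_PER_GENERATION    # $0.008

print(f"${cost_per_generation:.2f} per clip, ${cost_per_frame:.3f} per frame")

Under those assumptions, that is less than a cent per frame.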

Is that a fair price to pay to generate a graphic that is highly customized to one’s target audience, for example? I can’t answer that question right now. However, in this blog post you will be able to watch a simple, informal side-by-side comparison of the performance of SVD 1.0 vs. 1.1, and hopefully get some inspiration out of the exercise.

Given that the 1.1 weights were obtained by simply fine-tuning 1.0, the question arises: will there be much of a difference? Is one of them the outright best? Let’s find out.

Side-by-Side Comparisons

To help illustrate the findings, I created a set of videos with both SVD 1.0 and 1.1, using the same input images and the same micro-conditioning parameters.

For reference, we fixed the micro-conditioning parameters to the following values:

# Guides how much motion to add.
motion_bucket_id: int = 127

# Guides how much the video should resemble the input image.
noise_aug_strength: float = 0.1

# Your good ol' frames per second.
fps: int = 10
Side by side comparisons of the videos generated with SVD 1.0 and 1.1.
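If you want to reproduce this kind of comparison locally, here is a minimal sketch using Hugging Face’s diffusers library. This is not necessarily how the videos in this post were produced (those ran on OctoAI); the model IDs match the public checkpoints on Hugging Face (the 1.1 repo may be gated, so you may need to accept its license and authenticate first), while the input file name, seed, and decode_chunk_size are my assumptions for illustration:

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Public checkpoints on Hugging Face.
CHECKPOINTS = {
    "svd_1_0": "stabilityai/stable-video-diffusion-img2vid-xt",
    "svd_1_1": "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
}

# Hypothetical input file; the pipeline works best with ~1024x576 images.
image = load_image("astronaut_cat.png")

for name, repo in CHECKPOINTS.items():
    pipe = StableVideoDiffusionPipeline.from_pretrained(
        repo, torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")

    # Same micro-conditioning values as above, same seed for a fair comparison.
    frames = pipe(
        image,
        motion_bucket_id=127,
        noise_aug_strength=0.1,
        fps=10,
        decode_chunk_size=8,  # decode latents in chunks to reduce VRAM usage
        generator=torch.manual_seed(42),
    ).frames[0]

    export_to_video(frames, f"{name}.mp4", fps=10)

Fixing the seed alongside the micro-conditioning parameters is what makes the side-by-side comparison meaningful: any remaining differences come from the weights, not the sampling noise.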

Performance Review

As mentioned above, I wanted to get my first impressions and just take the models out for an informal spin. My high-level takeaway is that, at least with these parameters, SVD 1.1 seems to outperform with more realistic images, whereas SVD 1.0 seems to take the lead with artistic images. To make it more fun, I ran a quick internal survey about which videos were the best, and these were the results provided by my fellow Octonauts:

Results of quick informal polling within OctoAI’s staff.

After watching the videos, what are your impressions? Would you agree with this generalization?

Could this have been a side effect of the fine-tune? This is also a good reminder of how important it is to be able to customize the model beyond micro-conditioning parameters, at the latent layer, like we can with Stable Diffusion through LoRAs, ControlNets, VAEs, or Textual Inversions (see our previous OctoAI post about it here).
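As a point of reference for what that kind of customization looks like on the image side, here is a minimal sketch of loading a LoRA into a Stable Diffusion pipeline with diffusers. The LoRA file name and prompt are hypothetical; the point is how little code this takes once the hooks exist:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights that steer the model toward a custom style.
# "my_brand_style.safetensors" is a hypothetical file for illustration.
pipe.load_lora_weights("my_brand_style.safetensors")

image = pipe(
    "product shot of a sneaker, studio lighting",
    num_inference_steps=30,
).images[0]
image.save("sneaker.png")

SVD currently offers nothing equivalent: beyond the micro-conditioning parameters, there is no comparable hook for steering its outputs.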

Conclusion

While SVD today is not on par with some of the revolutionary advances we have seen in recent weeks, I think it is still worth looking at, as there are business tasks and use cases that could benefit from an efficient image-to-video generation pipeline. Consider that one does not need the capability to create a feature-length movie to make a short presentation more appealing, help an educational slide deliver a more effective message, or bring motion to a product photograph to convert more buyers.

There are differences in performance between SVD 1.0 and 1.1, and there isn’t a clear winner. I am looking forward to seeing what customization capabilities we are likely to get for SVD, hopefully similar to MotionLoRAs for AnimateDiff. I will also be working on a more comprehensive profiling in the coming weeks. Did you find a good combination of parameters? Is there something you would like to see? Do let me know!

Finally, we at OctoAI are very excited about the possibilities and, as such, are preparing something very special in the media generation space! If you’d like to engage with the broader OctoAI community and teams, don’t forget to join us on our Discord server.
