OpenAI Sora

Stephen Jacob
3 min read · Feb 27, 2024


On 15 Feb 2024, OpenAI released a new video generation model called Sora. It allows users to create video from text and is light-years ahead of anything I’ve seen before. As a relative AI newbie, I wanted to jot down my notes on what it is and why I think it’s particularly ground-breaking.

What’s a diffusion transformer model?

Apparently, there are different types of transformer models. My understanding is that transformers are the architecture underlying Large Language Models (LLMs). Their main use is to change or transform an input into a coherent output. The best-known application would be chatbots that give the user the feeling that they are conversing with another person. They take questions and return answers using context, but they also provide the subtle details that remind us of human interactions.
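The mechanism at the heart of a transformer is attention: each token in the input weighs every other token to decide what matters for the output. A minimal sketch in NumPy, with toy dimensions and random weights standing in for a trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each token attends to each other token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # context-aware mixture of the value vectors

rng = np.random.default_rng(0)
seq_len, d = 4, 8                            # illustrative sizes, not any real model's
tokens = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one context-enriched vector per input token
```

A real transformer stacks many such attention layers (with learned weights) and feed-forward layers, but the "every token looks at every other token" idea is the same.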

Diffusion models take random noise and iterate towards a coherent sample. This is done with the help of computer vision models that have been trained on huge data sets. Imagine having every picture ever made and then breaking them into puzzle pieces. With enough time, you can make a mosaic of one image using inputs from a variety of other images.

When you repeat this process over and over, you can recreate a believable image. For example, the image below makes sense at a certain level, even though the individual components are not part of the original image.
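The iterate-from-noise idea can be sketched in a few lines. In a real diffusion model the denoiser is a trained neural network; here, purely for illustration, a stand-in "denoiser" just nudges the sample toward a fixed target at each step:

```python
import numpy as np

def denoise_step(x, target, strength=0.1):
    # Stand-in for a trained model: remove a little of the remaining "noise"
    # by nudging the sample toward the target. (Toy substitute, not a real denoiser.)
    return x + strength * (target - x)

rng = np.random.default_rng(0)
target = np.ones(16)            # the "clean image" we want to recover (hypothetical)
x = rng.normal(size=16)         # start from pure random noise

for step in range(100):         # each pass pulls the sample closer to coherence
    x = denoise_step(x, target)

print(np.abs(x - target).max())  # tiny: the noise has been iterated away
```

The point of the sketch is the shape of the process: many small denoising steps, each one producing a slightly more coherent sample than the last.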

Source: Turbo Mosaic
Source: Jay Alammar

So if you combine a transformer with a diffusion model and apply it to video, you end up with a tool that can create video from text prompts.
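OpenAI's technical report describes the combination roughly like this: video is compressed into a latent representation, cut into "spacetime patches" that serve as tokens, and a transformer is trained to denoise those tokens. A rough sketch of the patchifying step (the shapes and patch sizes here are illustrative guesses, not Sora's actual values):

```python
import numpy as np

def to_spacetime_patches(video, pt=2, ph=4, pw=4):
    """Cut a (frames, height, width, channels) array into flat patch tokens."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    return (video
            .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
            .transpose(0, 2, 4, 1, 3, 5, 6)      # bring the three patch axes together
            .reshape(-1, pt * ph * pw * C))      # one flat row per spacetime patch

video = np.zeros((8, 16, 16, 3))                 # tiny dummy clip
tokens = to_spacetime_patches(video)
print(tokens.shape)  # (64, 96): 4*4*4 patches, each holding 2*4*4*3 values
```

Each row is a small block of the video in both space and time, so the same transformer machinery that operates on word tokens can operate on chunks of video.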

What makes Sora really powerful is its ability to create videos of different aspect ratios, resolutions, colors, and shadings. For example, watching an animation, we intuitively know how physics works: heavy objects behave differently than light ones. To pass as believable, a video needs objects to move as they would in the real world.

The publicly available videos of Sora’s work have some flaws, but at first blush, most of them could pass as drone footage or other real-world video.

This is particularly exciting for anyone working with video or related media. We do need to take care that deepfakes do not become commonplace. Not everyone will be as responsible as OpenAI and clearly label their work as fiction.

However, just like we can go to the cinema and enjoy a sci-fi movie without wondering if actors really went to outer space, we should be able to have a future that benefits from the use of GenAI in video without worrying that we’ve been had.


Stephen Jacob

Startup founder focused on the attention economy. ReelBlend.io. Ex-DealStreetAsia. Harvard graduate.