Techniques behind OpenAI Sora

Tom Gou
6 min read · Feb 19, 2024


Sora = Video DiT = [VAE Encoder + ViT + Conditional Diffusion + DiT Block + VAE Decoder]

What is Sora?

Sora is an AI model that can create realistic and imaginative scenes from text instructions.

You can check it out at https://openai.com/sora.

It is a state-of-the-art (SOTA) text-to-video model that can generate high-quality, high-fidelity 1-minute videos with different aspect ratios and resolutions.

Techniques behind Sora

OpenAI has also released a technical report. Here are some takeaways on Sora:

  • Sora is built on the Diffusion Transformer (DiT) model (Scalable Diffusion Models with Transformers, ICCV 2023)
  • Sora uses visual patches for generative modeling (ViT-style patches for video inputs)
  • A “video compressor network” (visual encoder and decoder, probably a VAE)
  • Scaling transformers (Sora has proven that diffusion transformers scale effectively)
  • Native-resolution training data, e.g., 1920x1080 videos (no cropping)
  • Re-captioning (as in OpenAI DALL·E 3) and prompt extension (with OpenAI GPT)

Sora Possible Architecture

Sora = Video DiT = [VAE Encoder + ViT + Conditional Diffusion + DiT Block + VAE Decoder]

From the OpenAI Sora technical report and Saining Xie’s Twitter, we can tell that Sora is based on Diffusion Transformer models. It borrows heavily from DiT, ViT, and diffusion models, without many fancy additions.

Before Sora, it was unclear whether long-form consistency could be achieved. Usually, these kinds of models can only generate 256×256 videos of a few seconds. “We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.” Sora has shown that this long-form consistency can be achieved with end-to-end training on what is presumably internet-scale data.

From Diffusion Models to DiT and Sora

The Diffusion Transformer (DiT) model was introduced in Scalable Diffusion Models with Transformers (ICCV 2023).

Basically, the Diffusion Transformer (DiT) is a diffusion model with a Transformer backbone (instead of a U-Net).

A typical diffusion model looks like the one below (from High-Resolution Image Synthesis with Latent Diffusion Models):

Diffusion models are generative models that can produce diverse, high-resolution images. They work by gradually adding Gaussian noise to the original data in the forward diffusion process (adding noise) and then learning to remove the noise in the reverse diffusion process (denoising).
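To make the forward (noising) process concrete, below is a minimal PyTorch sketch of the closed-form noising step x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε used by DDPM-style diffusion models; the schedule, shapes, and function names are illustrative assumptions, not Sora’s.

```python
import torch

def forward_diffusion(x0, t, alpha_bar):
    """Closed-form forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    eps = torch.randn_like(x0)                          # Gaussian noise
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast a_bar_t over all data dims
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * eps         # noised sample at step t
    return xt, eps                                      # eps is the denoiser's training target

# Example: a linear beta schedule and a batch of latents noised at random timesteps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.randn(8, 4, 32, 32)                          # pretend image latents
t = torch.randint(0, T, (8,))
xt, eps = forward_diffusion(x0, t, alpha_bar)
```

The reverse (denoising) model is then trained to predict eps from xt and t, which is the objective DiT-style models optimize.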

The Diffusion Transformer (DiT) model is based on the diffusion model and replaces the U-Net with a Transformer. To process visual data, DiT also leverages the classic Vision Transformer (ViT) and rearranges the 2D visual features into a sequence. The graph below shows how Patch + Position Embedding works in the Vision Transformer (ViT).
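As a complement to that diagram, here is a minimal PyTorch sketch of the ViT/DiT-style “patchify + position embedding” step; the channel count, patch size, and embedding dimension are assumptions for illustration, not Sora’s actual values.

```python
import torch
import torch.nn as nn

class Patchify2D(nn.Module):
    """Split a 2D feature map into non-overlapping patches and add position embeddings,
    as in ViT/DiT. Sizes here are illustrative assumptions."""
    def __init__(self, channels=4, patch=2, dim=768, grid=16):
        super().__init__()
        # A strided conv is the standard trick: one projection per patch, no overlap.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, dim))  # learned position embeddings

    def forward(self, x):                  # x: (B, C, H, W) latent from the visual encoder
        x = self.proj(x)                   # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)   # (B, N, dim) sequence of patch tokens
        return x + self.pos                # add position information to each token

tokens = Patchify2D()(torch.randn(2, 4, 32, 32))   # -> (2, 256, 768)
```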

Like the Diffusion Transformer (DiT), Sora also uses Patch + Position Embedding to rearrange the 2D visual features into a sequence.

Based on these techniques from the Diffusion Transformer (DiT), diffusion models, and the Vision Transformer (ViT), we can sketch the Sora architecture as:

Sora = Video DiT = [VAE Encoder + ViT + Conditional Diffusion + DiT Block + VAE Decoder]

Sora is not fundamentally different from DiT, but it has proven its long-term spacetime consistency by scaling up the model and training data.
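To show what the “Conditional Diffusion + DiT Block” part of that equation could look like, here is a simplified PyTorch sketch of a DiT-style Transformer block conditioned on a timestep/text embedding via adaptive LayerNorm (adaLN), roughly following the DiT paper; it is a speculative stand-in, not Sora’s actual block.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block whose scale/shift/gate are regressed from a conditioning vector
    (timestep + text embedding), in the spirit of DiT's adaLN conditioning."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 6 * dim)   # conditioning -> per-block scale/shift/gate

    def forward(self, x, cond):              # x: (B, N, dim) patch tokens, cond: (B, dim)
        s1, b1, g1, s2, b2, g2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h)[0]    # gated self-attention
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)              # gated MLP
        return x

out = DiTBlock()(torch.randn(2, 256, 768), torch.randn(2, 768))   # -> (2, 256, 768)
```

Stacking such blocks over the patch sequence, projecting the tokens back to the latent grid, and running the usual reverse-diffusion loop is the generic DiT recipe; how exactly Sora conditions on text is not disclosed.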

Also, there are still a lot of unknown tech details behind Sora.

Other Techniques

Scaling Transformers

“Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.”

The graph below shows samples at base compute (left), 4x compute (middle), and 32x compute (right). We can tell that 32x compute gives much better results than base compute.

Sora has proven that diffusion transformers scale effectively as video models as well. As training compute increases, sample quality improves markedly.

Native-size training data (e.g., 1920x1080 videos)

“We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.”

For video models, training on native-size data could improve framing.

DALL·E 3 re-captioning

“We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.”

“Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high-quality videos that accurately follow user prompts.”

What are Visual Patches?

“Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.”

Visual patches here are the latent features produced by the visual encoder (VAE), which are then split into patches and combined with position embeddings to form a sequence of tokens.

“Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens.”

Visual Patches are also spacetime patches since the input visual data are videos.
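Here is a minimal sketch of how a compressed video latent could be cut into spacetime patches and flattened into transformer tokens; the latent layout (B, C, T, H, W) and the patch sizes are assumptions.

```python
import torch

def spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Cut a compressed video latent (B, C, T, H, W) into (pt x ph x pw) spacetime patches
    and flatten them into a token sequence."""
    B, C, T, H, W = latent.shape
    x = latent.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)   # bring the patch-grid dims to the front
    return x.reshape(B, (T // pt) * (H // ph) * (W // pw), C * pt * ph * pw)

tokens = spacetime_patches(torch.randn(1, 4, 16, 32, 32))   # -> (1, 2048, 32) tokens
```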

Visual Encoder and Decoder

OpenAI calls this pair the “video compressor network”: “We train a network that reduces the dimensionality of visual data.”

From Saining Xie’s Twitter, it looks like the “video compressor network” is just a VAE, but trained on raw video data.
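Here is a toy sketch of such a video compressor: a small VAE-style encoder/decoder built from 3D convolutions that downsample time and space into a low-dimensional latent; all channel counts and strides are assumptions, not Sora’s.

```python
import torch
import torch.nn as nn

class TinyVideoVAE(nn.Module):
    """Toy video compressor: 3D convs downsample (T, H, W) by 4x into a small latent,
    and transposed convs map the latent back to pixels."""
    def __init__(self, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(64, 2 * latent_channels, kernel_size=3, stride=2, padding=1),  # mean + logvar
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=4, stride=2, padding=1),
        )

    def encode(self, video):                        # video: (B, 3, T, H, W)
        mean, logvar = self.encoder(video).chunk(2, dim=1)
        return mean + torch.randn_like(mean) * (0.5 * logvar).exp()   # reparameterization trick

    def decode(self, z):
        return self.decoder(z)

vae = TinyVideoVAE()
z = vae.encode(torch.randn(1, 3, 16, 64, 64))       # -> (1, 4, 4, 16, 16) latent
recon = vae.decode(z)                               # -> (1, 3, 16, 64, 64) reconstruction
```

In a Sora-like pipeline, the diffusion transformer would operate entirely on these compressed latents, and the decoder would only be run once at the end to map generated latents back to video frames.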

Takeaways

Sora = Video DiT = [VAE Encoder + ViT + Conditional Diffusion + DiT Block + VAE Decoder]

Sora has proven the scalability of Diffusion Transformer (DiT)-like models. With this model setup, similar text-to-video results should be achievable by scaling up GPUs and training data.

However, to achieve the long-term spacetime consistency shown by Sora, there are still a lot of unknown techniques.

Sora is the ChatGPT moment for the video generation industry, and I believe similar models will be released soon by one of the top companies (Meta, Google, Amazon). There might also be a LLaMA moment in this industry if one of them open-sources such a model.

BTW, now that Sora is out, the video generation architectures from Pika and Runway look out-of-date. They must be hurrying to switch to Sora-like architectures. For me, a LLaMA moment in video generation would be even more exciting.

References

https://openai.com/research/video-generation-models-as-world-simulators

A Very Short Introduction to Diffusion Models

How Sora works — detailed explanation (speculation)

Nvidia Researcher: OpenAI’s Sora is Amazing, but not because of the Patches
