Exploring the Core: A Technical Deep Dive into OpenAI’s Sora Model

RickyYang · Published in The Deep Hub · 7 min read · Feb 19, 2024


In recent days, the tech community has been abuzz with news of OpenAI’s groundbreaking model, Sora, which can generate minute-long videos of a quality that surpasses existing video generation models. Its remarkable effectiveness has drawn widespread acclaim. Intriguingly, OpenAI did not adopt a novel model architecture for Sora. As with GPT, the Sora model combines Diffusion and Transformer structures, while scaling up the model and training it on a more extensive dataset. This approach, reminiscent of the leap achieved with GPT-3, has yielded the formidable tool that is Sora. Below, we attempt to decode the principles behind Sora, drawing on the information OpenAI has released.

Transformer Structure

It’s well-known that the Transformer architecture, introduced by Google in their seminal paper “Attention is All You Need,” has proven exceptionally effective in understanding language. Here’s a brief overview of this structure:

  • Sentences are first split into tokens, the smallest units of meaning the model works with (e.g., “Loves” may be normalized to “love”).
  • Each token is then mapped to an embedding vector, augmented with token position information, and fed into the model (see the sketch after this list).

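To make these two steps concrete, here is a minimal sketch in PyTorch of converting a toy sentence into token IDs, looking up embeddings, and adding positional information. The tiny vocabulary, the dimensions, and the use of learned positional embeddings are illustrative assumptions, not details taken from Sora.

```python
import torch
import torch.nn as nn

# A minimal, illustrative sketch of the two bullet points above -- not Sora's
# actual code. The vocabulary, dimensions, and learned positional embeddings
# are assumptions made for readability.
vocab = {"<pad>": 0, "she": 1, "loves": 2, "movies": 3}
d_model = 8    # embedding dimension (kept deliberately small)
max_len = 16   # maximum sequence length

token_embedding = nn.Embedding(len(vocab), d_model)
position_embedding = nn.Embedding(max_len, d_model)  # learned positions

sentence = ["she", "loves", "movies"]
token_ids = torch.tensor([[vocab[w] for w in sentence]])   # shape: (1, 3)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # shape: (1, 3)

# Each resulting vector encodes both what the token is and where it sits,
# which is what the Transformer layers receive as input.
x = token_embedding(token_ids) + position_embedding(positions)  # (1, 3, 8)
print(x.shape)  # torch.Size([1, 3, 8])
```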