Exploring the Core: A Technical Deep Dive into OpenAI’s Sora Model

RickyYang · Published in The Deep Hub · 7 min read · Feb 19, 2024


In recent days, the tech community has been abuzz with news of OpenAI’s groundbreaking model, Sora, which can generate minute-long videos of a quality that surpasses existing video generation models. Its remarkable effectiveness has drawn widespread acclaim. Intriguingly, OpenAI did not adopt a novel model architecture for Sora. As with GPT, the Sora model combines Diffusion and Transformer structures, while scaling up the model and training it on a more extensive dataset. This approach, reminiscent of the leap achieved with GPT-3, has yielded the formidable tool that is Sora. Below, we attempt to decode the principles behind Sora, drawing on the information OpenAI has released.

Transformer Structure

It’s well-known that the Transformer architecture, introduced by Google in their seminal paper “Attention is All You Need,” has proven exceptionally effective in understanding language. Here’s a brief overview of this structure:

  • Sentences are first split into tokens, the smallest units of meaning the model works with (e.g., “Loves” may be normalized to “love”).
  • Each token is then mapped to an embedding vector, augmented with token position information, and fed into the model (see the sketch after this list).

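To make these two steps concrete, here is a minimal sketch in PyTorch of converting a toy sentence into token IDs, looking up embeddings, and adding positional information. The tiny vocabulary, the dimensions, and the use of learned positional embeddings are illustrative assumptions, not details taken from Sora.

```python
import torch
import torch.nn as nn

# A minimal, illustrative sketch of the two bullet points above -- not Sora's
# actual code. The vocabulary, dimensions, and learned positional embeddings
# are assumptions made for readability.
vocab = {"<pad>": 0, "she": 1, "loves": 2, "movies": 3}
d_model = 8    # embedding dimension (kept deliberately small)
max_len = 16   # maximum sequence length

token_embedding = nn.Embedding(len(vocab), d_model)
position_embedding = nn.Embedding(max_len, d_model)  # learned positions

sentence = ["she", "loves", "movies"]
token_ids = torch.tensor([[vocab[w] for w in sentence]])   # shape: (1, 3)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # shape: (1, 3)

# Each resulting vector encodes both what the token is and where it sits,
# which is what the Transformer layers receive as input.
x = token_embedding(token_ids) + position_embedding(positions)  # (1, 3, 8)
print(x.shape)  # torch.Size([1, 3, 8])
```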