OpenAI Unveils Sora

Published in ReadyAI.org · 6 min read · Feb 22, 2024

A Groundbreaking Generative Video Model, Currently in Limited Safety Testing Phase

https://openai.com/sora

By: Rooz Aliabadi, Ph.D.

OpenAI has developed a new generative video model named Sora (which translates to “sky” in Japanese) that can transform brief text descriptions into detailed, high-definition video clips up to one minute long. Judging by the sample videos the San Francisco-based company has released, Sora significantly advances the state of text-to-video generation. The team behind Sora and others view the creation of models that comprehend video, and the complex interactions within our world, as a critical milestone in the development of future AI systems.

The first generative models capable of creating video from text snippets emerged in late 2022. However, the early iterations from Meta, Google, and a startup named Runway were glitchy and low-resolution. Since then, the technology has improved rapidly: Runway’s second-generation model, launched last year, can generate brief clips that nearly rival the quality of major-studio animations. Yet most of these clips remain limited to just a few seconds.

Gen-2 by Runway

The demonstration videos from OpenAI’s Sora feature high-definition and intricate details. OpenAI also says it is capable of producing videos lasting up to a minute. An example video depicting a street scene in Tokyo demonstrates Sora’s understanding of how objects are arranged in three dimensions, with the camera diving into the scene to track a couple walking by a series of storefronts.

OpenAI also claims that Sora effectively manages occlusion. A common issue with current models is their inability to maintain continuity for objects once they are no longer in view. For instance, if a truck moves in front of a street sign, there’s a chance the sign may not reappear as it should afterward.

Intro to Sora by OpenAI

In a video showcasing an underwater scene created from papercraft, Sora has introduced what appear to be seamless transitions between various segments of footage while ensuring a uniform style throughout.

Sora has its challenges. In the Tokyo video, cars on the left appear smaller than the pedestrians walking next to them, and they intermittently pop in and out among the tree branches. Long-term consistency also needs work: if a character disappears from the scene for an extended period, the model seemingly loses track of them and they fail to reappear. This remains a significant challenge.

As remarkable as they are, the showcased sample videos were likely selected to show Sora at its best. Without additional details, it is hard to gauge how representative these examples are of the model’s typical output, and Sora’s full capabilities may remain under wraps for a while. OpenAI’s unveiling of Sora today is a compelling glimpse of the technology, but the company says it has no immediate intention of making it publicly available. Instead, OpenAI is beginning to share the model with external safety evaluators today.

Specifically, the company is concerned about the risks of creating counterfeit yet photorealistic videos. They are proceeding cautiously regarding deployment, ensuring all precautions are taken before making this technology accessible to the broader public.

However, OpenAI is considering a product release in the future. In addition to working with safety evaluators, the company is collaborating with a carefully chosen cohort of videographers and artists to gather insights on optimizing Sora for creative professionals. Another aim is to offer everyone a glimpse of the future, a sneak peek at the potential capabilities of these models.

To develop Sora, the team modified the technology used in DALL-E 3, the most recent iteration of OpenAI’s premier text-to-image model. Like other text-to-image models, DALL-E 3 employs a technique known as a diffusion model, which is designed to transform a scatter of random pixels into a coherent image.
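The core idea of a diffusion model can be sketched in a few lines: corrupt data with noise, then learn to undo the corruption. The toy sketch below (numpy only, with the “denoiser” replaced by the exact inverse, since we know the noise) is purely illustrative of the forward/reverse idea, not OpenAI’s implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "clean image": here just an 8x8 array of pixel values in [0, 1].
clean = rng.random((8, 8))

# Forward process: blend the image toward Gaussian noise.
# At t=0 we keep the image; at t=1 it is essentially random pixels.
def add_noise(image, t):
    noise = rng.standard_normal(image.shape)
    return np.sqrt(1 - t) * image + np.sqrt(t) * noise, noise

noisy, noise = add_noise(clean, t=0.5)

# A real diffusion model trains a neural network to predict the noise
# (or the clean image) from the noisy input; generation then runs the
# process in reverse, starting from pure random pixels. Here we cheat
# and invert the blend exactly to show the relationship.
predicted_clean = (noisy - np.sqrt(0.5) * noise) / np.sqrt(0.5)
```

In generation, the model has no access to the true noise; it iteratively estimates and removes it over many small steps.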

Turning visual data into patches

Sora extends this methodology to videos instead of static images. Additionally, the researchers have incorporated another method into the process. Distinct from DALL-E or most other generative video models, Sora integrates its diffusion model with a kind of neural network known as a transformer.

Given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches.
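The caption above describes a regression objective: given noisy patches (plus text conditioning), predict the clean patches. A hypothetical loss computation might look like the following, where the model is a trivial stand-in and all shapes and names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

batch, num_patches, patch_dim = 2, 16, 64

clean_patches = rng.standard_normal((batch, num_patches, patch_dim))
noise = rng.standard_normal((batch, num_patches, patch_dim))
noisy_patches = clean_patches + 0.3 * noise

# Stand-in for the diffusion transformer: in reality this would be a
# neural network conditioned on the text prompt and the noise level.
def model(noisy, text_embedding):
    return noisy  # placeholder prediction

predicted = model(noisy_patches, text_embedding=None)

# Mean-squared error between the prediction and the clean targets;
# training pushes this toward zero across many noise levels.
loss = np.mean((predicted - clean_patches) ** 2)
```

With the placeholder model the loss simply measures the injected noise; a trained network would drive it far lower.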

Transformers excel at handling extended data sequences, such as text, which has led to their pivotal role in large language models like OpenAI’s GPT-4 and Google DeepMind’s Gemini. However, videos consist not of words but of visual information. To adapt videos for processing by transformers, the researchers devised a method to segment videos into portions analogous to textual tokens. This technique slices the videos across both spatial and temporal dimensions, like cutting a stack of video frames into small cubes.
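The “stack of frames cut into small cubes” can be pictured as a reshape: a video of shape (frames, height, width, channels) is divided into spacetime patches, each spanning a few frames and a few pixels. A minimal numpy sketch, with patch sizes chosen arbitrarily for illustration:

```python
import numpy as np

# A toy video: 8 frames of 32x32 RGB pixels.
video = np.zeros((8, 32, 32, 3))

# Spacetime patch size: 2 frames deep, 4x4 pixels wide.
pt, ph, pw = 2, 4, 4
T, H, W, C = video.shape

# Slice along time and space, then flatten each cube into one vector,
# yielding a sequence of "tokens" a transformer can process.
patches = (
    video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
         .transpose(0, 2, 4, 1, 3, 5, 6)
         .reshape(-1, pt * ph * pw * C)
)

# 4 * 8 * 8 = 256 patches, each holding 2 * 4 * 4 * 3 = 96 values.
```

Because any video, whatever its resolution or length, reduces to such a sequence, the same transformer can be trained on heterogeneous footage.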

Sample quality improves markedly as training compute increases.

The transformer within Sora processes these video segments much as a transformer in a large language model processes words within a block of text. This innovation allowed the team to train Sora on a much wider variety of videos, differing in resolution, length, aspect ratio, and orientation, significantly enhancing the model’s performance, an advance not previously seen in existing research.

From a technical standpoint, this is a considerable advancement. However, there’s a flip side. The enhanced expressiveness allows a broader range of individuals to become storytellers through video. Yet, it also introduces significant potential for misuse. The widespread misuse of deepfake imagery is already a concern, and photorealistic video escalates these issues to a new level.

Individuals might exploit technology of this nature to spread misinformation about conflict areas or demonstrations. The diversity of styles this technology offers is particularly noteworthy. For instance, if individuals create footage with a shaky appearance, akin to something recorded on a smartphone, it would appear more credible.

While the technology hasn’t fully matured yet, generative video has made a remarkable leap from nonexistence to the level of Sora in just 18 months. We are on the verge of entering a realm where content will span fully synthetic creations, human-generated materials, and hybrids of both.

The OpenAI team intends to leverage the safety testing protocols it implemented last year for DALL-E 3. Sora is equipped with a filtering mechanism that screens all model prompts, preventing the generation of content that is violent, sexual, hateful, or depicts recognizable individuals. Additionally, another filter will review produced video frames, blocking any content that breaches OpenAI’s safety guidelines.

OpenAI is also modifying a fake-image detection tool designed initially for DALL-E 3 for use with Sora. Furthermore, the company plans to incorporate industry-standard C2PA tags into all Sora outputs, providing metadata explaining how an image was created. However, these measures are not infallible. The effectiveness of fake-image detectors can be inconsistent, and metadata can be quickly deleted. Moreover, most social media platforms automatically remove such metadata from images upon upload.

It is essential to gather feedback and deepen our understanding of the particular dangers tied to video content before contemplating wide distribution. By bringing Sora into public discussion now, OpenAI enables a collective effort to build the insights needed for the crucial groundwork of a safe introduction of text-to-video technologies. A cautious, open, and curious approach is imperative, and education will be an essential component in navigating these challenges.

This article was written by Rooz Aliabadi, Ph.D. (rooz@readyai.org). Rooz is the CEO (Chief Troublemaker) at ReadyAI.org

To learn more about ReadyAI, visit www.readyai.org or email us at info@readyai.org.


ReadyAI.org

ReadyAI is the first comprehensive K-12 AI education company to create a complete program to teach AI and empower students to use AI to change the world.