Detailed Guide on OpenAI's Sora

Ikarus3D
3DVerse
12 min read · Feb 16, 2024


One name that has become synonymous with terms like AI and generative models is OpenAI. Since the company's founding in December 2015, it has been at the forefront of AI research and the development of robust, intricate AI models.

The company has stayed in the headlines for the past few months for all the wrong reasons, with the drama revolving around Sam Altman and Greg Brockman. For a while, it looked like the company might collapse under management disagreements and a flurry of legal battles.

OpenAI's latest creation, Sora

But things have since settled down: Sam Altman is back in full swing, Greg Brockman has been reinstated, and Microsoft has been given a nonvoting board seat.

They also got a moment of tremendous relief this week, as the courts handed them a partial win against authors' copyright claims, though the case is ongoing, with many developments still to come.

But yesterday, to shut out all the critics and doubters, OpenAI officially added a new product to its lineup of AI models: Sora. OpenAI's mission is to train models that help people solve problems requiring real-world interaction, and Sora is only going to boost that cause.

Past AI Models By OpenAI

Before jumping into Sora, let's look at OpenAI's two other AI models deployed globally for public use: ChatGPT and DALL·E, both top-of-the-line in their respective fields. With both giving their competitors a run for their money, Sora is expected to establish itself as a key player in the market as well.

ChatGPT

ChatGPT launched on November 30, 2022. It was OpenAI's first high-stakes product launch, and it would go on to shape the colossal entity OpenAI is today. ChatGPT is a chatbot built on a large language model (LLM) that responds to user prompts, and it is available in over 182 countries.

The most advanced and leading chatbot, ChatGPT

To give you a sense of how much computation and effort went into building a robust, advanced AI model like GPT, let's crunch some numbers:

  • The GPT-3 model has 175 billion parameters, while the GPT-4 model reportedly has more than 1 trillion.
  • ChatGPT was trained on roughly 300 billion words.
  • That training data amounts to about 570 gigabytes of text.

People loved ChatGPT so much that it reached its first million users within five days of launch, making it the second-fastest platform ever to hit the million-user milestone. It held the crown until Meta launched Threads some months later, which shattered every record in the book by registering a million users within two hours of launch.

64.53% of ChatGPT users are in the 18 to 34 age group, indicating that the young workforce is already deploying AI solutions in their day-to-day lives to streamline their productivity and output. With ChatGPT reportedly set to mint a sweet $1 billion in revenue by 2024, the growth and craze show no sign of plateauing soon.

Within two months of its launch, ChatGPT had over 100 million active users, which pushed the company's valuation to $29 billion.

Today, ChatGPT has 180.5 million monthly active users and well over 100 million weekly users.

The gamble paid off for the team behind the iconic chatbot, and most of the competition has been playing catch-up ever since: Google launched Bard, Microsoft launched Copilot (itself based on GPT-4), and so on.

Two versions of ChatGPT are available as of this writing: GPT-3.5 and GPT-4. Someone unfamiliar with the bot might think 4 is just a slightly better version of 3.5, but oh boy, are they wrong! The problem with GPT-3.5 is that even though it was trained on information from across the internet, it can't access the live internet.

If you ask GPT-3.5 about today's stock trends or any news from the current day, it can't answer, since its knowledge only extends to January 2022.

But it all changed on March 14, 2023, with the launch of GPT-4: GPT could now access the WHOLE WIDE INTERNET, for a sweet price tag of $20/month.

Gone is the limitation of not being able to stay up to date. With the addition of internet access, GPT has become a force to be reckoned with. You can now also upload images, docs, and more to the GPT chat box and generate the results you want.

DALL·E

DALL·E is a text-to-image generator made by the gigaminds at coordinates 37.76227598464887, -122.41462482623245 (OpenAI HQ). DALL·E uses deep learning to generate digital visuals from natural-language descriptions, or "prompts." There are three iterations: DALL·E 1, DALL·E 2, and DALL·E 3.

DALL·E

DALL·E is a multimodal implementation of GPT-3 with 12 billion parameters that "swaps text for pixels," trained on text–image pairs from the internet.

Revealed by OpenAI in a blog post on 5 January 2021, it uses a modified version of GPT-3 to generate images.

On 6 April 2022, OpenAI announced DALL·E 2, a successor designed to generate more realistic images at higher resolutions that “can combine concepts, attributes, and styles.”

On 28 September 2022, DALL·E 2 was opened to everyone, and the waitlist requirement was removed. In mid-September 2023, OpenAI announced its successor, DALL·E 3, capable of understanding "significantly more nuance and detail" than previous iterations.

Let's crunch DALL·E's numbers too:

  • DALL·E boasts an active user base of more than 1.5 million individuals.
  • DALL·E produces about two million images per day.
  • According to a large 2024 survey, 70,000+ online businesses and brands are using DALL·E to generate unique new imagery.
  • DALL·E was trained on approximately 650 million image–text pairs scraped from the internet.

What is Sora?

As OpenAI puts it on the official website, "Sora is an AI model that can create realistic and imaginative scenes from text instructions." As for our personal opinion: to say that we are excited would be a vast understatement.

From the moment you land on Sora's page, you see beautiful paper airplanes flying in flocks just like wild parakeets do, and then, at the bottom, you read a line: "All videos used in this page were made using Sora." That is when reality sets in, and you have to sit back and take in this model's power and capabilities.

Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user's prompt. The rest of the text-to-video competitors are far behind where Sora is right now, and it will only get better from here on.

As with ChatGPT, prompts are key here: you must give precise, concise prompts to get the most out of Sora.

Let us look at an example shown on the official website:

Several giant wooly mammoths approach, treading through a snowy meadow. Their long wooly fur lightly blows in the wind as they walk, snow-covered trees and dramatic snow-capped mountains in the distance; mid-afternoon light with wispy clouds and sun high in the distance creates a warm glow; the low camera view is stunning, capturing the sizeable furry mammal with beautiful photography, depth of field.

And this was the output. Isn't it amazing?

https://cdn.openai.com/sora/videos/wooly-mammoth.mp4

As one can see, it is still not a fully developed product, but it is 80% there. With some tweaking, it could become a convincing, functional end product.

As stated on the official website, it has some weaknesses: it may struggle with accurately simulating the physics of a complex scene and may not understand specific instances of cause and effect.

For example, a person might bite a cookie, but afterward, the cookie may not have a bite mark. The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.

How does it work?

The main highlight of Sora is how it interprets the user's prompt and accounts for how the subject and surrounding environment would behave in the real world. It builds on the learnings of advanced models like GPT and DALL·E to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background.

Sora can also create multiple shots within a single generated video while keeping characters and visual style consistent throughout.

If you read the official technical report, you will find that Sora is a diffusion model: it generates a video by starting from static noise and gradually transforming it into the final clip, as directed by the prompt, by removing the noise over many iterations.
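To make that concrete, here is a minimal, heavily simplified sketch of the diffusion-sampling idea in Python. The denoiser, noise schedule, and tensor shapes are our own illustrative assumptions, not Sora's actual implementation:

```python
import torch

def denoiser(noisy_video, t, prompt_emb):
    """Stand-in for a trained denoising network (Sora's model is not public).
    A real model would predict the noise present in `noisy_video` at
    diffusion step `t`, conditioned on the text prompt embedding."""
    return 0.1 * noisy_video  # dummy prediction so the sketch runs

def sample_video(prompt_emb, steps=50, frames=60, ch=3, h=64, w=64):
    # Start from pure static noise covering ALL frames at once; denoising
    # every frame jointly is part of what keeps subjects consistent over time.
    video = torch.randn(frames, ch, h, w)
    for t in reversed(range(steps)):
        pred_noise = denoiser(video, t, prompt_emb)
        # Peel away a little predicted noise each iteration (real samplers
        # follow a derived noise schedule, not a constant step like this).
        video = video - pred_noise / steps
    return video

clip = sample_video(prompt_emb=torch.randn(512))
print(clip.shape)  # torch.Size([60, 3, 64, 64])
```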

Sora generates entire videos all at once, and it can even fill in missing frames or extend a video before its start or after its ending, integrating seamlessly without breaking the flow.

Generative art has always had a consistency problem: it is random, which is the last thing you want in a video, where the subject and main elements must stay the same throughout. Sora tackles this by giving the model foresight of many frames at a single time.

https://cdn.openai.com/sora/videos/mitten-astronaut.mp4

Like its already successful siblings, Sora uses a transformer architecture, unlocking superior scaling performance. To simplify the process further: Sora represents each video and image as a collection of smaller units of data called patches, each analogous to a token in GPT. By unifying how visual data is represented, OpenAI can train diffusion models on a much larger range of visual information, spanning different durations, aspect ratios, and resolutions.
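As a rough illustration of what "patches" means, here is a sketch that chops a video tensor into spacetime patches, much as a tokenizer chops text into tokens. The patch dimensions are our own assumptions for illustration; OpenAI has not published Sora's exact values:

```python
import torch

def to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video of shape (frames, channels, height, width) into
    spacetime patches, the visual analogue of text tokens."""
    f, c, h, w = video.shape
    # Carve the clip into blocks of pt frames x ph x pw pixels each.
    blocks = video.reshape(f // pt, pt, c, h // ph, ph, w // pw, pw)
    blocks = blocks.permute(0, 3, 5, 1, 2, 4, 6)  # group block indices first
    # Flatten each block into one vector: one "token" per patch.
    return blocks.reshape(-1, pt * c * ph * pw)

video = torch.randn(16, 3, 256, 256)   # a tiny dummy clip
tokens = to_spacetime_patches(video)
print(tokens.shape)                     # torch.Size([1024, 3072])
```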

The technical report also notes that Sora uses the re-captioning technique from DALL·E 3, which generates highly descriptive captions for the visual training data. This helps the model produce outputs that stick closely to the prompt given by the user.
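A heavily simplified sketch of what that data pipeline could look like follows. The `caption_model` function is a hypothetical placeholder, since OpenAI has not released its captioner:

```python
# Hypothetical re-captioning pipeline: pair every training clip with a
# rich, machine-written caption instead of its sparse original alt text.
def caption_model(video_path: str) -> str:
    # Placeholder for a trained video captioner (not publicly available).
    return "A detailed description of the clip at " + video_path

def build_training_pairs(video_paths: list[str]) -> list[tuple[str, str]]:
    # Each (video, descriptive caption) pair becomes one training example
    # for the text-to-video diffusion model.
    return [(path, caption_model(path)) for path in video_paths]

for path, caption in build_training_pairs(["clip_0001.mp4", "clip_0002.mp4"]):
    print(path, "->", caption)
```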

Also, one more AMAZING thing: Sora can make a video from an image input, too.

Technical Analysis of Sora

  • Sora is a generalist model of visual data: it can generate videos and images of varying durations, aspect ratios, and resolutions, up to a full minute of high-definition video.
  • Sora takes inspiration from large language models, which acquire broad capability by training on internet-scale data. Part of what makes LLMs so successful is their use of tokens, which unify diverse modalities of text: code, math, and many natural languages.
  • For Sora, engineers use visual patches, which serve the same function tokens do for LLMs. Prior work with patches has shown exciting results for visual model training; the main advantages are scalability and effectiveness, especially for models dedicated to images and videos.
  • To keep computation manageable, the model first compresses high-quality videos into a lower-dimensional latent space and then decomposes that representation into patches.
  • Sora's sampling flexibility is also notable: it can sample widescreen 1920x1080 videos, vertical 1080x1920 videos, and everything in between. This lets Sora create content for different devices directly at their native aspect ratios.
  • Sora is a diffusion model: given noisy input patches (and conditioning information like text prompts), it is trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer; see the sketch after this list.
  • Native resolution also plays an essential role: past approaches to image and video generation typically resize, crop, or trim videos to a standard size, but according to the makers, training on data at its native size provides several benefits.
  • OpenAI also states that training text-to-video generation models requires a large amount of data, particularly visual data with matching text captions, which is why the re-captioning technique introduced in DALL·E 3 is applied.
  • First, a highly descriptive captioner model is trained; it is then used to produce text captions for all videos in the training set, which again caught our attention.
  • There are still many technicalities we would like to discuss, but we will hold off until more documentation comes out and we have more clarity.
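Pulling those bullets together, here is a rough sketch of how the pieces might compose: compress to a latent, patchify, then have a transformer predict clean patches from noisy ones. Every module and size below is a simplified stand-in of our own; OpenAI has not published Sora's architecture details:

```python
import torch
import torch.nn as nn

class ToyDiffusionTransformer(nn.Module):
    """Minimal stand-in for a diffusion-transformer-style denoiser: a
    transformer that maps noisy latent patches to predicted "clean" patches,
    conditioned on a text embedding. All sizes are arbitrary illustrations."""
    def __init__(self, patch_dim=3072, d_model=256, text_dim=512):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)    # patch -> model space
        self.cond = nn.Linear(text_dim, d_model)      # project text prompt
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.unembed = nn.Linear(d_model, patch_dim)  # back to patch space

    def forward(self, noisy_patches, text_emb):
        x = self.embed(noisy_patches) + self.cond(text_emb)  # add conditioning
        return self.unembed(self.blocks(x))  # predicted clean patches

# One hypothetical denoising step over latent spacetime patches.
model = ToyDiffusionTransformer()
noisy = torch.randn(1, 1024, 3072)  # (batch, num_patches, patch_dim)
text = torch.randn(1, 1, 512)       # prompt embedding, broadcast over patches
print(model(noisy, text).shape)     # torch.Size([1, 1024, 3072])
```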

Limitations

As stated above and acknowledged by OpenAI, there are still some limitations. This is the first launch, and with each iteration it will only keep improving. Some of the key limitations are:

  • Sometimes it creates motion that is physically implausible or unrelated to the scene.
  • Entities can spontaneously appear in scenes with many subjects and moving characters.
  • Unnatural object morphing is also an issue.
  • Sora generally fails miserably at complex physics interactions, especially with fluids and fragile materials like glass and ceramics.
  • It has trouble showing emotion on faces or the natural behavior of a subject.

https://cdn.openai.com/sora/videos/puppy-cloning.mp4

Implications of Sora

The implications of this technology are immense, but here are a few of the industries that, in our view, will benefit firsthand.

Video editing

One of Sora's critical features is that it can add missing frames and seamlessly extend existing videos. That makes life easier for video editors: they no longer need to reshoot a scene just for one shot that matters for the smooth flow of the video but not for the story, or for similar use cases.

Stock Videos

Companies that earn a fortune providing stock videos should be scared. Anyone who needs a specific video can now get exactly what they want, rather than settling for less because the footage is too niche to exist.

Content creation

Film production and the entertainment industry will be exciting avenues. Can you imagine the implications? Someone could make a spinoff show of Dwight Schrute and Jim Halpert's adventures, or a short movie, just by putting in a detailed prompt. And VFX could come closer to the real thing than ever.

One workflow already taking shape: write a script with ChatGPT, then use Sora to visualize those ideas.

Gaming

With the future of gaming speculated to shift to cloud gaming one day, just imagine this scenario: an open-world God of War with procedural generation and the highest level of graphics. Too good. Also, if graphics and scenes could be made using Sora, game development would become much easier, and we wouldn't have to wait three lifetimes for GTA 7.

Visualization

The last and most critical implication of Sora, usable from the day it launches, is visualization: we now have a medium to transform our imagination into visuals. Anything you want, however you like it, is just one prompt away.

What are people saying about Sora?

For some, it is the next giant leap in AI; they are calling it AI 3.0. For others, it is just a sham. We looked around on forums, talked to a few people on Discord, and found some interesting takes for you.

One Redditor says, “I think a tool as powerful as ChatGPT has just been invented here. Personally, I’ve had reservations about AI learning from copyrighted material.

However, I feel that we might be on the precipice of a series of changes so significant that governments that do not allow AI to train off of copyrighted material, as the government of Japan has done, may fall prey to a process of natural selection, putting those governments and nations which do allow this technology to flourish 20 years ahead of everyone else." More here.

Another person reminded us of how exponentially quickly AI is growing. This is what he had to say: "I'm amazed at the sheer rate that AI has changed. 5 years ago (2019, I was a lot younger) was probably the first time I heard about AI. It was on a Guinness World Records episode about this robot that you could interact with, which incorporated AI. It was mind-blowing at the time, but looking back, the stuff it was doing would be seen as mundane in the scope of today." Read the full thread here.

Here is what we found:

  • People are excited about this new venture as we could witness the first steps of something game-changing.
  • People are skeptical about the data sources and question the moral and ethical line.
  • The main question is: is this the zero point on our timeline, the moment AI truly started taking jobs?

Where does the moral compass lie?

Obviously, in an age where the #NoAI movement is in full swing, something like Sora is exactly the kind of technology that movement was formed to oppose. However, OpenAI has said it is working with red teamers to ensure no ethical or intellectual-property concern is ignored.

It is also working on tools to help detect misleading content, such as a detection classifier that can tell when a video was generated by Sora.

They also use the same data-protection and safety methods currently used in DALL·E.

OpenAI has updated its usage terms and blocks inputs that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others.

They also have image-classification methods that judge every rendered frame against those policies, so that no generation terms are breached.
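In sketch form, that per-frame gate might look like the snippet below. The classifier and policy labels are entirely hypothetical placeholders; OpenAI has not published how its safety stack works:

```python
# Hypothetical per-frame moderation gate; not OpenAI's actual pipeline.
BLOCKED_LABELS = {"extreme_violence", "sexual_content", "hateful_imagery"}

def classify_frame(frame) -> set:
    # Placeholder for a trained image-policy classifier.
    return set()  # dummy: flags nothing, so the sketch runs end to end

def video_passes_policy(frames) -> bool:
    # Reject the whole render if any single frame trips a blocked label.
    return all(not (classify_frame(f) & BLOCKED_LABELS) for f in frames)

print(video_passes_policy(frames=[object()] * 60))  # True with the dummy classifier
```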

OpenAI also stated that it is engaging with educators, artists, and policymakers to understand their concerns and to identify positive use cases.

Conclusion

Sora will likely be adopted widely thanks to its unique text-to-video offering, but in the end, it is fair to say that it is simply too early to speculate. Of course, the first showcases are stunning, but we need to know how easily we can reproduce those results ourselves.

This is undoubtedly a turning point for the AI industry, as OpenAI's three AI models together will definitely change many things.



Ikarus3D, Editor for 3DVerse

We at Ikarus 3D are leaders in AR, VR, and 3D modeling, excelling in creating lightweight 3D models and refining 3D scans. https://ikarus3d.com