New AI: Next Level Text to Video! Beyond Text to Image
Before we delve into all the juicy parts of this article, if you would like to watch the video version instead, you can click on the link above.
Text-to-image generation has been all over the internet and the news. Some think the media attention is a distraction for the AI community, while others have started to anthropomorphize these models, assigning them traits like creativity. There is no denying, however, that this technological milestone will transform visual arts forever. A natural progression from image generation is video generation, and many people are excited about it. In this article, we’ll discuss some AI models that generate videos from text prompts. Imagine putting the entire text of the Harry Potter series into such a model, or maybe a Game of Thrones fan fiction, obviously rewriting the last two seasons since we know how that ended. But before we get ahead of ourselves, let’s discuss the current state of text-to-video generation.
Current models are very limited in video generation, mostly because of two important constraints. First, unlike images, videos consist of many frames per unit of time, measured in frames per second (fps). Training on video data therefore demands far more computation than training on images, which is a major obstacle. The second issue is the lack of accurate datasets. VATEX, the largest multilingual video description dataset, contains only about 41,250 videos and 825,000 captions in English and Chinese. The datasets available for video synthesis are either tailored to very narrow domains, or their captions do not accurately describe what happens in the frames at specific times. Now that we know the challenges facing text-prompted video generation, let’s dive into the models that have been released so far.
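To get a feel for the computational constraint, here is a rough back-of-the-envelope sketch in Python. The clip settings match the CogVideo output discussed below, and the sizes are for raw, uncompressed RGB data; real training costs also depend on model size and compression, but the frame multiplier alone shows why video is so much heavier than images:

```python
# Rough, back-of-the-envelope comparison of raw (uncompressed) data sizes
# for one image versus a short clip at CogVideo-like settings.

width, height, channels = 480, 480, 3    # one RGB frame, 1 byte per channel
fps, seconds = 8, 4                      # 8 frames per second for 4 seconds

image_bytes = width * height * channels  # a single 480x480 image
clip_frames = fps * seconds              # 32 frames in total
clip_bytes = image_bytes * clip_frames   # the model must now handle 32x the data

print(f"one 480x480 image: {image_bytes / 1e6:.2f} MB")  # ~0.69 MB
print(f"4 s clip at 8 fps: {clip_bytes / 1e6:.2f} MB")   # ~22.12 MB
```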
We’ll begin with CogVideo. With 9.4 billion parameters, CogVideo is the largest pre-trained model for text-to-video generation across multiple domains. It uses what the authors call a multi-frame-rate hierarchical training technique: the model first generates keyframes at a low frame rate and then recursively fills in the frames between them. It produces clips at a resolution of 480 by 480 pixels, at 8 frames per second, for a total of 4 seconds. I hope I didn’t lose you there with too much technical detail. If you have any questions, please leave them in the comments section below and I’ll do my best to answer. You can tell the project is in its infancy, and there are several directions it can take to improve the generated videos. For now, the model only takes Chinese as input, so English text has to be translated first. I’ll share the interactive links in the resources section below so you can play around with it.
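To make the hierarchical idea more concrete, here is a minimal, purely illustrative sketch of the keyframes-then-interpolation control flow. The functions generate_keyframes and interpolate_between are hypothetical stand-ins, not CogVideo’s actual API, and the random and blended pixels below are placeholders for what the real model would predict:

```python
import numpy as np

# Hypothetical stand-ins for the two stages of multi-frame-rate hierarchical
# generation. These are NOT CogVideo's real API; they only mimic the control flow.

def generate_keyframes(prompt: str, n_frames: int, size: int = 480) -> np.ndarray:
    """Stage 1: produce a short, low-frame-rate clip from the text prompt."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.integers(0, 256, (n_frames, size, size, 3), dtype=np.uint8)

def interpolate_between(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Stage 2: synthesize a frame between two neighbors (here, a naive blend)."""
    return ((a.astype(np.uint16) + b.astype(np.uint16)) // 2).astype(np.uint8)

def hierarchical_generate(prompt: str, keyframes: int = 5, passes: int = 2) -> np.ndarray:
    """Generate keyframes, then recursively densify toward a higher frame rate."""
    frames = list(generate_keyframes(prompt, keyframes))
    for _ in range(passes):  # each pass roughly doubles the frame rate
        denser = []
        for a, b in zip(frames, frames[1:]):
            denser += [a, interpolate_between(a, b)]
        denser.append(frames[-1])
        frames = denser
    return np.stack(frames)

clip = hierarchical_generate("a lion drinking water", keyframes=5, passes=2)
print(clip.shape)  # (17, 480, 480, 3): 5 keyframes densified twice
```

This is how a handful of text-conditioned keyframes becomes a smoother clip without the model ever generating the full-rate sequence in one pass.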
Next, we have NUWA-Infinity, a collaborative effort between Microsoft Research Asia and Peking University. The project’s goal is to produce high-resolution images and longer-form videos, and the model can perform several visual synthesis tasks with very high quality. Let’s discuss a few of them. Given an image as input, the model can extend it into a very high-resolution image of 38,912 by 2,048 pixels! That’s about 19 times the width of the original input. The content of the produced image shows new additions that still fit the context of the original version. The model can also produce high-resolution animation from just a simple input image. Finally, it can also perform text-to-video tasks with amazing quality; check out the incredible Peppa Pig video generated by the model on the project website.
Perhaps you can start writing different Peppa Pig scripts for your children or younger siblings so they have an endless supply of episodes. One issue with this model, however, is that the datasets used to train it are narrow in scope, so it can’t generalize well to other types of videos. This will surely change soon, and we’re definitely keeping an eye on it.
There are other video synthesis projects that are not necessarily text-to-video but still deserve an honorable mention. The first is Transframer from DeepMind, which can generate 30 seconds of video from a single input image. It’s great for video prediction and view changes. The resolution is not the highest, but that may be improved in later iterations. Another state-of-the-art model comes from Nvidia Labs: a video generation model that accurately replicates object motion, changes in camera angle, and evolving content. The model creates new video content and addresses the issue of long-term inconsistency, where scenes change unrealistically between time frames, for example clouds moving back and forth in an unnatural manner. Moving on, Runway is a video editing platform that has announced plans to extend its editing capabilities with text prompts that change scenes. For example, the link below shows a video whose background changes based on the description entered.
https://twitter.com/runwayml/status/1568220303808991232
You can see the background doesn’t show any dynamic objects, nor does the text generate a completely new video from scratch. Future updates may incorporate more complex editing, but for now this is still very impressive. Finally, many creatives are generating videos by interpolating between several text-to-image frames. By generating intermediate frames and transitioning smoothly between images, they produce mesmerizing animations; a simple version of this idea is sketched below.
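Here is a minimal sketch of that interpolation trick using pixel-space cross-fading, assuming you already have two text-to-image outputs saved as frame_a.png and frame_b.png (hypothetical filenames). Note that many creators actually interpolate in the diffusion model’s latent space instead, which morphs the content rather than just fading pixels, but the frame-generation loop looks similar:

```python
from PIL import Image

def crossfade(path_a: str, path_b: str, steps: int = 24) -> list[Image.Image]:
    """Return a list of frames blending linearly from image A to image B."""
    a = Image.open(path_a).convert("RGB")
    b = Image.open(path_b).convert("RGB").resize(a.size)
    # alpha runs from 0.0 (all A) to 1.0 (all B) across the sequence
    return [Image.blend(a, b, alpha=i / (steps - 1)) for i in range(steps)]

frames = crossfade("frame_a.png", "frame_b.png", steps=24)
frames[0].save(
    "morph.gif", save_all=True, append_images=frames[1:],
    duration=1000 // 12, loop=0,  # 24 frames at ~12 fps: a two-second transition
)
```

Chain this across a whole sequence of generated images and you get the hypnotic animations currently circulating online.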
You can see we are on the cusp of innovative AI video generation tools. How many months or years do you think we have until we see high-quality, long-form videos generated from text or images? Let me know your predictions in the comments section below. Although this is all exciting news for the future, we can’t forget to mention the possible negative societal impacts these sorts of video generation tools may have. As with other types of generation tasks, the potential for misinformation and disinformation remains a huge hurdle for the artificial intelligence community. Copyright issues will also be brought to the forefront, and laws will have to be updated to reflect the changing times. I can’t imagine Larry and Jerry being particularly happy about seeing AI-generated episodes of their beloved Seinfeld sitcom. We might still be several years from this scenario, but it doesn’t hurt to start thinking about these possibilities. On the other hand, this technology opens up endless opportunities for creativity, where anyone can generate videos from scripts for educational, marketing, or entertainment purposes. How would ad agencies or streaming services like Disney adapt to this? I guess we’ll have to wait and see.
Thanks for reading this article. I hope you learned something today.
Thank You!
Resources:
CogVideo
Github: https://github.com/THUDM/CogVideo
Demo Website: https://models.aminer.cn/cogvideo/
HuggingFace Space: https://huggingface.co/spaces/THUDM/CogVideo
Paper: https://arxiv.org/pdf/2205.15868.pdf
NUWA-Infinity
Website: https://nuwa-infinity.microsoft.com/#/
Paper: https://arxiv.org/abs/2207.09814
Others:
Transframer: https://sites.google.com/view/transframer
Nvidia Labs: https://www.timothybrooks.com/tech/long-video-gan/
Runway: https://runwayml.com/