Meta’s new model can turn a text prompt into videos

Salvatore Raieli · Published in Geek Culture · Oct 4, 2022 · 6 min read

Make-A-Video, a new breakthrough in generative art

Image generated by the author using OpenAI’s DALL-E 2

A few months ago, the world was surprised by DALL-E 2, capable of turning a text prompt into an image in seconds. Last week, Meta unveiled its latest model, Make-A-Video, which can turn a text prompt into a short video of a few seconds.

Generative art is quickly exploding

DALL-E 2 was announced by OpenAI in April. The new model seemed capable of generating images from text with incredible accuracy.

Image source: DALL-E 2 original article

The results astonished the world, and it seemed that it would be quite some time before they could be surpassed. Instead, shortly thereafter, Google published two models capable of surpassing DALL-E 2: Imagen and Parti were released in quick succession, and they did not remain state-of-the-art for long either. In fact, a few months ago, Stable Diffusion was released.

In the months that followed, the generative field moved quickly, and many of these models were released as open source. Recently, DALL-E 2 was opened to the public (no more waiting list) and gained an outpainting capability, DALL-E mini was incorporated into Hugging Face, and even Photoshop can now work with Stable Diffusion.

All of these models take a short textual description as input and generate an image as output. Meta, on the other hand, has announced a model that can turn a text prompt into a short video.

Make-A-Video

Make-A-Video was announced last week. Meta researchers have also published a scientific paper (arXiv link) describing the model in detail. Meta has recently been working on generative models: earlier this year it released Make-A-Scene, a model capable of creating photorealistic illustrations using words, lines of text, and freeform sketches.

Brief technical description

Image source: Meta’s original article

The model is constructed from three main parts:

  • A base text-to-image model trained on text-image pairs. This part of the model is essentially the same as what we have seen so far in other text-to-image models (as the authors write in the article, it is largely the same as the one used for DALL-E 2).
  • Spatiotemporal convolution and attention layers that extend the network to the temporal dimension. The authors modified the convolution and attention layers so that the model can move from a single image to the temporal dimension without increasing the computational cost too much.
  • Spatiotemporal networks that allow high frame rates to be generated (this part of the network increases the number of frames in the video to obtain smoother motion).
Image from the original article: the architecture and initialization scheme of the pseudo-3D convolutional and attention layers, enabling the seamless transition of a pre-trained text-to-image model to the temporal dimension.
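To make the idea of pseudo-3D layers more concrete, here is a minimal sketch of how such a layer could look in PyTorch: a pre-trained 2D convolution handles the spatial dimensions frame by frame, and a 1D temporal convolution, initialized as an identity, is stacked on top so that the pre-trained text-to-image weights initially keep producing the same output. This is an illustrative sketch of the general technique, not Meta’s actual implementation; the class name and tensor layout are my assumptions.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Simplified sketch of a pseudo-3D convolution: a (pre-trained)
    2D spatial conv followed by a 1D temporal conv initialized as an
    identity, so adding it does not change the output of the original
    text-to-image network at the start of video training."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Spatial convolution (would come pre-trained from the T2I model)
        self.spatial = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2)
        # Temporal convolution over the frame axis
        self.temporal = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)
        # Identity initialization: centered Dirac kernel and zero bias
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Apply the spatial conv to every frame independently
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        # Apply the temporal conv to every spatial position independently
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x  # (batch, channels, frames, height, width)
```

Because the temporal convolution starts as an identity, the layer initially behaves exactly like the original image model applied to each frame; the temporal weights then learn motion during video training.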

As a dataset, they used 2.3 billion images (a subset of the LAION-5B dataset, a huge collection of images with associated text). The authors describe how they filtered out images paired with toxic words, as well as images with a watermark probability larger than 0.5. They also used WebVid-10M and HD-VILA-100M to train the video generation part of the model.
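As a rough illustration of this kind of data filtering (the column names, such as pwatermark, and the toy blocklist are my assumptions, not the authors’ actual pipeline), the metadata of an image-text dataset could be cleaned like this:

```python
import pandas as pd

# Hypothetical metadata for an image-text dataset; the column names
# (e.g. "pwatermark") are assumptions for illustration, not the authors' schema.
metadata = pd.DataFrame({
    "url": ["http://example.com/a.jpg", "http://example.com/b.jpg"],
    "caption": ["a dog running on the beach", "some unwanted caption"],
    "pwatermark": [0.1, 0.8],  # estimated probability that the image is watermarked
})

# Toy blocklist standing in for a real toxic-word filter.
blocklist = {"unwanted"}

def is_clean(row) -> bool:
    # Drop samples whose caption contains a blocked word.
    if any(word in blocklist for word in row["caption"].lower().split()):
        return False
    # Drop samples with watermark probability above 0.5, as described in the paper.
    return row["pwatermark"] <= 0.5

filtered = metadata[metadata.apply(is_clean, axis=1)]
print(f"kept {len(filtered)} of {len(metadata)} samples")
```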

T2V generation examples. Image source: original article

The authors explain that using a dataset of labeled images taught the model what objects are called and what they look like, while the video data allowed it to understand how these objects move in the world. This way of training the model proved effective in teaching it how to generate videos from text.

Learning world dynamics from orders of magnitude more videos using unsupervised learning helps researchers break away from the reliance on labeled data. The presented work has shown how labeled images combined effectively with unlabeled video footage can achieve that. — Original article
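Schematically, this means alternating supervised image-text batches with unlabeled video batches during training. The toy loop below only illustrates that alternation; the model, losses, and data are placeholders of mine and not the actual Make-A-Video objective, which is diffusion-based.

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Stand-in model with two heads, just to make the loop runnable.
    The real objectives are diffusion losses, not these placeholders."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)

    def image_loss(self, images, captions):
        # Placeholder for the supervised image-text objective
        # (captions are unused in this toy version).
        return self.backbone(images).pow(2).mean()

    def video_loss(self, clips):
        # Placeholder for the unsupervised temporal objective on raw video.
        return self.backbone(clips.mean(dim=1)).pow(2).mean()

model = ToyGenerator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Alternate supervised image-text batches with unlabeled video batches:
# labeled images teach appearance and naming, videos teach motion.
for step in range(4):
    optimizer.zero_grad()
    if step % 2 == 0:
        images = torch.randn(8, 16)        # fake image features
        captions = ["a dog runs"] * 8      # fake captions
        loss = model.image_loss(images, captions)
    else:
        clips = torch.randn(8, 5, 16)      # fake clips: batch x frames x features
        loss = model.video_loss(clips)
    loss.backward()
    optimizer.step()
```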

What can the model do?

Make-A-Video lets you bring your imagination to life by generating whimsical, one-of-a-kind videos with just a few words or lines of text. — Meta Make-a-Video website

Meta briefly described the capabilities of its new model:

  • Generate video from text: given a short text prompt, the model generates a video of a few seconds.
  • Add motion to images: given a single input image, the model returns a short video.
  • Create variations of a video: a user can provide a short video and generate variations of it.

The authors of the article also describe what the next steps will be:

As a next step we plan to address several of the technical limitations. As discussed earlier, our approach can not learn associations between text and phenomenon that can only be inferred in videos. How to incorporate these (e.g., generating a video of a person waving their hand left-to-right or right-to-left), along with generating longer videos, with multiple scenes and events, depicting more detailed stories, is left for future work. — Original article

In fact, some of the videos are quite strange: both the motion and the details are far from perfect. The movements are not very fluid, and sometimes the objects interact in a surreal way (as if the model were unable to understand the boundaries of the objects and how they are supposed to interact).

In addition, the researchers tried to eliminate images with associated toxic words from the training dataset. The decision to use public datasets was also motivated by transparency about the training process, so that anyone can check the data used. Nevertheless, the authors state:

As with all large-scale models trained on data from the web, our models have learnt and likely exaggerated social biases, including harmful ones. — Original article

Parting thoughts

Make-A-Video is just the beginning: as seen in other cases, the study of multi-modal models is a trend for the coming years. Several models capable of generating high-quality images from text have come out within a year, so it was only a matter of time before it became possible to generate short videos.

Although it was expected, this model presents several technically interesting solutions, particularly those aimed at reducing the computational cost of generating several frames per prompt.

As has been said of previous models, there are still several unresolved ethical issues with these generative models. The first is that the datasets were assembled using web crawlers, and they therefore contain works by many artists without permission and without acknowledgment. The second is that they often contain biases and images with toxic connotations. The first version of OpenAI’s DALL-E contained several biases that the authors had difficulty correcting.

These models could potentially be dangerous for deep-fake video generation, which is why Meta added a watermark to all generated videos.

Meta says this model opens up new opportunities for creators and artists. The results at the technical level are encouraging, although this first model is limited to short clips. For now, Meta has no plans to make it public until the potential risks are mitigated.

If you have found it interesting:

You can look for my other articles, subscribe to get notified when I publish new ones, and connect with or reach me on LinkedIn. Thanks for your support!

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

Or feel free to check out some of my other articles on Medium:
