Late to the AI Art Party and Playing Catch-Up?
An intro to tools and concepts making text-to-video happen and the potential dangers of using it
With the public releases of DALL·E 2, Midjourney, and Stable Diffusion, 2022 is shaping up to be the year of visual synthetic media.
In late August this year, a tweet with a shockingly good text-to-video ML demo kept appearing on my Twitter feed. Turns out, it was a demo video showcasing a text-to-video editing feature, coming soon to RunwayML:
It’s such an intriguing video to me because, unlike a lot of the AI art and synthetic media floating around, this particular example is only partially synthetic.
For media to be considered synthetic, some or all of the piece has to have been generated by a machine, usually with artificial intelligence algorithms. This example was created by combining source footage of a man playing tennis on a clay court with a series of images (really, a video) generated by Stable Diffusion’s text-to-image ML model.
Body Segmentation Models Remove Backgrounds
In order to combine synthetic with non-synthetic media seamlessly, we need to find points on which to combine the two. In this particular example, the real footage of the tennis player was kept whereas the background was replaced with synthetic media.
To extract the tennis player, it’s likely that a real-time body segmentation neural network was used. Models shipped with libraries such as TensorFlow.js, trained on large datasets of body parts, can detect head, body, and hand poses. In January 2022, MediaPipe and TensorFlow.js combined forces to release a body segmentation model that lets you identify segments of the body much more accurately. With this information, you can remove the background in a way just as effective as chromakeying.
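If you’re curious what that looks like in practice, here’s a minimal sketch in Python using MediaPipe’s selfie segmentation solution. To be clear, I have no idea what RunwayML actually uses under the hood; the model choice, the 0.5 mask threshold, and the "tennis.mp4" filename are all just assumptions for illustration.

```python
import cv2
import numpy as np
import mediapipe as mp

# MediaPipe's selfie segmentation model: returns a per-pixel "person" mask
segmenter = mp.solutions.selfie_segmentation.SelfieSegmentation(model_selection=1)

cap = cv2.VideoCapture("tennis.mp4")  # placeholder source footage
while True:
    ok, frame = cap.read()
    if not ok:
        break

    # MediaPipe expects RGB, OpenCV reads BGR
    results = segmenter.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    # segmentation_mask is a float map: ~1.0 on the person, ~0.0 on the background
    person = results.segmentation_mask > 0.5

    # Swap the background for flat green, i.e. a virtual green screen
    green = np.zeros_like(frame)
    green[:] = (0, 255, 0)
    keyed = np.where(person[..., None], frame, green)

    cv2.imshow("keyed", keyed)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```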
It sounds complicated, but nowadays you don’t even have to know how to code in order to key out a subject.
New tools like RunwayML have background removers that allow you to simply use a brush tool to select your subject in one keyframe and watch its AI tooling isolate your chosen subject in subsequent frames.
Hell, even our everyday tools like Zoom have been getting better and better at doing this.
So now that we’ve demystified how to extract the complex movement of a character without chromakeying, on to the text-to-video background.
Stable Diffusion Generates Images from Text
The background imagery in the tennis player video was produced using Stable Diffusion, an ML model that generates images from text descriptions. After over 150,000 GPU-hours of training, it was finally released on August 22, 2022.
Unlike most other text-to-image ML models, Stable Diffusion is fully open source, released under the CreativeML Open RAIL-M license. It can be used both in the cloud and on your desktop, though it has pretty steep hardware requirements.
Like DALL·E 2 and Imagen (and, most likely, Midjourney), it is diffusion-based; specifically, it is an implementation of latent diffusion, which runs the diffusion process in a compressed latent space instead of directly on pixels.
To make a good text-to-image ML model, a lot of things have to work in conjunction.
You not only need an enormous dataset of labeled images on which to train, but also a good machine learning model to interpret those images and build an algorithm that can produce new images later. Then, you typically need to process the text fed into the prompt using natural language processing, to understand the essence of the request and what related concepts there might be.
So let’s see what we have here.
LAION-5B Dataset
Stable Diffusion is trained on subsets of the LAION-5B dataset, which contains over 5 billion CLIP-filtered image-text pairs in a variety of languages (the model itself trains on 512 × 512 images). The dataset is uncurated and rather democratized, so it’s recommended for research purposes only.
CLIP is a multimodal model created by OpenAI that combines knowledge of language concepts with semantic knowledge of images. That is, it has a generalized knowledge of individual words and what they may represent in pixels. If you feed it an image of a skyscraper and an image of a straw hut, it will be able to tell you with relatively high certainty that both are buildings.
Training on CLIP-filtered image-text pairs means that the dataset can easily be filtered, or searched, because CLIP can score how well an image matches a text prompt.
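To make that concrete, here’s a tiny sketch of CLIP scoring an image against a few candidate captions, using the publicly available ViT-B/32 checkpoint via Hugging Face’s transformers library. The image file and the captions are made-up examples.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("skyscraper.jpg")  # placeholder image
captions = ["a skyscraper", "a straw hut", "a building", "a tennis court"]

# Encode the image and every caption, then compare them in CLIP's shared space
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one similarity score per caption
probs = outputs.logits_per_image.softmax(dim=1)[0]
for caption, p in zip(captions, probs):
    print(f"{caption}: {p.item():.2%}")
```

Both “a skyscraper” and “a building” should score well for a photo of a skyscraper, which is exactly the kind of generalized word-to-pixels knowledge described above.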
This is a really great article about CLIP; I’d recommend reading it!
Latent Diffusion Model
If I told you, a human, to identify cathedrals in a series of photos, you might first think of characteristics, like type of architecture, spires, steeples, gargoyles, church bells, arches, statues of religious figures, etc. Maybe you’d remember a few specific ones, like Notre Dame, and compare pictures to that.
We could do that using a computer. We have plenty of ML models that can detect objects, but with this method it’s hard to capture the aesthetic and gist of a subject.
However, another way you might remember cathedrals is purely by their essence. Imagine losing your glasses and seeing what looks like it could be a cathedral in the far distance. You might still recognize it by the fuzzy shape, the color, the hustle and bustle of the town square, and other imprecise characteristics.
This is closer to what the Latent Diffusion Model does.
It takes an image and adds random noise until it is unrecognizable. Then it removes noise iteratively to get back to the source image, learning what exact parameters it had to use to remove the noise. This new knowledge is added to its algorithm.
After doing this process over an enormous dataset, the model ends up with a pretty elaborate algorithm, capable of taking any assortment of noise and de-noising it to create an image. The more it reverses the Gaussian noise, the sharper the image becomes.
I’m not sure about Stable Diffusion specifically (I didn’t get that far into the source code), but some text-to-image ML models actually use a different de-noising algorithm to refine the aesthetic after a certain threshold has been reached.
In the example above, a source image is given to a Latent Diffusion Model to interpret. It takes an MS paint-like drawing, and adds noise until it is nearly unrecognizable. From there, it revs up the complex algorithm and de-noises. “Oh, I’ve seen a cluster of pixels that look like this before, maybe this is the edge of the brown structure…oh and this green is reminiscent of these other ones in images labeled tree or water…”. At each step, it runs through the algorithm again, passing new parameters and refining its output.
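Here’s a toy sketch of just the “add noise” half of that process, to make the mechanics a bit more concrete. This is emphatically not Stable Diffusion’s actual code; in the real model, a neural network is trained to run these steps in reverse, predicting the noise to subtract at every step.

```python
import numpy as np

def forward_diffuse(image, steps=50, beta=0.02):
    """Mix an image with a little Gaussian noise at each step until it is
    unrecognizable. Returns the image at every step, clean to pure noise."""
    x = image.astype(np.float32)
    trajectory = [x.copy()]
    for _ in range(steps):
        noise = np.random.randn(*x.shape)
        # keep slightly less of the signal, add slightly more noise
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x.copy())
    return trajectory

# a stand-in 64 x 64 grayscale "image" with values in [0, 1]
image = np.random.rand(64, 64)
steps = forward_diffuse(image)
print(f"std at step 0: {steps[0].std():.2f}, std at step 50: {steps[-1].std():.2f}")
```

The de-noising model’s whole job is to learn the reverse walk: given the noisy image at step 50, 49, 48, and so on, predict what was added so it can be removed.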
When you do text-to-image, a neural network NLP model (a text encoder) usually interprets your prompt and feeds that into the ML model, which, remember, has been trained on an absurdly large dataset of not just image-text pairs but CLIP-filtered image-text pairs. It has developed a lot of context around the dataset it trained on.
From there, it starts with randomly generated noise and begins the de-noising process.
If you want to see a cool demo of this working, you can check out Hugging Face’s Latent Diffusion LAION-400M Demo. You can select at which step you would like to see your image.
Well, after learning all of this, it makes sense why this platform is called Stable Diffusion, doesn’t it?
You can check out Hugging Face to also see the Stable Diffusion implementation itself.
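If you’d rather run it locally, Hugging Face’s diffusers library wraps the released Stable Diffusion weights in a few lines of Python. A minimal sketch, assuming you have a CUDA GPU and have accepted the model license on the Hub (the prompt and output filename are just examples):

```python
import torch
from diffusers import StableDiffusionPipeline

# Downloads the v1-4 weights from the Hugging Face Hub on first run
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# num_inference_steps is the number of de-noising steps described above:
# fewer is faster but mushier, more is slower but sharper
image = pipe(
    "a tennis court in a lush jungle, digital art",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("background.png")
```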
Using Stable Diffusion to get to Text-to-Video
A video, down to its core, is just a series of images. I wasn’t able to find much about the original project the tennis guy is from, other than a RunwayML employee stating on Twitter that they’re working on a plugin that uses Stable Diffusion. That’s the beauty of open source!
Speaking of open source, there are plenty of projects, like nateraw/stable-diffusion-videos, already using Stable Diffusion to create videos:
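The rough idea behind many of these projects, in my own simplified sketch (this is not nateraw’s actual API), is to hold the prompt fixed and interpolate between two starting noise tensors, so consecutive frames de-noise into smoothly varying images:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Two different starting noises, pinned to fixed seeds for reproducibility.
# (1, 4, 64, 64) is the latent shape behind a 512 x 512 output.
shape = (1, 4, 64, 64)
start = torch.randn(shape, generator=torch.Generator().manual_seed(0))
end = torch.randn(shape, generator=torch.Generator().manual_seed(1))

prompt = "a tennis court in a lush jungle, digital art"  # placeholder prompt
for i, t in enumerate(torch.linspace(0, 1, steps=24)):
    # naive linear blend between the two noises (real projects use fancier
    # interpolation, like slerp, plus interpolation of the prompt embeddings)
    latents = ((1 - t) * start + t * end).to("cuda", dtype=torch.float16)
    frame = pipe(prompt, latents=latents).images[0]
    frame.save(f"frame_{i:03d}.png")
```

Stitch the saved frames together with ffmpeg or your editor of choice and you have a (trippy, wobbly) text-to-video clip.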
Let’s circle all the way back to our original example.
Now we know that with the power of a body segmentation neural network like BlazePose, we can isolate Magic Tennis Man from the rest of the shot, and with the power of Stable Diffusion’s implementation of latent diffusion, we can take a text prompt and generate any imagery.
With the power of open source, it’s no stretch of the imagination to see how we could find a library (or write our own) that uses Stable Diffusion to create a series of images from a text prompt. Perhaps we increase the size and resolution of the background and produce only one image, then use another ML model to interpolate between points as the tennis player moves; I’m not exactly sure! But this video is definitely more demystified now, and it certainly seems possible.
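As a final sketch of that last step, compositing is just an alpha blend of the real frame over the generated background, using the mask from the body segmentation step. The filenames are placeholders, and this is only one plausible pipeline, not necessarily the one used in the demo.

```python
import cv2
import numpy as np

frame = cv2.imread("player_frame.png")       # a real frame from the source footage
background = cv2.imread("background.png")    # generated earlier by Stable Diffusion
background = cv2.resize(background, (frame.shape[1], frame.shape[0]))

# The mask comes from the body segmentation step: white where the player is
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
mask = mask[..., None]  # broadcast the single channel over B, G, R

composite = mask * frame + (1.0 - mask) * background
cv2.imwrite("composite.png", composite.astype(np.uint8))
```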
Are there ethical ramifications?
Yes, there are. Hell. There are actually so many!
Starting around 2017, an AI-driven phenomenon called “deepfakes” took the world by storm. Using deep learning techniques like generative adversarial networks (GANs), users could train an ML model on images and videos of a person’s behavior and likeness, then create synthetic videos or images of that person performing an action.
This became extremely frightening when people started using it to create synthetic videos of politicians and celebrities making statements they never actually made. It became a hot topic in discussions about the ethics of AI-driven technology.
Text-to-video ML models raise a similar ethical question. On their own, they don’t seem too bad; making a convincing, full-fledged video with multiple subjects and a functioning foreground and background still seems a ways off.
However, if we combine them with body segmentation ML models, we can take any video of anyone and superimpose them onto a new environment with great ease. At its worst, people can pretend that they, someone else, or an entire event is somewhere it isn’t, and spread fake news or misinformation. It also spells danger for people working in VFX: what used to be a creative but painstaking endeavor can now be done in a fraction of the time, without flexing much of your own imagination.
Imagine a video of someone pretending to have been captured, used to start a war, or two videos of the same event seemingly happening in two different places. Which should you believe?
What’s scariest, though, is the idea of combining deepfakes with full-body segmentation and text-to-video. The potential for fear-mongering and spreading misinformation is astronomical. With this technology you could easily frame someone, which could make or break their career, or even their life. You could act out a crime, using their likeness in place of yours, and change the scenery completely without ever leaving your house. If the setting is fully generated, it won’t correspond to any real place, so there won’t be any witnesses who can say whether it did or didn’t in fact happen.
At what point of an investigation should you assume video evidence may be synthetic?
We were already taught to be wary and dubious of “news” we see in our social media feeds; will we now have to question every piece of media we encounter?
So…
As a technologist, I’m excited and intrigued by this technology. You can tell by how much I wanted to research it!
As a member of society, I am deeply concerned. I don’t trust people enough to not misuse it in the worst ways possible.
That said, technological progress waits for no one and will plough forward regardless of the consequences. Sometimes we just need to home in on our own personal experiences, set up healthy boundaries, and not try to take the weight of society on our shoulders.