One of the frames of the 4K AI-generated video. We can see that the AI interpreted the word “Satan” in a recognizable way.

I had an AI hallucinate over the text of the Bible — here’s how. [1/4]

A tutorial on how I used four neural networks to generate a 4K, 15-minute-long audiovisual piece — with only the text of the Bible as input

px studio

--

Preview of the final result of the tutorial

Introduction and general concept

1. The back story

One year ago, OpenAI published DALL-E, a paper on how to generate images from text. They released neither the code nor the model of the full architecture, because they thought it would be too dangerous to hand to the open world. What they did release, instead, was CLIP — the image-description part of it.

This paved the way for a revolution. The brilliant mind of Ryan Murdock used CLIP to create an open-source implementation that works very similarly to the original DALL-E paper: given an input text, a neural network capable of generating all kinds of images (e.g. BigGAN, VQGAN) is steered to produce something whose description is close to that input text.

2. How does it work?

To understand the underlying mechanics, let’s take the input text “A dog is transforming into a person”. The image generator neural network (for example, VQGAN) starts by generating a completely random, noisy image, such as:

Generation at step 1

This image is then fed to CLIP, which generates a similarity score between the picture and the input text. At the beginning this score is very low, because the image certainly doesn’t look like a dog becoming a person, does it? Nope. This similarity score is then used to optimize the image generation. In simple words, CLIP tells VQGAN: “do better!”, and VQGAN obeys, generating an image that, in the eyes of CLIP, is more similar to the input text.
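To make this concrete, here is a minimal sketch of how such a similarity score can be computed with the open-source CLIP package released by OpenAI. The file name and the prompt are just placeholders for illustration; in the real loop the image would be the generator’s current output.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image: in the real loop this is the generator's current output
image = preprocess(Image.open("step_1.png")).unsqueeze(0).to(device)
text = clip.tokenize(["A dog is transforming into a person"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image embedding and the text embedding
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).item()

print(f"CLIP similarity: {similarity:.3f}")  # low for a noisy image, higher as it improves
```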

This process is iterated several hundred times, and this is what we get along the way:

Generation at step 50
Generation at step 150
Generation at step 500

I stopped the generation at step 500, but you get the idea: over time, the generation gets closer and closer to a dog becoming a person. In a sense, we can say that this is what the AI imagines when thinking about the concept of dogs transforming into people. Pretty fascinating, right?
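If you prefer code to prose, the whole loop can be condensed into a few simplified lines. This is a rough sketch, not my actual implementation: `generator` and `latent_dim` stand in for VQGAN (or any latent-driven image generator), and `clip_similarity` for the scoring shown above.

```python
import torch

def clip_guided_generation(prompt_features, generator, clip_similarity,
                           steps=500, lr=0.1):
    # Start from a random latent vector: the first decoded image is pure noise
    latent = torch.randn(1, generator.latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=lr)

    for step in range(steps):
        image = generator(latent)                        # decode the latent into an image
        score = clip_similarity(image, prompt_features)  # how well does it match the prompt?
        loss = -score                                    # maximizing similarity = minimizing its negative
        optimizer.zero_grad()
        loss.backward()                                  # CLIP's feedback flows back into the latent
        optimizer.step()                                 # "do better!"

    return generator(latent)
```

Saving the intermediate image at each step is what produces the kind of snapshots shown above, with the picture slowly morphing toward the prompt.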

3. Applying the concept to sacred texts

One year ago, when I saw Ryan Murdock’s implementation, I asked myself: what would the AI think and imagine if I fed it the text of the Bible? I thought it would be really fascinating to see a machine trying to visualize and make sense of one of the most ancient and influential books in human history — a book so important that it shaped the concept of morality, of right and wrong, and influenced customs and traditions all over the world, while still holding incredible power over society today.

I took Ryan Murdock’s original implementation, modified it a bit (here’s the very messy code, beware), and used it to generate this:

It was one of the first text-based AI-generated videos ever created: it got quite popular on Reddit, and a couple of months later it was selected to be displayed at two media art festivals in Berlin. For these exhibitions I wanted to create a longer, higher-quality version of it — and I eventually generated the video you can see at the beginning of this article. In this tutorial I will show, step by step, how I got there.
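To give you a taste of what the later parts cover: conceptually, the idea is to split the source text into short passages, run each one through the generation loop sketched above, and stitch the resulting frames into a video. A purely illustrative sketch (the file name, the naive sentence splitting, and `generate_frame_for` are hypothetical placeholders, not my exact code):

```python
# Hypothetical sketch: turn the plain-text Bible into a list of prompts
with open("bible.txt", encoding="utf-8") as f:
    text = f.read()

# Naive split into sentences; the real pipeline may segment the text differently
prompts = [p.strip() for p in text.split(".") if len(p.strip()) > 20]

for i, prompt in enumerate(prompts):
    # Each passage drives one CLIP-guided generation, producing one frame
    frame = generate_frame_for(prompt)   # placeholder for the loop sketched earlier
    frame.save(f"frames/{i:05d}.png")

# The numbered frames can then be assembled into a video, e.g. with ffmpeg
```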

4. Conclusion

I hope you now have a clear idea of how the process works, and I hope you’re ready to generate your own video! If so, check out the second part of the tutorial here.

If you enjoyed this story and want to read more, follow me on Twitter at @p_x_studio, where I will post the next parts of the tutorial soon.

Finally, if you have any questions or comments, please drop me a message on Twitter or leave a comment here! I will be happy to answer :) A hug, and see you soon!

--


px studio

Pietro Bolcato aka px is a multidisciplinary new media artist exploring the relationship between society, algorithms, and art.