OpenAI’s DALL-E and CLIP-guided Text-to-Image Generation, Explained
Note: This blog post was initially written back in mid-2021, prior to the release of DALL-E 2, Stable Diffusion, and Midjourney, as a companion to this GitHub repository and notebook. It is therefore a bit outdated, but you can still check it out if you are interested in what the landscape of text-to-image generation looked like back then.
On January 5, 2021, OpenAI released a blog post introducing their new deep learning model DALL-E [1], a transformer language model trained to generate images from text captions. A few months later, they published the paper Zero-Shot Text-to-Image Generation describing their approach to creating this model, along with code to replicate the discrete Variational Auto-Encoder (dVAE) used in their research.
Zero-Shot Text-to-Image generation refers to the concept of generating an image from a text input in a way that makes the image consistent with the text. If the prompt “A giraffe wearing a red scarf” is given, then one would expect the output to be, well, an image that looks like a giraffe with a red piece of cloth around its neck. The Zero-Shot part comes from the fact that the model wasn’t explicitly trained with a fixed set of text prompts, meaning that it can be used with any text input you can think of.
How does DALL-E work?
DALL-E is a language model that is, at its core, an autoregressive transformer with 12 billion parameters trained on 250 million image-text pairs. In the paper, the methodology is described in two parts, corresponding to the two stages of training:
- The first stage was about learning the visual vocabulary of the image-text pairs: they trained a discrete Variational Auto-Encoder (dVAE) to compress the 256x256x3 training images into 32x32 grids of discrete image tokens, each drawn from a vocabulary of size 8192. That is, the dVAE learns to map an image to, and reconstruct it from, an embedding (or latent) space of 32*32 = 1024 integers (image tokens); see the encoding sketch after this list.
- The second stage was about “learning the prior distribution over the text and image tokens”. Here they concatenated the (up to) 256 tokens obtained from encoding the input text prompt with the 1024 tokens from the corresponding encoded image, and trained a transformer to model this autoregressively as a single stream of 256 + 1024 = 1280 tokens. The result is that, from an initial set of at least the 256 text tokens, the model “autocompletes” the remaining ones such that the generated image is consistent with the initial tokens [3].
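To make the first stage concrete, here is a minimal sketch of encoding an image with the released dVAE, following the usage example in the openai/DALL-E repository (the image path is just an illustrative placeholder):

```python
import torch
import torchvision.transforms as T
from PIL import Image
from dall_e import load_model, map_pixels  # pip install DALL-E

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", dev)

# Resize a local image (illustrative path) to 256x256 and map it to the range the dVAE expects.
img = Image.open("example.png").convert("RGB")
x = map_pixels(T.ToTensor()(T.Resize((256, 256))(img)).unsqueeze(0).to(dev))

# Encode: a 256x256x3 image becomes a 32x32 grid of discrete tokens with vocabulary size 8192.
z_logits = enc(x)                        # (1, 8192, 32, 32)
tokens = torch.argmax(z_logits, dim=1)   # (1, 32, 32), integers in [0, 8192)
print(tokens.shape)
```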
In summary, with the dVAE from the first stage and the autoregressive transformer from the second, a single DALL-E generation step has to (1) use the transformer to predict the 1024 image tokens that follow the 256 tokens obtained from the input text prompt, and (2) take the full grid of 1024 generated image tokens and map it from the latent space back to image space with the dVAE decoder, as sketched below.
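Putting both stages together, a generation step would look roughly like the sketch below. Since the 12-billion-parameter transformer was never released, `sample_image_tokens` is a purely hypothetical placeholder that returns random tokens here; only the decoding step follows the released dVAE code:

```python
import torch
import torch.nn.functional as F
from dall_e import load_model, unmap_pixels  # pip install DALL-E

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", dev)

def sample_image_tokens(text_tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the unreleased transformer: it should autoregressively
    predict the 1024 image tokens conditioned on the 256 text tokens. Random tokens
    are returned here just so the sketch runs end to end."""
    return torch.randint(0, 8192, (1, 32, 32), device=dev)

text_tokens = torch.randint(0, 16384, (1, 256), device=dev)  # stand-in for a BPE-encoded prompt
image_tokens = sample_image_tokens(text_tokens)

# Map the 32x32 token grid back to a 256x256 image with the dVAE decoder.
z = F.one_hot(image_tokens, num_classes=8192).permute(0, 3, 1, 2).float()
x_stats = dec(z).float()
image = unmap_pixels(torch.sigmoid(x_stats[:, :3]))  # (1, 3, 256, 256), values in [0, 1]
```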
DALL-E Results
The results published in their blog post and paper show an impressive ability to generate completely new images that are coherent with the input text prompt. The model can also complete images whose bottom part is missing, generating a new bottom half that is consistent with the given top half.
Implementation of other Text-to-Image Generation Schemes using OpenAI’s CLIP
Even though a lot of people would love to play with DALL-E and/or see more of it in action, OpenAI hasn’t fully released it to the public yet. What they did release is the dVAE described in the first stage of their paper. But even though the dVAE can be used to map existing images to its latent space and reconstruct them, it is missing the crucial component that actually turns text into images: the transformer.
Additionally, for most people and companies it is prohibitively expensive to train a model as large as DALL-E themselves (it would cost well over a hundred thousand dollars!). Because of that, and until the full model is released (if ever), we are bound to look for, or come up with, other schemes that can do text-to-image generation in a different way.
Ryan Murdoch (@advadnoun on Twitter) is one person who has come up with a simple scheme to accomplish this: a method that combines OpenAI’s own CLIP with any differentiable image generative model (like DALL-E’s dVAE) to generate images from any text prompt.
Text-to-Image generation with CLIP
CLIP was introduced by OpenAI in another blog post on the same day they introduced DALL-E. CLIP is another transformer-based neural network, trained on hundreds of millions of image-text pairs from the internet, designed essentially to be really good at telling whether an image and a piece of text fit together.
Given an image and any set of text labels, CLIP will output how likely each label is to represent the image. So if you show CLIP an image of a cat and the labels ["a dog", "a giraffe", "a house", "a cat"], it will assign the highest probability to the label that matches the picture ("a cat" in this case).
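As a concrete example, this is roughly how that scenario looks with the openai/CLIP package (the image path and labels are illustrative):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cat.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a giraffe", "a house", "a cat"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # the highest probability should land on "a cat"
```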
The beauty of CLIP is that the part that “discriminates” image-text pairs is fully differentiable. So if we feed every image an image generator produces into CLIP, and define our loss as the mismatch between the target text and the generated image, that “error” can be backpropagated through any differentiable image generator, incrementally moving it closer and closer to an image that CLIP is “happy” with.
So, if we start with an image obtained from a differentiable image generator (it can be random, or just noise), we need to traverse the generator’s embedding (latent) space in the direction that minimizes CLIP’s error until we reach an image that is good enough at matching the text (by CLIP’s standards).
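A minimal sketch of this optimization loop is shown below. The `generator` is a toy stand-in for whatever differentiable image generator you plug in (DALL-E’s dVAE decoder, a VQGAN from Taming Transformers, a SIREN, etc.), and real implementations also apply CLIP’s image normalization and random augmentations, which are omitted here for brevity:

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Toy differentiable "generator": any module mapping a latent z to an image works here.
generator = torch.nn.Sequential(torch.nn.Linear(128, 3 * 224 * 224), torch.nn.Sigmoid()).to(device)

z = torch.randn(1, 128, device=device, requires_grad=True)  # start from a random latent
opt = torch.optim.Adam([z], lr=0.05)

# Embed the target prompt once; only the latent z is optimized.
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(["a giraffe wearing a red scarf"]).to(device))
    text_features = F.normalize(text_features, dim=-1)

for step in range(200):
    image = generator(z).view(1, 3, 224, 224)               # generate a candidate image
    image_features = F.normalize(model.encode_image(image), dim=-1)
    loss = -(image_features * text_features).sum()          # maximize CLIP similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With a real generator in place of the toy one, this same loop is the core of the CLIP-guided generation schemes discussed here.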
The following are some examples of media I’ve been able to generate with this method:
References:
- Zero-Shot Text-to-Image Generation: https://paperswithcode.com/paper/zero-shot-text-to-image-generation (Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever)
- OpenAI CLIP: https://github.com/openai/CLIP (Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever)
- CompVis Taming Transformers: https://github.com/CompVis/taming-transformers (Patrick Esser, Robin Rombach, Bjorn Ommer)
- Ryan Murdoch’s work (@advadnoun on Twitter). Most of the code implementations here are taken and/or adapted from some of his notebooks.
- OpenAI DALL-E’s dVAE: https://github.com/openai/DALL-E/ (Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever)