How Does DALL·E 2 Work?

Aditya Singh · Augmented Startups · Apr 27, 2022
Ink In Water by lightcircle

At the beginning of 2021, OpenAI released an AI system called DALL·E that could generate realistic images from a description of a scene or object. The generator's name is a frankenword combining the artist Salvador Dalí and WALL·E, the robot from the Pixar movie of the same name. Within days, it had taken the world of computer vision and artificial intelligence by storm.

Source: OpenAI [2]

Recently, OpenAI introduced DALL·E's successor, DALL·E 2: a more versatile and efficient generative system capable of producing higher-resolution images. Compared to DALL·E's 12 billion parameters, DALL·E 2 runs on a 3.5-billion-parameter model, plus another 1.5-billion-parameter model that enhances the resolution of its images.

Source: OpenAI [2]

A significant addition to DALL·E 2 is its ability to realistically edit and retouch photos using “inpainting”. Users can input a text prompt for the desired change and select an area on the image they want to edit. Within seconds, DALL·E 2 produces several options to choose from.

Notice how the inpainted objects have proper shadows and lighting? This demonstrates DALL·E 2's enhanced ability to understand the global relationships between different objects and the environment in the image, something the original DALL·E system struggled with.

Source: OpenAI [2]

In addition to text-to-image generation, DALL·E 2 can take an image and create different variations of it inspired by the original.

But what’s behind this understanding of our complex world? What makes DALL·E 2 tick? Let’s take a look inside and see how it works.

Architecture & Approach Overview

DALL·E 2 image generation process (Source: Author)

Here’s a quick rundown of the DALL·E 2 text-to-image generation process:

A text encoder takes the text prompt and generates text embeddings. These text embeddings serve as the input for a model called the prior, which generates the corresponding image embeddings. Finally, an image decoder model generates an actual image from the embeddings.
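In very rough pseudocode, the whole pipeline is just three stages chained together. The sketch below is conceptual only; the function names are hypothetical stand-ins for the trained models, not OpenAI's actual API:

def generate_image(prompt, clip_text_encoder, prior, decoder):
    # Conceptual DALL·E 2 pipeline; the three model arguments are stand-ins
    text_embedding = clip_text_encoder(prompt)   # 1. text prompt -> CLIP text embedding
    image_embedding = prior(text_embedding)      # 2. text embedding -> CLIP image embedding
    return decoder(image_embedding, prompt)      # 3. image embedding (+ caption) -> pixels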

Sounds straightforward, but how does each of these steps actually work?

The text and image embeddings used by DALL·E 2 come from another network created by OpenAI called CLIP. So to understand how CLIP is used in DALL·E 2, let’s first take a quick look into what CLIP is and how it works.

Understanding CLIP: Connecting Textual & Visual Information

CLIP (Contrastive Language-Image Pre-training) is a neural network model that, given an image and a set of candidate captions, returns the caption that best matches the image. In that sense, it does the opposite of DALL·E 2's text-to-image generation. Instead of a predictive objective, such as generating or classifying an image, CLIP has a contrastive objective: learning how closely the textual and visual representations of the same concept correspond.
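To make that concrete, here is a small sketch of CLIP used as a caption scorer, written against OpenAI's open-source clip package. The image path and candidate captions are placeholders:

import torch
import clip                      # OpenAI's open-source CLIP package
from PIL import Image

model, preprocess = clip.load("ViT-B/32")                    # pretrained CLIP
image = preprocess(Image.open("corgi.jpg")).unsqueeze(0)     # placeholder image path
captions = ["a corgi wearing a party hat", "a cat", "an oil painting of the sea"]
tokens = clip.tokenize(captions)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(tokens)

# Cosine similarity between the image and every candidate caption
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
best = (image_emb @ text_emb.T).argmax(dim=-1).item()
print(captions[best])            # prints the best-matching caption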

CLIP Training

CLIP Training Process (Source: Author)

The basic idea of the CLIP training is simple:

  1. Generate the image and text encodings for each image-caption pair in a batch.
  2. Calculate the cosine similarity of each (image, text) embedding pair.
  3. Iteratively minimize the cosine similarity between incorrect image-caption pairs, and maximize the cosine similarity between the correct image-caption pairs.
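Here is a minimal PyTorch-style sketch of that contrastive objective, assuming image_emb and text_emb are batches of embeddings for matching image-caption pairs. The real CLIP also uses a learned temperature and far larger batches:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so that the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity of every image with every caption in the batch
    logits = image_emb @ text_emb.T / temperature
    # Matching pairs sit on the diagonal; all other pairs act as negatives
    targets = torch.arange(image_emb.shape[0], device=image_emb.device)
    loss_images = F.cross_entropy(logits, targets)    # image -> correct caption
    loss_texts = F.cross_entropy(logits.T, targets)   # caption -> correct image
    return (loss_images + loss_texts) / 2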

Once the training is complete, the CLIP model is frozen and DALL·E 2 moves on to the next task: finding the appropriate image CLIP embeddings for a text prompt.

Connecting Textual Semantics to Corresponding Visual Semantics

Although the intermediate text and image encodings are both CLIP embeddings, it is not the CLIP image encoder that creates the image embeddings at generation time. DALL·E 2 uses another model, called the prior, to generate CLIP image embeddings from the text embeddings produced by the CLIP text encoder.

The DALL·E 2 researchers tried two options for the prior: an Autoregressive prior and a Diffusion prior. Both yielded comparable performance, but the Diffusion model is more computationally efficient, so it was selected as the prior for DALL·E 2.

In case you’re not familiar with them, here’s a brief overview of diffusion models.

Diffusion Models

How Diffusion Models work (Source: Author)

Diffusion models are a class of generative models. They take a piece of data, for example a photo, and gradually add noise to it over a series of timesteps until it is no longer recognizable. From that point, they learn to reconstruct the data back to its original form. In doing so, they learn how to generate images, or any other kind of data, from pure noise. (In DALL·E 2, the diffusion prior is built on a Transformer backbone.)
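As a toy sketch, a DDPM-style training step looks roughly like the code below, where model is any network that predicts the noise that was added and alphas_cumprod is the cumulative noise schedule. This is a generic illustration, not DALL·E 2's exact implementation:

import torch
import torch.nn.functional as F

def forward_noise(x0, t, alphas_cumprod):
    # Forward process: mix the clean image with Gaussian noise at timestep t
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)   # cumulative noise schedule at t
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return x_t, noise

def diffusion_training_step(model, x0, alphas_cumprod):
    # Pick a random timestep per image, noise the images, and train the model
    # to predict the noise that was added
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    x_t, noise = forward_noise(x0, t, alphas_cumprod)
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)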

Back to DALL·E 2

Why Waste Time With A Prior Model?

At this point, a question may come to your mind: “Why bother with the prior at all?”

Well, the authors thought the same thing. So they ran some experiments. Let’s take a look at an example from the paper to understand the need for a prior:

For the caption “an oil painting of a corgi wearing a party hat” —

Passing the caption directly to the decoder (Source: Author)

Passing the caption directly to the decoder gives this image of a human wearing a hat.

Passing the CLIP embedding to the decoder (Source: Author)

And passing the CLIP embedding to the decoder gives this image of a partly out-of-frame corgi.

Using the prior-generated image embedding (Source: Author)

Finally, using the prior-generated image embeddings gives a better, more complete image.

Although the CLIP text embeddings manage to generate acceptable results, removing the prior results in the loss of DALL·E 2's ability to generate variations of images. (more on this later)

Now, let’s move on to the decoder.

Generating Image From Image Embeddings

In DALL·E 2, the decoder is yet another model created by OpenAI, called GLIDE (Guided Language to Image Diffusion for Generation and Editing). GLIDE is a modified diffusion model. What sets it apart from a pure diffusion model is the inclusion of textual information in the generation process.

GLIDE

A Diffusion Model starts from randomly sampled Gaussian noise, so there is no way to guide the process toward a specific image. For instance, a Diffusion Model trained on a dog dataset will consistently generate photorealistic dog images, but what if someone wanted to generate a specific breed of dog?

GLIDE includes the text caption in the generation process (Source: Author)

GLIDE builds on the generative success of Diffusion Models by augmenting the training process with additional textual embeddings. This results in text-conditional image generation. It’s this modified GLIDE model that enables DALL·E 2 to edit images using text prompts.

DALL·E 2’s modified GLIDE adds the caption CLIP embeddings (Source: Author)

The GLIDE model used as the decoder in DALL·E 2 is slightly modified: it is conditioned not only on the text caption but also on the CLIP image embeddings produced by the prior.
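Loosely sketched, one conditioned denoising step with classifier-free guidance might look like the following. The model signature here is a hypothetical stand-in, not GLIDE's real interface:

def guided_denoise_step(model, x_t, t, text_tokens, clip_image_emb, guidance_scale=3.0):
    # Run the denoiser with and without the conditioning, then push the prediction
    # toward the conditioned direction (classifier-free guidance)
    eps_cond = model(x_t, t, text=text_tokens, clip_emb=clip_image_emb)
    eps_uncond = model(x_t, t, text=None, clip_emb=None)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

A larger guidance_scale makes the output follow the conditioning more closely, at the cost of sample diversity.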

Upsampling

Upsampling steps after the initial image generation (Source: Author)

After the decoder generates a preliminary 64x64-pixel image, it goes through two up-sampling steps (64x64 to 256x256, then 256x256 to 1024x1024) to produce the final high-resolution 1024x1024-pixel image.
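Putting the pieces together, the full generation cascade looks conceptually like this, again with hypothetical stand-in models:

def generate_high_res(decoder, upsampler_256, upsampler_1024, image_embedding, prompt):
    # Conceptual cascade: base generation followed by two diffusion upsamplers
    base = decoder(image_embedding, prompt)   # 64 x 64 image
    medium = upsampler_256(base)              # 64 x 64   -> 256 x 256
    return upsampler_1024(medium)             # 256 x 256 -> 1024 x 1024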

How Does DALL·E 2 Make The Variations?

To make variations of an image, you keep the main elements and the style, and play around with the trivial details.

How DALL·E 2 generates image variations (Source: Author)

DALL·E 2 creates image variations by obtaining the image’s CLIP embeddings and running them through the Diffusion decoder. An interesting byproduct of this process is an insight into what details are learned by the models and what details are missed.
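A rough sketch of that variations path, with hypothetical stand-ins for the CLIP image encoder and the diffusion decoder; the only thing that changes between runs is the random noise the decoder starts from:

import torch

def generate_variations(clip_image_encoder, decoder, source_image, num_variations=4):
    # Same CLIP image embedding, different starting noise -> different variations
    image_embedding = clip_image_encoder(source_image)
    variations = []
    for seed in range(num_variations):
        torch.manual_seed(seed)              # a new noise sample for each variation
        variations.append(decoder(image_embedding))
    return variations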

Limitations & Bias

Images generated for the prompt “A sign that says deep learning” (Source: DALL·E 2 Paper [1])

As phenomenal as DALL·E 2 is, it still has some limitations. Firstly, it's not yet good at generating images with coherent text. For instance, when asked to generate images for the prompt “A sign that says deep learning”, it produces the above images containing gibberish.

Images generated for the prompt “a red cube on top of a blue cube” (Source: DALL·E 2 Paper [1])

Moreover, it’s not good at associating attributes with objects. When tasked with generating an image of “a red cube on top of a blue cube”, it tends to confuse which cube needs to be red and which one needs to be blue.

Images generated for the prompt “Times Square” (Source: DALL·E 2 Paper [1])

Another area where DALL·E 2 fails to perform is the generation of complicated scenes. When the authors tried to generate images of “Times Square”, DALL·E 2 produced billboards without any comprehensible detail.

Bias in DALL·E 2 generations

Besides the image-generation related limitations, DALL·E 2 also has inherent biases due to the skewed nature of data collected from the internet. It has gender-biased occupation representations, and it generates predominantly western features for many prompts.

Last Epoch

From a research perspective, DALL·E 2 reaffirms how well transformer models scale to large datasets, thanks to their exceptional parallelizability. In addition, DALL·E 2 demonstrates the effectiveness of Diffusion Models by using them in both the prior and the decoder.

DALL·E-generated images for the prompt “an armchair in the shape of an avocado” (Source: DALL·E blog)

In terms of applications, I still don't think we'll see many DALL·E 2-generated images used commercially due to the bias implications. But that doesn't mean it won't see any use at all. One crucial application is the generation of synthetic data for adversarial learning. After all, how many images of an “avocado-shaped armchair” can you find lying around?

Google Pixel’s Magic Eraser

Thanks to its inpainting abilities, another promising area of application is image editing. We might just see text-based image editing features in our smartphones. Think Google Pixel’s Magic Eraser, on steroids.

Here’s what the folks at OpenAI have to say about it:

Our hope is that DALL·E 2 will empower people to express themselves creatively. DALL·E 2 also helps us understand how advanced AI systems see and understand our world, which is critical to our mission of creating AI that benefits humanity.

What do you think about DALL·E 2? Do you think it will replace illustrators anytime soon?

References:

[1] DALL·E 2 paper
[2] OpenAI Blog
[3] CLIP Paper
[4] 2021 DALL·E Paper
[5] Diffusion Model Paper
[6] GLIDE Paper
