How does DALL-E, the text-to-image generator, work?

Mehul Gupta
Data Science in your pocket
7 min read · Feb 2, 2023


Understanding the deep learning architecture and maths behind DALL-E

Photo by James A. Molnar on Unsplash

You must have heard of ChatGPT by now if you have been on Earth for the last 2–3 months. The AI-driven chatbot has driven the world crazy. But some time before the launch of ChatGPT, DALL-E introduced the world to the power of generative modeling, and hence I will try to unlock the secret recipe behind DALL-E, the text-to-image converter: you provide any random description (an apple eating a man, Iron Man selling veggies in your locality, or whatnot) and it generates an image to match. See a sample for yourself from DALL-E mini.

Wowzaaaa

My debut book “LangChain in your Pocket” is out now

Being a Data Scientist, I did wish to deep dive into how DALL-E works, and this post is about the behind-the-scenes of DALL-E. But before we jump into the technicalities, we need to know:

What is Multimodal modeling?

Models that handle multiple, diverse data types are called multimodal models. The multimodality can be in the input, the output, or both. For example:

Input: Text, Output: Image (DALL-E)

Input: Text + Tabular data (maybe text classification with metadata), Output: Text

Input: Image/Audio, Output: Text (speech recognition, image captioning)

So, DALL-E is a classic example of Text-to-Image generation where we input a text description and wish to generate an image similar to the description provided.

A big challenge in multimodal modeling is how to form a bridge between the different modes of data being used. For DALL-E, that means relating the text representation to an image representation: understanding the meaning of the words (say, through word embeddings) and then converting these meaningful word embeddings into image content that conveys the same meaning.

Architecture Overview

The architecture, at a high level, is quite easy to understand and consists of 3 parts:

Text Encoder: encodes the text prompt into text embeddings

Prior: uses the text embeddings to generate image embeddings (the bridge)

Decoder: generates the image using the image embeddings

The training process: text-image pairs form the training dataset.
Final DALL-E architecture. Note that the Image Encoder has been removed.

As the above image shows, this is the whole sequence of models used for training DALL-E, as well as the final architecture used for image generation. No need to worry, we will talk about each of the segments shortly.

Text Encoder

If you are not a beginner in Data Science, you must have heard of text embeddings by now. Significant breakthroughs in NLP, be it Transformers, BERT, or GPT, are all about generating meaningful numerical representations, aka embeddings, for text data. So how does DALL-E generate text embeddings?

Contrastive Language-Image Pre-Training (CLIP)

CLIP generates text embeddings from text data using Contrastive learning. Let's see how.

What is Contrastive learning?

Contrastive learning involves learning a low-dimensional representation of an entity (be it text or image) by contrasting similar and dissimilar objects, i.e. it tries to keep the low-dimensional representations of similar objects close while pushing the representations of dissimilar objects far apart.

So, in CLIP’s case

  • A massive dataset of text+image pairs is prepared
  • Text and image embeddings are generated using 2 encoder architectures, one for text and the other for images
  • Now, out of a batch of text+image pairs, CLIP tries to find the right match for every text_embedding & image_embedding by minimizing the distance between matching pairs and maximizing the distance between non-matching pairs. The distance/similarity metric used is cosine similarity (see the sketch after the figure link below).
  • The image below depicts how CLIP training is performed: minimize the distance (or, say, maximize the similarity) between the diagonal pairs (the right matches, marked in blue) and maximize it for all other pairs.
https://arxiv.org/pdf/2103.00020.pdf
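
To make the contrastive objective concrete, here is a minimal PyTorch sketch of a CLIP-style batch loss. The encoder outputs are stood in for by random tensors, and names like clip_contrastive_loss are my own placeholders, not CLIP's actual code.

```python
# A minimal sketch of CLIP-style contrastive training on a batch of
# (text, image) pairs. The encoders themselves are assumed/placeholder.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """text_emb, image_emb: (batch, dim) outputs of the two encoders."""
    # L2-normalise so that dot products equal cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (batch, batch) matrix of cosine similarities; the diagonal holds the matching pairs
    logits = text_emb @ image_emb.T / temperature

    # Each text should match its own image (and vice versa), i.e. the diagonal entries
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)    # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)  # image -> text direction
    return (loss_t2i + loss_i2t) / 2

# Usage with random stand-ins for the encoder outputs:
text_emb = torch.randn(8, 512)   # would be text_encoder(batch_of_captions)
image_emb = torch.randn(8, 512)  # would be image_encoder(batch_of_images)
loss = clip_contrastive_loss(text_emb, image_emb)
```

Minimizing this loss pulls the diagonal (matching) pairs together and pushes all other pairs apart, which is exactly the behaviour described above.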

About the models used for generating the embeddings: the text encoder is a Transformer of the kind introduced in 'Attention Is All You Need', while the image encoder in the CLIP paper is either a ResNet or a Vision Transformer. Before wrapping up on CLIP, note that the concept of Contrastive learning is used for zero/one-shot learning as well (you can read about how contrastive loss is used to train Siamese Networks, which implement one-shot learning), and hence even CLIP can be used for one/zero-shot learning.

So, as you must have guessed, we will be using the text encoder from CLIP for DALL-E. Does the image encoder go in vain? Not really, as we will be using it to train the next segment.

Prior

So far we have generated the text embedding. But the difficult part is left: how do we transform this text embedding into an image embedding? That is exactly what the Prior segment does. One easy approach is to treat this as a Seq2Seq translation task (like converting English to Hindi) where we transform the text embedding (1st sequence) into the image embedding (2nd sequence) using a model like a Transformer. This is called the AutoRegressive approach.

But DALL-E uses the idea of Diffusion instead, as it outperforms the AutoRegressive approach by a margin.

Diffusion

So, this is a concept that has been the talk of the town in generative modeling, at times outperforming even GANs. What actually is Diffusion? I will try to give an intuitive explanation, though the topic deserves a separate blog post that I will be writing soon. For now, let's get a brief idea.

The idea is simple (at least on paper):

Take up an image

Add some noise to this image at every timestep 't'. This is called Forward Diffusion. Hence you have a comparatively more deteriorated image at any timestep t than at timestep t-1.

There comes a point, at timestep T, when the image is completely destroyed, i.e. it is just random noise.

Then reverse the whole process: reduce the noise step by step so that we regain the original image as gradually as we destroyed it. The idea is to get from the image at timestep 't' back to the image at timestep 't-1' using a denoiser model, so that eventually we recover the original image itself.

https://arxiv.org/pdf/2006.11239.pdf

This is how Diffusion models are trained for generative modeling.

Also, do note that it's the reverse part that helps us generate images from random noise. I'm not deep diving into the mathematics, to keep the length of this post in check.
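
For intuition, here is a minimal PyTorch sketch of the forward-diffusion step and the usual training target, assuming a simple linear noise schedule. The schedule values and the denoiser name are illustrative assumptions, not DALL-E's actual settings.

```python
# A minimal sketch of the forward-diffusion step from the DDPM paper:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product over timesteps

def forward_diffusion(x0, t, noise=None):
    """Jump directly from the clean sample x0 to its noised version at step t."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
    return x_t, noise

# Training idea: the denoiser is asked to predict the noise that was added,
# so the loss is simply the MSE between predicted and actual noise.
x0 = torch.randn(4, 3, 64, 64)              # a batch of (toy) images
t = torch.randint(0, T, (4,))
x_t, noise = forward_diffusion(x0, t)
# loss = mse(denoiser(x_t, t), noise)       # `denoiser` is the model being trained
```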

Coming back to DALL-E's Prior segment, how are we going to use Diffusion for training?

  • We first concatenate the text embedding and the corresponding image embedding generated for the training dataset. The image embeddings are the output of CLIP's Image Encoder, which we trained in the previous step.
  • We train a Decoder-only Transformer using the Diffusion method discussed earlier, applied only to the image-embedding part, i.e. adding random noise at regular intervals until the image embedding is completely destroyed, and then running the reverse cycle to get back to the original image embedding. Throughout this cycle, the concatenated text embedding remains untouched and is available to the model (see the sketch after the figure link below).
  • While generating new images (at test time), we directly use the reverse-cycle part, where the image embedding is initialized randomly (similar to the output of Forward Diffusion) instead of coming from CLIP, as shown in the image below.
https://arxiv.org/pdf/2006.11239.pdf
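
Here is a rough, hypothetical sketch of one training step for such a prior. The prior_model name is a placeholder for the Decoder-only Transformer; it takes the text embedding, the noised image embedding, and the timestep, and is trained to recover the clean CLIP image embedding. The real model has more conditioning details than shown here.

```python
# A minimal sketch of one diffusion-prior training step. `prior_model` is an
# assumed placeholder for the Decoder-only Transformer described above.
import torch
import torch.nn.functional as F

def prior_training_step(prior_model, text_emb, clip_image_emb, alpha_bars):
    B, T = clip_image_emb.size(0), alpha_bars.size(0)
    t = torch.randint(0, T, (B,), device=clip_image_emb.device)
    noise = torch.randn_like(clip_image_emb)
    ab = alpha_bars[t].unsqueeze(-1)

    # Forward diffusion applied to the image embedding only;
    # the text embedding is left untouched and used purely as conditioning.
    noised_img_emb = ab.sqrt() * clip_image_emb + (1 - ab).sqrt() * noise

    # The prior predicts the clean image embedding from (text, noised image, t)
    pred_img_emb = prior_model(text_emb, noised_img_emb, t)
    return F.mse_loss(pred_img_emb, clip_image_emb)
```

At test time the same model is run through the reverse cycle starting from a randomly initialized image embedding, conditioned only on the text embedding.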

Now, let’s move to the 3rd part

Decoder

The Decoder's role is to generate the complete image using the image embedding we got from the reverse cycle of the Diffusion process, along with the text embedding. The generative model DALL-E uses as a decoder is GLIDE.

What is GLIDE now?

So, before DALL-E, there was GLIDE, which also tried to generate images from text prompts, but without any image embeddings. Hence just the text embeddings are used, and there was no concept of a Prior in GLIDE. Let's look at the steps:

  • Generate text embeddings using a Transformer
  • Feed them to a UNet, trained as a denoiser using Diffusion, to generate images directly from the text embeddings

What’s a UNet?

Covering this quickly: a UNet is a special type of AutoEncoder whose architecture looks like the letter 'U'. The encoder gradually lowers the dimension of the latent space as the encoder network gets deeper, the decoder expands the dimension again as the decoder network gets deeper, finally returning to the original dimensions, and skip connections join encoder and decoder layers at matching depths. The architecture below should give you an intuitive idea of a UNet; I won't be deep diving into it.

https://arxiv.org/pdf/1505.04597.pdf
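
To make the 'U' shape concrete, here is a tiny, illustrative encoder-decoder with a single skip connection, written in PyTorch. It is nowhere near the real architecture from the paper, just the minimal pattern of downsampling, upsampling, and concatenating features from the matching level.

```python
# A toy U-shaped network: downsample, bottleneck, upsample, and reuse
# encoder features via a skip connection. Purely illustrative.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=3, base=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(ch, base, 3, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        # After the skip connection, the decoder sees 2 * base channels
        self.out = nn.Conv2d(base * 2, ch, 3, padding=1)

    def forward(self, x):
        d1 = self.down1(x)      # full resolution features
        d2 = self.down2(d1)     # half resolution
        b = self.bottleneck(d2)
        u = self.up(b)          # back to full resolution
        return self.out(torch.cat([u, d1], dim=1))  # concatenate the skip features

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 3, 64, 64])
```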

Coming back to the Decoder, as you must have guessed:

  • The only difference between DALL-E and GLIDE is that DALL-E uses image embeddings as well (CLIP image embeddings, to be specific), while GLIDE generates the final image using just the text embeddings
  • Hence, DALL-E is basically a modified GLIDE.

So, wrapping up, let's revisit the whole DALL-E architecture once again:

  1. Train CLIP with Contrastive Learning to obtain the encoders that generate text and image embeddings.
  2. Concatenate the two embeddings and train a Decoder-only Transformer (the Prior) with the Diffusion process, keeping the text embedding untouched, such that given a text embedding with a random image embedding it outputs a CLIP-like image embedding.
  3. Use GLIDE, a UNet trained with a Diffusion process, to take the text_embedding + image_embedding produced by the Prior and generate the final image, as sketched below.
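
Putting the three stages together, a hypothetical inference path could be sketched like this, with text_encoder, prior, and decoder as placeholder callables (assumptions for illustration, not real APIs):

```python
# A hypothetical end-to-end sketch of the generation (inference) path:
# CLIP text encoder -> diffusion Prior -> GLIDE-style diffusion decoder.
def generate_image(prompt, text_encoder, prior, decoder):
    # 1. The CLIP text encoder turns the prompt into a text embedding
    text_emb = text_encoder(prompt)

    # 2. The Prior runs the reverse-diffusion cycle from a randomly
    #    initialized image embedding, conditioned on the text embedding
    image_emb = prior.reverse_diffusion(condition=text_emb)

    # 3. The GLIDE-style decoder (a diffusion UNet) turns the embeddings
    #    into the final pixels
    return decoder.reverse_diffusion(condition=(text_emb, image_emb))
```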

Enough for today, let’s meet sometime soon
