DALL·E mini Explained with Demo

How to create images from a text prompt: DALL·E mini, an open-source replica of OpenAI’s DALL·E, explained and demonstrated.

r3d_robot


With DALL·E, OpenAI released the first truly impressive model for generating images from text. DALL·E mini is an attempt to reproduce those results with an open-source model.

Content

  • History
  • Demo
  • DALL·E mini Model Architecture

History

A small group of scientists began work in mid-2021 on replicating the results of OpenAI’s DALL·E with a smaller architecture.
Despite having far fewer hardware resources, they produced remarkable (though lower-quality) results: their model is 27 times smaller than the original DALL·E and took only three days to train on a single TPU v3-8.
They achieved this by simplifying the architecture and reducing the model’s memory requirements, as well as by using open-source code and pre-trained models.

Demo

Anyone can try their app here!

Here are a few examples of what DALL·E mini can create.

[Example images generated by DALL·E mini]

As you can see, the results are not very high-resolution, but they are still quite impressive! For reference, you can look at the performance of OpenAI’s more recent DALL·E 2 below:

[Example images generated by DALL·E 2 (Source)]

DALL·E mini Model Architecture

Training Process

Images and descriptions are both provided during training and flow through the system in the following order:

  • Images are encoded through a VQGAN encoder, which turns images into a sequence of tokens.
  • Descriptions are encoded through a BART encoder.
  • The output of the BART encoder and encoded images are fed through the BART decoder, which is an auto-regressive model whose goal is to predict the next token.
  • Loss is the softmax cross-entropy between the model prediction logits and the actual image encodings from the VQGAN.
[Figure: the training pipeline of DALL·E mini]
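
To make this data flow concrete, below is a minimal Python sketch of a single training step. The VQGAN and BART components are replaced by toy stand-in functions, and the codebook size, sequence length, and hidden dimension are made-up numbers, so this only illustrates the shape of the computation and the cross-entropy loss, not the actual dalle-mini implementation (which is written in JAX/Flax).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes for illustration only; the real model uses a much larger
# VQGAN codebook and a BART-style encoder/decoder.
VOCAB_IMG = 1024   # size of the VQGAN codebook (assumption)
SEQ_IMG = 256      # number of image tokens per image (assumption)
HIDDEN = 64        # hidden dimension of the text encoder (assumption)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Toy stand-ins for the pre-trained components.
def vqgan_encode(image):
    """Stand-in: turn an image into a sequence of discrete codebook indices."""
    return rng.integers(0, VOCAB_IMG, size=SEQ_IMG)

def bart_encode(caption_tokens):
    """Stand-in: turn caption tokens into hidden states."""
    return rng.normal(size=(len(caption_tokens), HIDDEN))

def bart_decode(encoder_states, previous_image_tokens):
    """Stand-in: return logits over the next image token."""
    return rng.normal(size=VOCAB_IMG)

def training_loss(image, caption_tokens):
    targets = vqgan_encode(image)        # ground-truth image tokens
    enc = bart_encode(caption_tokens)    # encoded caption
    loss = 0.0
    for t in range(SEQ_IMG):
        # Teacher forcing: the decoder sees the true previous tokens.
        logits = bart_decode(enc, targets[:t])
        probs = softmax(logits)
        loss += -np.log(probs[targets[t]] + 1e-9)  # cross-entropy on the true next token
    return loss / SEQ_IMG

print(training_loss(image=None, caption_tokens=[5, 42, 7]))
```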

Inference Process

At inference time, only a caption is available, and the goal is to generate images:

  • The caption is encoded through the BART encoder.
  • A <BOS> token (special token identifying the “Beginning Of Sequence”) is fed through the BART decoder.
  • Image tokens are sampled sequentially based on the decoder’s predicted distribution over the next token.
  • Sequences of image tokens are decoded through the VQGAN decoder.
  • CLIP is used to select the best generated images.
[Figure: the inference pipeline of DALL·E mini]
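
In the same spirit, here is a rough Python sketch of the inference loop, again with toy stand-ins in place of the real BART, VQGAN, and CLIP models (all sizes and the number of candidates are assumptions made for the example): several candidate token sequences are sampled from the decoder, each is decoded into pixels with the VQGAN decoder, and CLIP picks the best match for the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes, as in the training sketch above.
VOCAB_IMG = 1024     # VQGAN codebook size (assumption)
SEQ_IMG = 256        # image tokens generated per candidate (assumption)
N_CANDIDATES = 4     # candidates sampled before CLIP re-ranking (assumption)
HIDDEN = 64          # text-encoder hidden dimension (assumption)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Toy stand-ins for the pre-trained components.
def bart_encode(caption_tokens):
    """Stand-in: turn caption tokens into hidden states."""
    return rng.normal(size=(len(caption_tokens), HIDDEN))

def bart_decode(encoder_states, generated_tokens):
    """Stand-in: logits over the next image token."""
    return rng.normal(size=VOCAB_IMG)

def vqgan_decode(image_tokens):
    """Stand-in: map image tokens back to pixels."""
    return rng.random((256, 256, 3))

def clip_score(image, caption_tokens):
    """Stand-in for CLIP's image/text similarity score."""
    return rng.random()

def generate(caption_tokens):
    enc = bart_encode(caption_tokens)
    candidates = []
    for _ in range(N_CANDIDATES):
        tokens = []  # generation starts from <BOS>; here the empty list plays that role
        for _ in range(SEQ_IMG):
            logits = bart_decode(enc, tokens)
            probs = softmax(logits)
            tokens.append(int(rng.choice(VOCAB_IMG, p=probs)))  # sample the next image token
        candidates.append(vqgan_decode(np.array(tokens)))
    # CLIP re-ranks the candidates; keep the image that best matches the caption.
    return max(candidates, key=lambda img: clip_score(img, caption_tokens))

best_image = generate(caption_tokens=[5, 42, 7])
print(best_image.shape)  # (256, 256, 3)
```

Running this end to end just prints the shape of the selected array; in the real system, each stand-in would be the corresponding pre-trained network.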
