OpenAI’s DALL·E

Why does the name DALL·E ring a bell?

You are probably thinking of Salvador Dalí, the famous Spanish Surrealist painter known for exploring subconscious imagery, or perhaps of Pixar’s adorable robot WALL·E. If you thought of either of these, then you are halfway right about the origin of the name DALL·E!

The name DALL·E is a portmanteau of Salvador Dalí and WALL·E.

“A picture is worth a thousand words” is sometimes traced back to a paraphrase of Henrik Johan Ibsen, famously referred to as the father of realism. DALL·E does the opposite of that famous quote: instead of a thousand words describing a picture, one line of descriptive text can generate thousands of images.

DALL·E is a trained neural network that generates images from text captions (and can also perform image-to-image transformations) in a zero-shot fashion for captions expressed in natural language. It is built around a 12-billion-parameter sparse transformer trained on a dataset of text–image pairs, with CLIP (OpenAI’s model published in January 2021) used to rank its outputs.

Before DALL·E

In February 2019, OpenAI introduced GPT-2. It performed natural language processing tasks such as completing sentences, summarizing text, and answering questions.

An example of text completion by GPT-2.

The main objective behind GPT-2 was to predict the next word given the preceding words of the text.
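
As a quick illustration, here is a minimal sketch of that next-word prediction loop using the publicly released GPT-2 weights through the Hugging Face transformers library (the library choice is ours for illustration; it is not how OpenAI originally served the model):

```python
# Minimal sketch: sample a continuation from the public GPT-2 checkpoint.
# Requires `pip install transformers torch`.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "In a shocking finding, scientists discovered"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# The model repeatedly predicts the next token given everything before it.
output = model.generate(input_ids, max_length=40, do_sample=True, top_k=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```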

Then, in June 2020, GPT-3 was introduced. It could generate website layouts and SQL queries from a written description.

An example of code generation (JSX code) using GPT-3.
An example of code generation (SQL code) using GPT-3.

Image GPT was introduced around the same time. Given a partial image as input, it would complete the rest.

Partial images of popular memes are the input; Image GPT completes them.

It was great walking down memory lane, but it’s time to get back to the future!

In January 2021, just six months after GPT-3 and Image GPT, OpenAI released DALL·E, an AI program that generates images from descriptive text.

What can DALL·E do?

  1. Control Attributes

DALL·E can control an object’s attributes such as colour, texture, and shape. It can also control how many instances of that object appear in the generated image.

“ a square blue light bulb. a blue light bulb in the shape of a square”.

In the input caption, we specified the colour, the shape, and the object.

“3 clocks are sitting on a table”.

The objects and the number of items are specified. Other examples are: a collection of glasses is sitting on a table, a stack of nails is sitting on a table, etc.

2. Visualize perspective, three-dimensional style, internal and external structures

Visualizing perspective means that the generated image can be rendered from any viewpoint, such as a bottom view, aerial view, or side view.

“The side view of an owl sitting on a mountain”. The object, the viewing perspective, and the place are specified.

3D styles include voxels, claymation, isometric, and even x-ray style.

“a fox claymation sitting in a forest”.

DALL·E can draw the internal and external structure of many kinds of objects, such as a brain, a car, a walnut, a watermelon, a flower, or a leaf. The generated images are fine-grained and detailed.

“a cross-section view of a car”(left) and “a macro photograph of a frozen berry on a tree branch”(right).

3. Flexibility with the medium, time, and season

The medium can be anything from a bag of chips, a neon sign, or a soda can to a mural or even a purse.

“a stained glass window with an image of a blue strawberry”.

The medium can be specified, just like the object and the colour.

“a … of a fox sitting in a field at twilight”. The image can be rendered in different styles, lighting, shadows, times of day, and even seasons.
“ a neon sign that reads ‘clip’ ”.

4. Dressing and Interior designing

DALL·E can help you choose your outfit as well as design your room! It all depends on the partial image and description you give as input.

“a female mannequin dressed in a white cardigan and black wrap sweater”.
“a loft bedroom with a pink bed next to the shelf. there is a guitar standing beside the bed”.

5. Abstract images and illustrations

“In the shape of”, “made of”, “in the form of”, and “in the style of” are the key phrases used when dealing with abstract images.

“a penguin made of cursive letters. a penguin with the texture of cursive letters”. This combines unrelated concepts such as a penguin and cursive letters.

DALL·E can make illustrations of “anthropomorphized versions of animals and objects, animal chimeras, and emojis”.

“an illustration of a baby daikon radish in a tutu walking a dog”.
“a professional high-quality emoji of a cute slice of pepperoni pizza”.

6. Geographic facts, landmarks, and temporal knowledge

DALL·E has learned geographic facts, landmarks, and even its own neighborhood (San Francisco), and it also shows temporal knowledge of how objects looked in different eras.

“a photo of the food of Japan”.
“a photo of a phone from the … “.

How does DALL·E work?

The goal was to train a transformer to autoregressively model the text and image tokens as a single stream of data.

To avoid spending too much of the model’s capacity on capturing high-frequency visual detail rather than the structure that makes objects recognizable, training proceeds in two steps.

Step 1:

A dVAE (discrete variational autoencoder) is trained to compress each 256 × 256 RGB image into a 32 × 32 grid of image tokens, each drawn from a vocabulary of 8192 codes. This leaves 1,024 image tokens per image instead of the 196,608 raw pixel values (256 × 256 × 3).

The top image is the original; the image below is its dVAE reconstruction. The main features are still visible, but details such as the texture of the cat’s fur or the green colour of the stand outside the store are lost or distorted.
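
To make this step concrete, here is a toy sketch of the encoder half of such a discrete VAE in PyTorch. The channel sizes are invented for illustration, and the real dVAE is trained with a Gumbel-softmax relaxation rather than the hard argmax used here:

```python
import torch
import torch.nn as nn

# Toy sketch: downsample a 256x256 RGB image by a factor of 8 to a 32x32 grid
# and assign each grid cell one of 8192 discrete codebook ids.
CODEBOOK_SIZE = 8192
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),     # 256 -> 128
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),  # 64  -> 32
    nn.Conv2d(256, CODEBOOK_SIZE, 1),                        # per-cell logits
)

image = torch.randn(1, 3, 256, 256)       # a fake RGB image
logits = encoder(image)                   # shape (1, 8192, 32, 32)
tokens = logits.argmax(dim=1)             # shape (1, 32, 32): discrete image tokens
print(tokens.numel())                     # 32 * 32 = 1024 tokens per image
```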

Step 2:

The 1,024 image tokens are concatenated with up to 256 BPE-encoded text tokens and fed to a 12-billion-parameter sparse transformer, which models the whole stream autoregressively. At generation time, candidate images are reranked with CLIP (OpenAI’s model published in January 2021) to select the ones that best match the caption.
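
A heavily simplified sketch of this single-stream idea is shown below. The vocabulary sizes, layer count, and the omission of positional embeddings are all simplifications for illustration; this is not the real 12-billion-parameter configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch: text tokens and 32x32 = 1024 image tokens share one id space and
# are modelled as a single autoregressive stream with a next-token loss.
TEXT_VOCAB, IMAGE_VOCAB, D = 16384, 8192, 512
embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=4)
head = nn.Linear(D, TEXT_VOCAB + IMAGE_VOCAB)

def next_token_loss(text_tokens, image_tokens):
    # shift image ids past the text vocabulary so both live in one id space
    stream = torch.cat([text_tokens, image_tokens + TEXT_VOCAB], dim=1)
    x = embed(stream[:, :-1])                                 # inputs
    mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
    logits = head(transformer(x, mask=mask))                  # causal attention
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           stream[:, 1:].reshape(-1))         # predict next token

loss = next_token_loss(torch.randint(0, TEXT_VOCAB, (2, 64)),
                       torch.randint(0, IMAGE_VOCAB, (2, 1024)))
```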

The Unexpected DALL·E

DALL·E’s main aim was text-to-image generation, but unexpectedly it can perform image-to-image transformation as well. It can rotate an image, redraw it as a sketch, turn it into a postage stamp, and so on. These are advanced tasks, though the model currently performs them only at a rudimentary level.

The image of the cats has been tinted red, given sunglasses, and made into a postage stamp.

Comparison

For evaluation, MS-COCO and CUB are the two popular text-to-image generation datasets, largely because of their relatively small size.

The Inception Score and the FID (Fréchet Inception Distance) are used to compare the models.

Let’s have a look at what these scores mean!

Inception Score:

  • It evaluates only the distribution of the generated images.
  • The higher the Inception Score value, the better the model.
Models before January 2021 with high Inception Scores on the MS-COCO dataset.
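
As a rough illustration, the Inception Score can be computed from a classifier’s predicted class probabilities as in the NumPy sketch below (in the standard setup, p(y|x) comes from an Inception v3 network; here we assume those probabilities are already available):

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    # p_yx: (N, C) array of class probabilities p(y|x) for N generated images,
    # as predicted by a pretrained classifier (Inception v3 in the usual setup)
    p_y = p_yx.mean(axis=0, keepdims=True)                    # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                           # higher is better

# toy usage with random "predictions"
fake_probs = np.random.dirichlet(np.ones(1000), size=5000)
print(inception_score(fake_probs))
```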

FID (Fréchet Inception Distance):

  • It compares the distribution of generated images with the distribution of real images used for training.
  • The lower the FID score, the better the model.
Models before January 2021 with low FID scores on the MS-COCO dataset.
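
A similarly minimal sketch of the FID computation, assuming Inception features for the real and generated images have already been extracted:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_r, sigma_r, mu_g, sigma_g):
    # Frechet distance between two Gaussians fitted to Inception features:
    # ||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2 * sqrt(sigma_r @ sigma_g))
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# toy usage: feature vectors for "real" and "generated" images (2048-d in practice)
real = np.random.randn(5000, 64)
gen = np.random.randn(5000, 64)
score = fid(real.mean(0), np.cov(real, rowvar=False),
            gen.mean(0), np.cov(gen, rowvar=False))
print(score)                              # lower is better
```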

To find the best model, the models with a high Inception Score and a low FID are compared against one another. The models with a high Inception Score and a low FID are AttnGAN, DM-GAN, and DF-GAN.

Comparison of the FID scores of different methods on the MS-COCO dataset.
Legend for the comparison graphs.
Comparison of the Inception Scores of different methods on the MS-COCO dataset.

It is clear that the zero-shot DALL·E approach achieves a higher Inception Score and a lower FID than the best previous models. The image below illustrates this advantage with side-by-side samples from the different models.

Comparison between the images generated by the models.

Keep in mind that each sample from the zero-shot method is the best of 512 candidates, as ranked by the contrastive (CLIP) model. Why best of 512 rather than best of 64 or best of 8 is illustrated in the image below.

Contrastive reranking procedure using Best of 1, 8, 64 and 512.

We can conclude that taking the best of 512 gives the strongest results, since the top-row images are more realistic and accurate than the images in the lower rows.
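
For intuition, a best-of-N contrastive reranking pass could be sketched with OpenAI’s open-sourced CLIP package as below; the candidate filenames are hypothetical placeholders, and this is not DALL·E’s actual sampling code:

```python
# Rough sketch: score N candidate images against one caption with CLIP and
# keep the highest-scoring ones. Requires OpenAI's CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

caption = "an armchair in the shape of an avocado"
candidates = [f"sample_{i}.png" for i in range(512)]          # hypothetical files

text = clip.tokenize([caption]).to(device)
images = torch.stack([preprocess(Image.open(p)) for p in candidates]).to(device)

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(text)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    scores = (img_feats @ txt_feats.T).squeeze(-1)            # cosine similarity

best = scores.argsort(descending=True)[:8]                    # keep the top 8
print([candidates[i] for i in best.tolist()])
```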

Even in the human evaluation comparing the zero-shot model with DF-GAN (prior work) using a best-of-five vote, the zero-shot model’s output was chosen as the most realistic 90% of the time, and as the image best matching a shared caption 93.3% of the time.

Conclusion

DALL·E can generate images ranging from illustrations to photorealistic scenes, and its official blog post lets you experiment with different inputs.

DALL·E outperforms the models that came before it. The graphical analysis shows that it achieves exceptional results on the MS-COCO dataset, and the side-by-side image comparison tells the same story: its outputs are more realistic and more relevant to the text.
