OpenAI’s DALL·E
Why does the name DALL·E ring a bell?
You are probably thinking of Salvador Dalí, the famous Spanish Surrealist painter known for exploring subconscious imagery, or perhaps of Pixar’s adorable robot WALL·E. If you thought of either, you are halfway right about the origin of the name: DALL·E is a portmanteau of the two!
“A picture is worth a thousand words” is a paraphrase of a saying attributed to Henrik Johan Ibsen, famously referred to as the father of realism. DALL·E does the opposite of his famous quote: instead of a thousand words describing one picture, a single line of descriptive text can generate thousands of images.
DALL·E is a trained neural network that performs text-to-image and image-to-image generation in a zero-shot fashion from captions expressed in natural language. It combines a 12-billion-parameter sparse transformer, CLIP (OpenAI’s contrastive model published in January 2021), and a dataset of image-text pairs.
Before DALL·E
In February 2019, OpenAI introduced GPT-2. It performed natural language processing tasks such as completing sentences, summarizing text, and answering questions.
The main objective behind GPT-2 was to predict the next word using the previous words of the text.
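That objective can be sketched in a few lines. The following is a minimal toy illustration (a bigram model, nothing like GPT-2’s actual architecture) of predicting the next word from the words that came before it:

```python
# Toy next-word prediction with a bigram model: a stand-in for the
# autoregressive objective that GPT-2 trains at a vastly larger scale.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which in the corpus.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the word most frequently observed after `word`."""
    followers = bigrams[word]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))  # "cat" (seen twice after "the")
```

A real language model replaces the frequency table with a transformer that conditions on the entire preceding context, but the training target is the same: the next token.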
Then, in June 2020, GPT-3 was introduced. It could generate website layouts and SQL code from a written description.
Image GPT was introduced around the same time. It completes a partially drawn image based on the pixels it is given as input.
It was great walking down history lane but it’s time to get back to the future!
Now, in January 2021, just six months after GPT-3 and Image GPT, we have DALL·E, an AI program that generates images from descriptive text.
What can DALL·E do?
1. Control attributes
DALL·E can control an object’s attributes, such as its color, texture, and shape. It can also control how many instances of that object appear in the generated image.
“a square blue light bulb. a blue light bulb in the shape of a square”.
In the input caption, we specified the color, the shape, and the object.
Both the objects and their counts can be specified. Other examples include “a collection of glasses is sitting on a table” and “a stack of nails is sitting on a table”.
2. Visualize perspective, three-dimensional style, internal and external structures
Visualizing perspective means that the generated image can show the subject from any viewpoint, such as a bottom view, an aerial view, or a side view.
3D styles include voxels, claymation, isometric, and even x-ray style.
DALL·E can draw the interiors and exteriors of many kinds of objects, such as a brain, a car, a walnut, a watermelon, a flower, or a leaf. The generated images are fine-grained and detailed.
3. Flexibility with the medium, time, and season
The medium can be anything from a chip bag, a neon sign, or a soda can to a mural or even a purse.
“a stained glass window with an image of a blue strawberry”.
The medium can be specified, like the object and color.
4. Dressing and interior design
DALL·E can help you choose your outfit as well as design your room! It all depends on the partial image and description you give as input.
5. Abstract images and illustrations
“In the shape of”, “made of”, “in the form of”, and “in the style of” are the key phrases to use when generating abstract images.
DALL·E can make illustrations of “anthropomorphized versions of animals and objects, animal chimeras, and emojis”.
6. Geographic facts, landmarks, and temporal knowledge
DALL·E demonstrates awareness of history and geography, down to the level of individual neighborhoods (such as those of San Francisco).
How does DALL·E work?
The goal was to train a transformer to autoregressively model the text and image tokens as a single stream of data.
Modeling raw pixels directly would spend too much of the model’s capacity on capturing high-frequency details, so training proceeds in two steps.
Step 1:
A dVAE (discrete variational autoencoder) is trained to compress each 256×256 image into a 32×32 grid of image tokens. This yields 1,024 image tokens per image instead of the 65,536 raw pixel positions.
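The compression in Step 1 is easy to sanity-check with a little arithmetic (the token counts are from the description above; everything else is elementary):

```python
# Sanity-check the dVAE compression described in Step 1.
pixels = 256 * 256        # spatial positions in the raw 256x256 image
image_tokens = 32 * 32    # discrete tokens after the dVAE encoder

print(image_tokens)            # 1024 tokens per image
print(pixels // image_tokens)  # 64x fewer spatial positions to model
```

The transformer therefore attends over roughly a thousand image positions instead of tens of thousands of pixels, which is what makes autoregressive modeling of images tractable here.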
Step 2:
These dVAE image tokens are concatenated with the caption’s text tokens and fed to a 12-billion-parameter sparse transformer, which is trained to model the combined sequence autoregressively. At generation time, candidate images are ranked against the caption using CLIP, OpenAI’s contrastive model published in January 2021.
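The single-stream layout from Step 2 can be sketched as follows. The token IDs here are hypothetical and the encodings are faked; the point is only how text and image tokens form one sequence for autoregressive training:

```python
# Hypothetical token IDs; this illustrates the single-stream layout
# only, not real BPE or dVAE encodings.
text_tokens = [17, 92, 403]            # BPE-encoded caption tokens
image_tokens = [5011, 230, 77, 4096]   # dVAE image tokens

# Text and image tokens are modeled as one combined stream.
stream = text_tokens + image_tokens

# Autoregressive training: each position is predicted from
# everything that precedes it in the stream.
training_pairs = [(stream[:i], stream[i]) for i in range(1, len(stream))]
print(training_pairs[0])  # ([17], 92)
```

Because image tokens always come after the text tokens, generating an image at inference time is just continuing the sequence past the caption.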
The Unexpected DALL·E
DALL·E’s main aim was text-to-image generation, but unexpectedly it can perform image-to-image transformation as well: it can rotate an image, turn it into a sketch, render it as a postage stamp, and so on. Though these are advanced tasks, the model performs them only at a rudimentary level.
COMPARISON
MS-COCO and CUB are the two datasets most commonly used to evaluate text-to-image models, partly because of their relatively small size.
The Inception Score (IS) and the Frechet Inception Distance (FID) are used to compare models.
Let’s have a look at what these scores mean!
Inception Score:
- It evaluates only the distribution of the generated images (their quality and diversity), without reference to the real data.
- The higher the Inception Score, the better the model.
FID (Frechet Inception Distance):
- It compares the distribution of generated images with the distribution of real images used for training.
- The lower the FID score, the better the model.
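Both metrics follow directly from their definitions. Below is a toy sketch from those formulas: real evaluations first run images through an Inception network to get class probabilities and features, which is omitted here, and the matrix square root helper assumes a symmetric input for simplicity:

```python
import numpy as np

def inception_score(p_yx):
    """p_yx: (n_images, n_classes) predicted class probabilities.
    IS = exp( mean_x KL( p(y|x) || p(y) ) ); higher is better."""
    p_y = p_yx.mean(axis=0)  # marginal class distribution
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)
    return float(np.exp(kl.mean()))

def sqrtm_sym(m):
    """Matrix square root via eigendecomposition (assumes symmetric PSD)."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def fid(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1-mu2||^2 + Tr(S1 + S2 - 2*sqrt(S1 S2)); lower is better."""
    diff = mu1 - mu2
    covmean = sqrtm_sym(sigma1 @ sigma2)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean))

# Confident, diverse predictions give a higher IS than uninformative ones...
sharp = np.array([[0.9, 0.1], [0.1, 0.9]])
flat = np.array([[0.5, 0.5], [0.5, 0.5]])
# ...and identical feature distributions give an FID of 0.
mu, sigma = np.zeros(2), np.eye(2)
print(inception_score(sharp) > inception_score(flat))  # True
print(round(fid(mu, sigma, mu, sigma), 6))             # 0.0
```

The example shows why the two scores point in opposite directions: IS rewards generated images on their own, while FID measures distance to the real data, so a perfect match scores zero.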
To find the best model, we look for a high Inception Score together with a low FID score. Among prior work, AttnGAN, DM-GAN, and DF-GAN score best on these criteria.
On MS-COCO, the zero-shot DALL·E approach achieves a higher Inception Score and a lower FID score than the best previous models. The image below illustrates how DALL·E compares with the other models.
Keep in mind that each zero-shot sample shown is the best of 512 candidates, as ranked by the contrastive model. The image below justifies choosing the best of 512 rather than, say, the best of 64 or the best of 8.
We can conclude that taking the best of 512 gives the best result, since those images are more realistic and more faithful to the caption than the images selected from smaller candidate pools.
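The best-of-N reranking idea is simple enough to sketch. Here both the generator and the scorer are stand-ins with illustrative names (a real system would sample images from the model and score each caption-image pair with CLIP’s similarity):

```python
import random

def generate_candidates(caption, n):
    # Stand-in for drawing n samples from the image generator.
    return [f"sample-{i}" for i in range(n)]

def clip_score(caption, image):
    # Stand-in for CLIP's image-text similarity: a real system would
    # embed both with the contrastive model and take a cosine similarity.
    # A string-seeded RNG keeps this toy score deterministic.
    rng = random.Random(caption + "|" + image)
    return rng.random()

def best_of_n(caption, n=512):
    """Sample n candidates and keep the one the scorer ranks highest."""
    candidates = generate_candidates(caption, n)
    return max(candidates, key=lambda img: clip_score(caption, img))

print(best_of_n("a stained glass window with an image of a blue strawberry", n=8))
```

The trade-off the figure above explores is exactly the choice of `n`: larger candidate pools cost more compute at inference time but give the reranker more chances to find a realistic, caption-matching sample.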
In a human evaluation against DF-GAN (prior work) using a best-of-five vote, the zero-shot model’s output was chosen as the most realistic 90% of the time, and as the best match for a shared caption 93.3% of the time.
Conclusion
DALL·E can generate almost any kind of image, from illustrations to photorealistic scenes. DALL·E’s official blog allows experimentation with different inputs.
DALL·E outperforms the models that came before it. The graphical analysis shows that DALL·E achieves exceptional results on the MS-COCO dataset, and even a side-by-side comparison of generated images shows that DALL·E performs far better, since its images are more realistic and more relevant to the text.
References
2. Zero-Shot Text-to-Image Generation
3. https://www.youtube.com/watch?v=C7D5EzkhT6A&t=280s
4. https://paperswithcode.com/sota/text-to-image-generation-on-coco