The Artistic Potential of AI: Understanding DALL-E 2 and Stable Diffusion

Machine Minds
SFU Professional Computer Science
13 min read · Feb 11, 2023

Authors: Ayman Shafi, Gaurav Bhagchandani, Naina Thapar, Raghu Kapur, Shubham Bhatia

This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/mpcs.

An Overview

Artificial Intelligence has transcended science fiction and is now an integral part of the digital world. It is, in essence, the creation of machines that think and act like humans, with the ability to learn and make decisions.

Up until recently, AI was restricted to a particular set of activities or use cases, and models could only deal with a limited number of issues. This is where generative AI enters the picture, revolutionizing the field of artificial intelligence. The term “generative AI” refers to AI models that are capable of producing original, previously unseen text, images, sounds, or even entire songs. This expands the realm of possibilities and uses for AI, making it more approachable and practical for a larger group of people.

DALL-E 2, a cutting-edge AI model created by OpenAI that can generate original images from text descriptions, is one such example of generative AI. Its transformer-based design and extensive training data make it one of the most sophisticated image generators currently available. With DALL-E 2, creatives, designers, and artists can now produce a wide spectrum of imaginative visuals, opening up a whole new universe of potential applications for AI.

Stable Diffusion is another deep learning model for generative AI, developed by a team of researchers at LMU Munich known as CompVis. It is an open-source text-to-image model that uses a frozen CLIP ViT-L/14 text encoder to generate images from text prompts. It starts from a noisy image and iteratively removes the noise until an image emerges that best matches the text description.

AI and Generative AI are rapidly changing the way we perceive and interact with technology, and with models like DALL-E 2 and Stable Diffusion, the possibilities are truly endless. Whether you’re an artist looking to push the boundaries of creativity or a scientist seeking to solve complex problems, AI has something to offer for everyone.

Next, we will be going through the technology that powers DALL-E 2. For those of you who are only interested in seeing examples and generating your own images through AI, click here to jump straight there.

The Technology Behind DALL-E 2

DALL-E 2 is a transformer-based deep learning model that generates high-quality images from text descriptions. It was developed by researchers at OpenAI and is considered to be one of the most advanced image generators available.

The model was trained on a very large dataset of images paired with text, allowing it to generate creative and varied images that range from realistic to highly stylized. The input to the model is a text description, and the output is an image generated from that description.

Neural Network Architecture

The architecture of DALL-E 2 begins with an input embedding layer followed by a 12-layer transformer, in which each layer consists of a multi-head self-attention mechanism followed by a feed-forward module.

DALL-E 2 Image Generation Architecture

The model takes a text description as input and passes it through the input embedding layer, where the text is converted into a vector representation that the neural network can process. This vector representation is then passed through the transformer layers, which output a representation that captures the context of the input text.

The fundamental element of a transformer-based architecture is the multi-head self-attention mechanism. It works by computing attention scores between each element of the input text and every other element; these scores determine the weighted contribution of each element to the output. The mechanism is called “multi-head” because it computes several sets of attention scores concurrently, enabling it to capture many facets of the relationships between parts of the input text.

The attention output is passed to the feed-forward network, a fully connected network that produces the final representation of the input text. The feed-forward network is typically implemented as a sequence of fully connected layers, with non-linear activation functions applied between them.
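To make this concrete, here is a minimal sketch of a single transformer encoder block in Python (PyTorch), combining the multi-head self-attention and feed-forward pieces described above. This is an illustrative simplification, not DALL-E 2’s actual implementation; the embedding size, number of heads, and hidden width are placeholder values.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: multi-head self-attention followed by a feed-forward network."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),               # non-linear activation between the fully connected layers
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: every token attends to every other token in the input text
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward network applied to each position independently
        x = self.norm2(x + self.ff(x))
        return x

# Example: a batch of 1 sequence with 16 token embeddings of size 512
tokens = torch.randn(1, 16, 512)
block = TransformerBlock()
print(block(tokens).shape)  # torch.Size([1, 16, 512])

A full 12-layer transformer would simply stack twelve such blocks one after another.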

Finally, the output layer in DALL-E 2 is responsible for converting the output of the feed-forward network into the final image. It is usually implemented as a series of convolutional and deconvolutional layers that map the network’s output to an image, with the number of layers and the filter sizes chosen according to the desired image resolution and the size of the input text.
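As a rough illustration of what such an upsampling output stage could look like, here is a hedged sketch using transposed convolutions in PyTorch. The channel counts, layer count, and spatial sizes are invented for the example and are not taken from DALL-E 2.

import torch
import torch.nn as nn

# Hypothetical output stage: map a feature map to an RGB image by repeated upsampling
decoder = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 8x8 -> 16x16
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 16x16 -> 32x32
    nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),     # 32x32 -> 64x64 RGB
    nn.Tanh(),
)

features = torch.randn(1, 256, 8, 8)   # placeholder feature map derived from the text representation
print(decoder(features).shape)         # torch.Size([1, 3, 64, 64])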

Now, there are two main technologies that power this architecture. The first is known as CLIP, or Contrastive Language-Image Pre-training. While that sounds like a daunting name, it is actually a pretty simple concept.

CLIP

CLIP is an embedding model, i.e. it is used to learn a latent space that connects text and images. A latent space is a learned, typically lower-dimensional representation of data in which similar items end up close to each other. So, in the case of CLIP embeddings, you would find images of cats clustered together, images of dogs clustered together, and so on.

This allows us to create variations of images in an extremely simple way. We simply move in the latent space to create a variation in the image. The image below might help you visualize what moving through the latent space would look like in the final result.

a photo of a cat → an anime drawing of a super Saiyan cat
a photo of a Victorian house → a photo of a modern house
a photo of an adult lion → a photo of a lion cub
a photo of a landscape in winter → a photo of a landscape in fall

As you can see, certain features of the image do not change significantly while others do. Another way to imagine this is moving in a 2D plane where the variation along the X-axis is much greater than the variation along the Y-axis: the X-axis represents the important features that change, while the Y-axis represents the unimportant ones that stay roughly the same.
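As a hedged sketch of what “moving in the latent space” could look like in code, the snippet below linearly interpolates between two embedding vectors. The embeddings here are random placeholders standing in for real CLIP embeddings; in practice they would come from a trained encoder.

import numpy as np

def interpolate(a, b, steps=5):
    """Walk from embedding a to embedding b in equal steps along a straight line."""
    return [(1 - t) * a + t * b for t in np.linspace(0.0, 1.0, steps)]

# Placeholder embeddings; in practice these would come from a CLIP encoder
cat_photo = np.random.randn(512)
anime_cat = np.random.randn(512)

for i, z in enumerate(interpolate(cat_photo, anime_cat)):
    z = z / np.linalg.norm(z)   # keep the vector normalized, as CLIP embeddings usually are
    print(f"step {i}: first 3 dims -> {z[:3]}")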

Let’s understand how CLIP is trained to generate these embeddings. The image below depicts the model from the research paper by OpenAI. The basic idea is to train on pairs of images and text using a contrastive objective function. A contrastive objective function works by rewarding correct pairs and penalizing incorrect pairs. So, in the model below, all pairs on the diagonal are correct and rewarded, while all other pairs are penalized. The loss function used in the paper is a symmetric cross entropy loss, calculated over the cosine similarities of the N correct pairs and the N² − N incorrect pairs of data.

CLIP trains a text encoder and an image encoder jointly on this objective function to minimize the cross entropy loss. The trained model is then tested by passing it a new image and checking whether it matches the image to the correct text. The final result is a model that can map a text description to a representation of an image in a shared high-dimensional space.

Summary of the approach
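The contrastive objective is easier to see in code. Below is a minimal sketch of the symmetric cross entropy loss over cosine similarities, loosely following the numpy-style pseudocode in the CLIP paper; the batch size, embedding size, and temperature value here are placeholders.

import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric cross entropy over cosine similarities of N image-text pairs."""
    # L2-normalize so the dot product equals the cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N similarity matrix: the diagonal holds the N correct pairs,
    # the remaining N² − N entries are the incorrect pairs
    logits = image_features @ text_features.T / temperature
    labels = torch.arange(logits.shape[0])

    loss_i = F.cross_entropy(logits, labels)     # match each image to its text
    loss_t = F.cross_entropy(logits.T, labels)   # match each text to its image
    return (loss_i + loss_t) / 2

# Example with random stand-ins for encoder outputs (batch of 8, embedding size 512)
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_loss(img, txt))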

The end result is a model that can be used for zero-shot prediction. Zero-shot prediction means predicting an unseen category of data. This is different from our testing phase, where we test the model with new images from categories it has already seen. As an example, we could train the model on images of dogs and then tell it, through a textual description alone, that a category called wolves exists. If the model can then recognize a photo of a wolf during testing, without ever having seen a wolf during training, we would call it zero-shot capable.
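A hedged sketch of how zero-shot prediction could look with a CLIP-style model: encode a text prompt for every candidate class name, encode the image, and pick the class whose text embedding is most similar. The encoder below is a random placeholder standing in for the real CLIP encoders.

import torch
import torch.nn.functional as F

def zero_shot_classify(image_embedding, class_names, text_encoder):
    """Pick the class whose text embedding is closest (by cosine similarity) to the image."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embeddings = torch.stack([text_encoder(p) for p in prompts])

    image_embedding = F.normalize(image_embedding, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    similarities = text_embeddings @ image_embedding   # one score per class
    return class_names[similarities.argmax().item()]

# Placeholder encoder that returns a random embedding; a real CLIP text encoder would go here
fake_text_encoder = lambda prompt: torch.randn(512)
fake_image_embedding = torch.randn(512)

print(zero_shot_classify(fake_image_embedding, ["dog", "wolf"], fake_text_encoder))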

Diffusion Model

The second technology behind DALL-E 2 is a diffusion model. Diffusion models generate data that is similar to the data they were trained on. They are trained by adding noise to a training image and teaching the model to recover the original, denoised image as closely as possible. Generation is non-deterministic, since every image denoised by the model will have slight variations. In DALL-E 2, diffusion models are used in two places: a decoder and a prior.
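Here is a very small sketch of the training idea, under the usual DDPM-style formulation (an assumption on our part; DALL-E 2’s actual training setup is more involved): corrupt a clean image with Gaussian noise at a random level and train a network to predict that noise. The denoiser below is a trivial placeholder, not a real diffusion U-Net.

import torch
import torch.nn as nn

# Placeholder denoiser; a real diffusion model would be a U-Net conditioned on the timestep
denoiser = nn.Conv2d(3, 3, kernel_size=3, padding=1)

def diffusion_training_step(clean_images, betas):
    """One step of the standard 'predict the added noise' training objective."""
    t = torch.randint(0, len(betas), (clean_images.shape[0],))          # random noise level per image
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1, 1)  # how much signal is left at step t

    noise = torch.randn_like(clean_images)
    noisy = alpha_bar.sqrt() * clean_images + (1 - alpha_bar).sqrt() * noise

    predicted_noise = denoiser(noisy)
    return ((predicted_noise - noise) ** 2).mean()   # simple MSE loss on the noise

images = torch.randn(4, 3, 64, 64)        # stand-in batch of training images
betas = torch.linspace(1e-4, 0.02, 1000)  # a standard linear noise schedule
print(diffusion_training_step(images, betas))

At generation time the process runs in reverse: start from pure noise and repeatedly apply the denoiser to peel the noise away step by step.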

A decoder non-deterministically converts an image embedding to an actual image. This means that it retains the main features of the image while varying the other details. The image below, taken from the research paper, should help visualize it. DALL-E 2 uses OpenAI’s GLIDE for decoding, with a slight modification: it adds the generated CLIP embeddings to give GLIDE more context. We won’t go into detail about GLIDE, but the research paper for it is listed in the References section at the end.

Variations of an input image by encoding with CLIP and then decoding with a diffusion model. The variations preserve both semantic information like presence of a clock in the painting and the overlapping strokes in the logo, as well as stylistic elements like the surrealism in the painting and the color gradients in the logo, while varying the non-essential details.

However, we need an image embedding to pass to the decoder. To do this, we use a prior that converts a CLIP text embedding to a CLIP image embedding. This image embedding can be used by the decoder to create variations of an image.

While decoding, the amount of detail in the generated image depends on the number of principal components retained in the CLIP latent. More dimensions equate to higher detail but extra computation; fewer dimensions mean faster computation at the cost of detail.

Visualization of reconstructions of CLIP latents from progressively more PCA dimensions (20, 30, 40, 80, 120, 160, 200, 320 dimensions), with the original source image on the far right. The lower dimensions preserve coarse-grained semantic information, whereas the higher dimensions encode finer-grained details about the exact form of the objects in the scene.
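As a hedged illustration of this trade-off, the sketch below projects a batch of stand-in embeddings onto their first k principal components and reconstructs them; the fewer dimensions kept, the larger the reconstruction error. The data here is random and only stands in for real CLIP latents.

import numpy as np

def truncate_with_pca(embeddings, k):
    """Keep only the first k principal components of a batch of CLIP-style embeddings."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Principal directions from the SVD of the centered data
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ components[:k].T        # compress to k dimensions
    return projected @ components[:k] + mean       # reconstruct back to the full space

# Random stand-ins for 1000 image embeddings of size 512
embeddings = np.random.randn(1000, 512)
for k in (20, 80, 320):
    error = np.linalg.norm(embeddings - truncate_with_pca(embeddings, k))
    print(f"{k} dimensions kept -> reconstruction error {error:.1f}")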

Finally, DALL-E 2 also uses two upscaling models. The first one upscales the result from 64x64 to 256x256. The second one further upscales this to 1024x1024 to generate a high resolution image. You can find more details about these in the referenced research paper for DALL-E 2.

The part above the dotted line represents CLIP being trained. The same text embedding is then used below the dotted line which represents the image generation process.

So, to recap, the flow of DALL-E 2 is as follows:

  1. Generate CLIP Text embeddings for possible text captions.
  2. Use a diffusion prior model to convert text embeddings to image embeddings.
  3. Use a diffusion decoder to stochastically generate an image from an image embedding.
  4. Upscale the generated image from 64x64 to 1024x1024.
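Sketched as runnable pseudocode, with every component below a hypothetical placeholder rather than a real model or API:

# Hypothetical stand-ins for the trained components; each would be a real model in practice
clip_text_encoder = lambda caption: {"text_embedding": caption}
diffusion_prior   = lambda text_emb: {"image_embedding": text_emb}
diffusion_decoder = lambda image_emb: {"image": image_emb, "resolution": 64}
upscale           = lambda image, size: {**image, "resolution": size}

def dalle2_generate(caption):
    """High-level DALL-E 2 flow, mirroring steps 1-4 above."""
    text_embedding = clip_text_encoder(caption)          # 1. CLIP text embedding
    image_embedding = diffusion_prior(text_embedding)    # 2. prior: text embedding -> image embedding
    image_64 = diffusion_decoder(image_embedding)        # 3. decoder: stochastic generation at 64x64
    image_256 = upscale(image_64, 256)                   # 4a. first upscaler: 64x64 -> 256x256
    return upscale(image_256, 1024)                      # 4b. second upscaler: 256x256 -> 1024x1024

print(dalle2_generate("a detective puppy"))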

Tutorial: How to use the DALL-E 2 API

  • Visit the OpenAI API website and sign up for an API key. This will provide you with access to the DALL-E 2 API.
  • Choose a programming language and install the necessary packages specific to the API. For illustration, let us use Python:
pip install openai
  • Write a script to send a text description to the API and receive an image in return. The API accepts a text description as an input and returns a URL to an image.
import openai

# Authenticate with your API key (keep it secret in real code)
openai.api_key = '<YOUR_API_KEY>'

# Request an image for the given text prompt
response = openai.Image.create(prompt="a detective puppy")

# The API returns a URL pointing to the generated image
image_url = response['data'][0]['url']
print(image_url)
  • Experiment with different descriptions to see the different images that DALL-E 2 can generate. You can try describing objects, animals, scenes, or even fictional characters and creatures.
  • You can also control the size and style of the image by adding specific parameters to your API request. For example, you can specify the image size (e.g., 512x512 pixels), number of images (1–10) and more.
import openai

openai.api_key = '<YOUR_API_KEY>'

# Generate one 512x512 image for the prompt
response = openai.Image.create(
    prompt="teddy bears working on AI research in outer space",
    n=1,              # number of images to generate (1-10)
    size="512x512",   # supported sizes: 256x256, 512x512, 1024x1024
)
image_url = response['data'][0]['url']
print(image_url)
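  • Since the API returns a URL rather than the image itself, a natural next step is to download and save the file. Below is a minimal sketch using the third-party requests library; the filename is just an example.

import requests

def save_image(image_url, filename="generated.png"):
    """Download the image behind the returned URL and save it locally."""
    image_bytes = requests.get(image_url, timeout=30).content
    with open(filename, "wb") as f:
        f.write(image_bytes)

# Using the URL printed by the snippet above:
# save_image(image_url, "detective_puppy.png")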
  • Once you have generated a set of images, you can use them for various purposes, such as training a machine learning model, creating a website or application, or even just for personal use.
  • Please remember that DALL-E 2 is still a work in progress, so the results may be far from perfect. However, its capabilities and quality are likely to improve over time.

Comparison between DALL-E 2 and Stable Diffusion

Which of these models is better, you ask? It really depends on what you’re looking for. If you want a hosted service that produces unique, high-quality images with minimal setup, DALL-E 2 is the way to go. However, if you want an open-source model that you can run on your own hardware, fine-tune, and build into real-world applications, then Stable Diffusion is the right option.

Limitations

DALL-E 2

  1. DALL-E 2 still struggles with language comprehension and can misinterpret prompts: “a horse doing an oil painting” can sometimes be rendered as “an oil painting of a horse”
  2. Colors can also get mismatched between objects, where a prompt like “a red flower and a blue book” might produce an image containing a blue flower and a red book
  3. Some inputs, like scientific writing or medical imagery, are not handled correctly and almost always produce nonsensical output
  4. Access is limited: the model is closed-source and cannot be run on a user’s own CPU / GPU
  5. Generated images can be dull and plain if the prompt does not include enough specific modifiers

Stable Diffusion

  1. Generating human limbs and teeth is an ongoing issue, due to the poor quality of limbs in the images of the LAION dataset on which Stable Diffusion was trained. Prompts that call for such images can confuse the model and produce incomprehensible or “weird” looking results
  2. It can ignore parts of a prompt that do not seem to fit together, producing less-than-ideal images that do not match the description
  3. Although the model is open-source and can run on user GPUs, running it on consumer hardware is still challenging, even when the model is further trained on high-quality photos
  4. Fine-tuning the model with new images depends heavily on the quality of the new data; reduced-resolution images in the source data can not only cause the model to fail to learn the new task but also reduce its overall performance

10 Creative Prompts for Generative AI Art (with side-by-side comparison)

Generate a futuristic cityscape that combines organic and mechanical elements, such as flying cars and towering green trees

Panda bear sewing a sweater in space 4k cinematic

A high tech solar punk utopia in the Amazon rainforest, high quality

Design a portrait that merges together different animals and plants, creating a unique hybrid creature. Think about the different textures, colors, and shapes that each element brings, and consider how they can be combined in unexpected and interesting ways, high quality, cinematic, 4k

16 bit pixel art, outside of a coffee shop on a rainy day, light coming from windows, cinematic still, HDR

Cute isometric island, cottage in the woods, waterfall, animals, made with blender

Woman sitting at breakfast table, detailed facial features, vintage photograph, fuji color film, 2000

An expressive oil painting of a basketball player dunking, depicted as an explosion of a nebula

“A sea otter with a pearl earring” by Johannes Vermeer

Futuristic footwear, inspired by God of War, by Nike, hyper realistic, super detailed, shiny, 3D

Try it yourself

DALL-E 2

You can create your own images using DALL-E 2 here. OpenAI gives you 50 free credits when you sign up, and 25 extra credits at the start of every month to play around with.

Stable Diffusion

There is an online version of Stable Diffusion available here for free. If you want, you can also download the pre-trained model provided by the creators of Stable Diffusion and run it on your own machine. You can also use this Colab Notebook for a code-based implementation of the Stable Diffusion API.
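If you would rather call it from Python directly, one common route (our own sketch, separate from the notebook linked above) is Hugging Face’s diffusers library. The model ID below refers to the publicly released v1.5 checkpoint, and a CUDA-capable GPU with enough memory is assumed.

# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # publicly released Stable Diffusion checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")                  # a CUDA GPU is assumed here

image = pipe("a detective puppy, cinematic lighting").images[0]
image.save("detective_puppy.png")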

Midjourney

Midjourney is another well-known image generation model, available through their Discord server. They give 25 credits to each new Discord account to experiment with, and their Discord Bot can be invited to your own server so that you can show off your creations to your friends.

Conclusion

Generative AI is a rapidly expanding field within Artificial Intelligence that has opened up a new world of possibilities for creative expression and problem-solving. DALL-E 2 by OpenAI is one of the most advanced generative AI models, using two key technologies, CLIP and a diffusion model, to generate unique images from natural language prompts. Its success showcases the potential of generative AI in revolutionizing the digital world. Today, anyone can leverage this cutting-edge technology to bring their creative ideas to life.
