How to make DALL·E 2 understand you better. Part 1

Elena K.
9 min read · Sep 4, 2022


The image was generated by DALL·E 2

A few weeks ago I finally got access to DALL·E 2, and I was thrilled. When I first tried it, I wasn’t totally impressed: the results of my first requests were a little disappointing. Still, it seemed incredible because of the technology behind it, so I continued testing it. Some time later I realized my mistake, and here’s the deal: it’s not enough to type a random request; it should be well structured. So today I’m going to give you some advice on how to make your interaction with DALL·E 2 more effective.

Let me give you a bird’s-eye view of the DALL·E 2 architecture, which will help you understand a little more deeply how your text prompt is processed.

A high-level overview of unCLIP. Source: arxiv.org

Let’s dive into the text-to-image generation process below the dotted line and look at the main steps:

  1. First, a text encoder produces a CLIP text embedding of the input caption;
  2. Then the text embedding is fed to a diffusion prior, which produces a CLIP image embedding;
  3. Finally, an image decoder generates an image conditioned on the CLIP image embedding.
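
To make these three steps concrete, below is a toy sketch of the data flow in Python. The function bodies are placeholders I invented for illustration (random vectors and a blank image instead of real networks); only the handoffs between the stages matter:

    import numpy as np

    def clip_text_encoder(caption: str) -> np.ndarray:
        # Step 1: CLIP maps the caption to a fixed-size text embedding
        # (512-d here, as a stand-in for the real encoder).
        rng = np.random.default_rng(abs(hash(caption)) % 2**32)
        return rng.standard_normal(512)

    def diffusion_prior(text_emb: np.ndarray) -> np.ndarray:
        # Step 2: the prior turns a CLIP text embedding into a CLIP image
        # embedding (placeholder: identity instead of a diffusion model).
        return text_emb

    def image_decoder(image_emb: np.ndarray) -> np.ndarray:
        # Step 3: the decoder renders pixels conditioned on the image
        # embedding (placeholder: a blank 64x64 RGB image).
        return np.zeros((64, 64, 3))

    image = image_decoder(diffusion_prior(clip_text_encoder("a giant blue cat in the ocean")))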

A few words about CLIP (Contrastive Language-Image Pre-training). The main task of CLIP is to match images to their corresponding captions: it learns the link between textual and visual representations of the same abstract object.

So we need to look at the data used for CLIP training. This dataset was constructed from 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet.

One random (non-cherry-picked) prediction of CLIP. Source: openai.com
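
If you want to see this matching in action, you can reproduce a prediction like the one above with OpenAI’s open-source clip package (installed with pip install git+https://github.com/openai/CLIP.git, plus torch and torchvision). A minimal sketch; the image file name is just a placeholder:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # "guacamole.jpg" is a placeholder; point this at any image on disk.
    image = preprocess(Image.open("guacamole.jpg")).unsqueeze(0).to(device)
    captions = clip.tokenize([
        "a photo of guacamole, a type of food",
        "a photo of ceviche, a type of food",
        "a photo of edamame, a type of food",
    ]).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, captions)
        probs = logits_per_image.softmax(dim=-1)

    print(probs)  # the caption with the highest probability is CLIP's best match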

In the original CLIP paper, the authors note a problem: in the training dataset it was relatively rare for the text paired with an image to be just a single word. They solved it with the prompt template “A photo of a {label}”. They also highlight that during this “prompt engineering” it sometimes helped to specify the category.

For example, on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to provide context worked well.

Since CLIP was pre-trained on data from the internet, many captions carried extra details: the artistic style the image was created, drawn, or rendered in, the company or organization that released or published it, who the creator was, and so on. That’s why the training data included plenty of captions in formats like “{subject}, {camera angle}”, “{subject}, {style}”, or “{subject}, {time}”.
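
In practice, this means you can build your own prompts from the same kinds of templates. A trivial sketch (the labels and styles here are just examples):

    # The classification-style template from the CLIP paper:
    labels = ["Abyssinian", "Bengal", "Birman"]  # Oxford-IIIT Pets breeds
    pet_prompts = [f"A photo of a {label}, a type of pet." for label in labels]

    # The "{subject}, {style}" format that mirrors internet captions:
    subject = "the scariest monster"
    styles = ["digital art", "oil painting", "pencil sketch"]
    styled_prompts = [f"{subject}, {style}" for style in styles]
    print(styled_prompts)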

Before going further, I want to show you my first prompts, from when I had no idea how to write a request. Below you will see the first stage of my relationship with DALL·E 2 (my requests are shown in italics).

I: I want to see the scariest monster.

DALL·E 2:

I: Could you show me the giant blue cat taking off in the middle of the ocean?

DALL·E 2:

I: What about the cyberpunk illustration of the androids dreaming of electric sheep?

DALL·E 2:

Did you expect to get something like this? Frankly speaking, I did not. Even if the 4th cat from the 2nd attempt reminds me a bit of Hayao Miyazaki and his Totoro, it seems quite plain to me. And it absolutely doesn’t look like the pics from the DALL·E 2 demo. It took me a while to make my art look better, and I assure you it is not that difficult. You just need to keep an eye on a few points, which we will cover below.

6 main questions to remember

Before you send your request, answer a few questions; they will help you make the request clearer:

  1. What is an image composed of? (Composition)
  2. Which elements are more important, and which less? (Scale)
  3. How do the elements relate to one another? (Proximity)
  4. What angle is this image supposed to be from? (Position)
  5. How should the elements be lit? (Lighting)
  6. What image style is the most appropriate? (Style)

Here I want to discuss the first four questions from this list, since they cover the fundamental concepts of building an image. The other two will be discussed in the next part.
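
A handy way to keep all six questions in front of you is to treat them as fields of a small prompt builder. This is only my own sketch, not anything official:

    from dataclasses import dataclass

    @dataclass
    class PromptSpec:
        subject: str         # Composition: what the image is made of;
                             # Scale is usually expressed by size words inside the subject
        proximity: str = ""  # e.g. "wide shot", "close-up shot"
        position: str = ""   # e.g. "low angle", "bird's-eye view"
        lighting: str = ""   # covered in the next part
        style: str = ""      # covered in the next part

        def to_prompt(self) -> str:
            parts = [self.subject, self.proximity, self.position,
                     self.lighting, self.style]
            return ", ".join(p for p in parts if p)

    spec = PromptSpec(
        subject="a man riding a motorbike at the edge of space, toward a planet",
        proximity="wide shot",
        style="digital art",
    )
    print(spec.to_prompt())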

Composition

Photo composition is the arrangement of visual elements within the frame.

“It’s a pleasing organization of objects within your rectangle” — photographer Adam Long

Generally, there are several important rules of composition, but most of them apply only when shooting a real photograph.

However, some of these concepts carry over to creating images with DALL·E 2:

  1. Elements of Composition (Lines, Shape and Form, Value, Space, Colour, Texture).
  2. Principles of Composition (Balance, Proportion, Harmony).

You can vary all these in your text prompt. Let’s see how it works in practice.

I: I want to see a man riding a motorbike at the edge of space, digital art.

DALL·E 2:

I: Ok, let’s add some details. Show me a man riding a motorbike at the edge of space, toward a planet, digital art.

DALL·E 2:

I: What if we add a dragon to the same picture? So I wanna get a man riding a motorbike at the edge of space, trying to get away from the white dragon, toward a planet, shadow, digital art.

DALL·E 2:

Here the composition is richer than in the first picture, and we could go on and on making it more complex. But I want to move on and consider the next part, dealing with Scale, Proximity, and Position.

Scale & Proximity & Position

These three concepts are closely related because they are responsible for the connection between objects in space. Nevertheless, let’s try to define each concept and then consider them as one entity.

Scale is an important element alongside composition: it helps convey a sense of 3D space in a photo, painting, or drawing. You can tune the scale with size words such as “giant” or “tiny”.

As for proximity: looking at a photo, we can usually tell how the objects relate to each other and which one is more important than the others. You may also want to focus on a particular part of an object, and proximity is exactly what helps us do that. The following values can be used:

  • (Extreme) Close-up shot
  • Medium shot
  • (Extreme) Wide shot
  • (Extreme) Long shot
  • Full shot

Position (angle and camera view) is also quite an important concept, because it helps you show the relationship between objects. You can add an angle value to the end of your prompt (we’ll put these values to work in a short script after the list), such as:

  • High angle
  • Low angle
  • Bird’s-eye view
  • Bug’s-eye view
  • Face to face
  • Over-the-shoulder shot
  • Or you can specify a degree value (e.g., a 25-degree angle)
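
Since all of these modifiers are simply appended to the prompt, you can enumerate combinations to get a whole batch of variants to try. A quick sketch, using the menu description we’ll work with below as the base:

    from itertools import product

    base = ("high-quality photo of a salad with rocket romaine, parmesan cheese, "
            "savory grilled chicken breast, hearty croutons, caesar dressing, "
            "on a white plate, on the oak table")

    shots = ["close-up shot", "medium shot", "wide shot"]
    angles = ["high angle", "low angle", "25-degree angle"]

    for shot, angle in product(shots, angles):
        print(f"{base}, {shot}, {angle}")  # nine prompt variants to try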

Now it’s time to try it out. Let’s assume we would like to save on food stylists and generate a new image for the Chicken Caesar from this menu. For our prompt, we will use the original description from the menu.

Chicken Caesar. Source: thebigsalad.com

I: high-quality photo of a salad with rocket romaine, parmesan cheese, savory grilled chicken breast, hearty croutons, caesar dressing, on a white plate, on the oak table.

DALL·E 2:

I: It’s better to show the whole plate in the picture: high-quality photo of a salad with rocket romaine, parmesan cheese, savory grilled chicken breast, hearty croutons, caesar dressing, on a white plate, on the oak table, wide shot.

DALL·E 2:

I: Let’s try to zoom out the plate a little bit: high-quality photo of a salad with rocket romaine, parmesan cheese, savory grilled chicken breast, hearty croutons, caesar dressing, on a white plate, on the oak table, extremely long shot, wide shot

DALL·E 2:

I: What about the angle? high-quality photo of a salad with rocket romaine, parmesan cheese, savory grilled chicken breast, hearty croutons, caesar dressing, on a white plate, on the oak table, extremely long shot, wide shot, 25-degree angle

DALL·E 2:

I: And camera view? a high-quality photo of a salad with rocket romaine, parmesan cheese, savory grilled chicken breast, hearty croutons, caesar dressing, on a white round plate, on the oak table, extremely long shot, wide shot, 25-degree angle, left side view

DALL·E 2:

To my eye, it looks much fancier than the original image. And again, we have almost endless opportunities to improve it. We will continue our experiments in the next part.

Is it possible to get a person riding / running / walking on the Rings of Saturn?

The short answer is: I don’t know. Really. I have no idea why, but all my attempts have been in vain, even though my brain pictures a really astonishing image that I want to see for real.

I have made a lot of requests, but I still haven’t got the desired output. Here you can see the results.

I: astronaut riding a motorbike on the Rings of Saturn, digital art.

DALL·E 2:

I: astronaut running on the Rings of Saturn, digital art.

DALL·E 2:

I: astronaut walking on the Rings of Saturn, digital art.

DALL·E 2:

As you can see, the task of designing a prompt that puts a person on the Rings of Saturn is still open :)

Here we have covered the fundamental concepts of image generation: Composition, Scale, Proximity, and Position. We have got some pretty interesting images, which we can try to improve further using Lighting and Style.

Finally, despite all of these recommendations, remember that you shouldn’t be afraid of being vague. Just try different approaches and see what happens. Eventually you will find your own approach to DALL·E 2, and maybe one day you will win first place at an art competition :)

Literature

  1. DALL·E 2 — https://openai.com/dall-e-2/
  2. Hierarchical Text-Conditional Image Generation with CLIP Latents — https://arxiv.org/pdf/2204.06125.pdf
  3. Learning Transferable Visual Models From Natural Language Supervision — https://arxiv.org/pdf/2103.00020.pdf
  4. DALL·E 2 Unbundling — https://bakztfuture.substack.com/p/dall-e-2-unbundling
  5. DALL·E 2 prompt book — https://pitch.com/v/DALL-E-prompt-book-v1-tmd33y
