From Dali to DALL-E: Decoding the Art of AI

A case study in reverse engineering DALL-E

Jul 22, 2023

Target image generated by DALL-E, provided by the AI Co-creation course at Reichman University

1. Embarking on a Journey: Unraveling DALL-E

Is it possible to reverse-engineer DALL-E’s image generation? To find out, I approached DALL-E as a supervised model, with the user (myself) acting as the supervisor through prompt adjustments. I experimented with different prompt strategies, observing the model’s responses and forming impressions based on the outcomes.

2. Diving into the Depths: The Latent Space of DALL-E

First trial to recreate the original image above: [“A monochromat (grey) 3d illustration of a rounded joystick on top of the drywall flat layers on top of three very thin and flat visible drywall layers”]

DALL-E likely samples points from a latent space: a lower-dimensional space where complex data is represented in a simplified, compressed form. Each point in this latent space corresponds to a potential image, which is what enables DALL-E’s prompt-based image generation. However, because the latent space holds so many possibilities, accurately reproducing a specific image is challenging, and the images DALL-E generates show a degree of variability and unpredictability.
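To make the idea of sampling concrete, here is a minimal toy sketch in Python. It is not DALL-E’s actual mechanism: the 8-dimensional latent space, the `sample_latent` helper, the noise scale, and the seed are all assumptions invented for illustration, and the seed is a parameter of this toy only, not a claim about DALL-E’s interface. The sketch shows only why one prompt can land on different latent points from run to run.

```python
import random

LATENT_DIM = 8  # toy size (assumption); real latent spaces are far larger


def sample_latent(prompt_vector, seed, noise_scale=0.5):
    """Return a latent point drawn at random near the prompt's position."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_scale) for x in prompt_vector]


def distance(a, b):
    """Euclidean distance between two latent points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical stand-in for an encoded prompt.
prompt_vector = [1.0] * LATENT_DIM

first = sample_latent(prompt_vector, seed=1)
second = sample_latent(prompt_vector, seed=2)
first_again = sample_latent(prompt_vector, seed=1)

print(distance(first, second) > 0)  # different draws land on different points
print(first == first_again)         # the same seed reproduces the same point
```

In this toy, the same prompt gives a different point on every fresh draw, and only a fixed seed brings back the same one. If something similar happens inside DALL-E, it would explain why repeating a prompt does not repeat the image.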

3. The Building Blocks: Understanding DALL-E’s Dataset

DALL-E was likely trained on text-image pairs, learning to associate captions with regions of its latent space. When the model encounters a prompt that doesn’t align with its training pairs, it can produce distorted images. For instance, consider a prompt involving a joystick and a drywall board. These are not common pairings in everyday contexts, and it’s unlikely that DALL-E was trained extensively on such combinations.

When asked to generate an image based on this unusual pairing, DALL-E might produce a distorted or unexpected result. This distortion can occur when the model tries to map the unfamiliar prompt to the closest familiar image-text pairs it was trained on.

Prompt: [“3 layers of drywall boards one on top of the other, and on the top flat layer a white grainy joystick with no colors”]. Comparing DALL-E’s results with Midjourney’s

Interestingly, a comparison of DALL-E’s results with those of Midjourney (MJ) reveals a significant difference in their focus. While DALL-E tends to concentrate on the object described in the prompt, MJ appears to give precedence to the relations between the elements. This suggests that the training dynamics of different models can significantly influence their output, leading to a high degree of variability and unpredictability in the images generated.

4. Also the Ones I Lost, I’ve Won: Gauging DALL-E’s Success

How do we measure DALL-E’s success in recreating a target image? One way is a loss function, which measures the discrepancy between the generated and target images. In my trials, adjusting prompts to align more closely with the target image often still produced high loss values. But in the realm of creative AI like DALL-E, a high loss value doesn’t necessarily mean failure. Each trial, regardless of the outcome, contributes to our understanding of DALL-E’s capabilities and limitations, turning every “loss” into a “win”.
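For readers who want a number rather than an impression, here is one way such a loss could be computed. This is a sketch under assumptions: pixel-wise mean squared error is just one common choice and is not necessarily the measure used in these trials, and the tiny 2x2 grayscale “images” and their names are made up for illustration.

```python
def mean_squared_error(generated, target):
    """Average squared pixel difference between two same-sized grayscale images."""
    same_rows = len(generated) == len(target)
    same_cols = all(len(g) == len(t) for g, t in zip(generated, target))
    if not (same_rows and same_cols):
        raise ValueError("images must have the same dimensions")

    total = 0.0
    count = 0
    for row_g, row_t in zip(generated, target):
        for pixel_g, pixel_t in zip(row_g, row_t):
            total += (pixel_g - pixel_t) ** 2
            count += 1
    return total / count


# Made-up 2x2 grayscale images with pixel values between 0.0 and 1.0.
target = [[0.0, 0.5], [0.5, 1.0]]
close_attempt = [[0.0, 0.5], [0.5, 0.8]]
far_attempt = [[1.0, 1.0], [0.0, 0.0]]

print(mean_squared_error(close_attempt, target))
print(mean_squared_error(far_attempt, target))
```

The closer attempt scores lower than the distant one. A pixel-wise measure like this is strict, though: an image that a human would call a good match can still score badly if the object is shifted or lit differently.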

5. The Art of Imitation and Inspiration: DALL-E’s Style Replication and Potential Plagiarism

Collecting the winnings, I ventured into the realm of artistic influence, evaluating DALL-E’s ability to replicate specific styles. It was evident that the model can capture underlying patterns, textures, and shapes, but often struggles to combine them coherently. To delve deeper into DALL-E’s ability to both imitate and innovate, I used a prompt inspired by Salvador Dali’s style: “Melting clocks like The Persistence of Memory of Dali.”

DALL-E: Prompt: [“Melting clocks like The Persistence of Memory of Dali”]

Midjourney: Prompt: [“Melting clocks like The Persistence of Memory of Dali”]

I then compared the outputs of DALL-E to those of Midjourney (MJ). It became clear that MJ was able to incorporate elements of Dali’s background style, producing a more coherent and artistic composition than DALL-E. This suggests that MJ draws more inspiration from previously learned images, potentially involving some level of plagiarism. This could be attributed to the inclusion of actual artworks in MJ’s training data.

6. The Power of Words: How Writing Style Influences DALL-E

The writing style of the prompt also influences the resulting image. Notably, the order of terms played a significant role in directing the model’s focus on specific elements. For instance, when the term “joystick” appeared first in the prompt, the model rendered that element more accurately. This observation is reminiscent of how people often listen to the beginning of a sentence and complete it based on their own experiences and expectations.
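One way to test term order more systematically is to generate every ordering of the same phrases and submit each as its own prompt. The sketch below does only the text side of that. The `prompt_orderings` helper and the three phrases are hypothetical, loosely paraphrased from the prompts above rather than copied from the actual trials.

```python
from itertools import permutations


def prompt_orderings(phrases, separator=", "):
    """Return every ordering of the phrases, each joined into one prompt string."""
    return [separator.join(order) for order in permutations(phrases)]


# Hypothetical phrases, paraphrased from the joystick prompts.
phrases = [
    "a white grainy joystick",
    "three flat drywall layers",
    "monochrome 3d illustration",
]

for prompt in prompt_orderings(phrases):
    print(prompt)
```

Three phrases give six prompts. Comparing the images from prompts that begin with the joystick against those that end with it would show whether the position effect holds up beyond a handful of trials.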

7. The Final Verdict: Deciphering the Mysteries of DALL-E

After dozens of trials with several target images, the training dynamics of DALL-E remain an enigma. Reverse engineering DALL-E to replicate specific images proved challenging and yielded diverse outcomes. The trials underscored the inherent randomness of DALL-E’s generation process, possibly attributable to the way it samples from its latent space.

Some of the trials to recreate the target image. DALL-E’s target image is on the left; recreation trials are on the right.

This randomness contributes to the generation of different images from the same prompt, emphasizing the inherent variability and unpredictability of DALL-E’s image generation process. Reproducing exact images from prior prompts remains a significant challenge.

As we continue to explore the dance between human creativity and AI’s capabilities, we’re left with a thought-provoking question: “In this dance, who’s leading whom?”

This article was created as part of the AI Co-creation course in the MA HCI program at Reichman University, with the guidance of Professor Doron Friedman and Dan Pollak.