Understanding How AI Generates Images from Text

[Figure: DAAM heat maps]

Recent advances in AI image generation have led to impressive results, with systems like Stable Diffusion able to create highly realistic images from simple text prompts. But while these models can generate remarkably detailed pictures, they remain something of a black box: we have little insight into how they actually translate the words in a prompt into pixels.

New research from computer scientists is helping peel back the curtain on these AI systems. In a paper titled “What the DAAM: Interpreting Stable Diffusion Using Cross Attention”, researchers propose a method called DAAM (Diffusion Attentive Attribution Maps) to analyze how words in a prompt influence different parts of the generated image.

DAAM creates heat maps showing which pixels are most related to each word in the text prompt. For example, for a prompt like “a blue bird flying”, the word “blue” would highlight the blue parts of the bird, “bird” would highlight the full bird, and “flying” would highlight the motion-blurred wings and body.
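The researchers released an open-source Python package, daam, that wraps a Hugging Face diffusers pipeline and exposes these per-word heat maps. The sketch below follows the package’s documented usage, but treat the exact function names, arguments, and model ID as assumptions that may shift between versions.

```python
# pip install daam diffusers  -- illustrative sketch; API details may vary by version
import torch
from matplotlib import pyplot as plt
from diffusers import DiffusionPipeline
from daam import trace, set_seed

# Load a Stable Diffusion checkpoint (model ID is illustrative).
pipe = DiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2-base')
pipe = pipe.to('cuda')

prompt = 'a blue bird flying'
gen = set_seed(0)  # fix the seed so the generated image is reproducible

with torch.no_grad():
    with trace(pipe) as tc:                        # hook the pipeline's cross-attention layers
        out = pipe(prompt, num_inference_steps=50, generator=gen)
        global_map = tc.compute_global_heat_map()  # aggregate attention over steps and layers
        word_map = global_map.compute_word_heat_map('bird')
        word_map.plot_overlay(out.images[0])       # heat map drawn over the generated image

plt.show()
```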

By aggregating attention scores between prompt tokens and image patches across the model’s layers, attention heads, and denoising steps, DAAM produces interpretable maps linking words to visual features. The researchers validated DAAM on noun segmentation, a common computer vision benchmark where the goal is to identify the image regions corresponding to noun objects. DAAM achieved competitive scores despite never being explicitly trained for segmentation.
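In rough terms, the aggregation works like this: gather the cross-attention map produced by each attention head at each layer and denoising step, upsample every map to a common resolution, and sum them per prompt token. The sketch below illustrates that idea in PyTorch; the tensor shapes and final normalization are simplifying assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn.functional as F

def aggregate_heat_maps(attn_maps, out_size=64):
    """Sum cross-attention maps over heads, layers, and denoising steps.

    attn_maps: list of tensors, one per (layer, step), each shaped
        (heads, H*W, num_tokens) -- attention from image patches to
        prompt tokens. These shapes are illustrative assumptions.
    Returns: (num_tokens, out_size, out_size) per-token heat maps.
    """
    total = None
    for attn in attn_maps:
        heads, hw, num_tokens = attn.shape
        side = int(hw ** 0.5)  # attention grids are square
        # Reshape to (heads, tokens, side, side) so each token gets a 2-D map.
        maps = attn.permute(0, 2, 1).reshape(heads, num_tokens, side, side)
        # Upsample every layer's maps to one shared resolution.
        maps = F.interpolate(maps, size=(out_size, out_size),
                             mode='bicubic', align_corners=False)
        maps = maps.sum(dim=0)  # collapse the attention heads
        total = maps if total is None else total + maps
    # Normalize each token's map to [0, 1] for visualization.
    total = total.clamp(min=0)
    return total / total.amax(dim=(-1, -2), keepdim=True)
```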

Experiments with DAAM revealed new insights about these generative AI systems:

  • The relationships between words in a prompt translate into visual relationships in the image. For example, the heat map for a verb like “flying” visually encapsulates its subject, “bird”, covering the bird as well as the motion itself.
  • Using co-hyponyms such as “giraffe” and “zebra” in the same prompt leads to worse image generation, likely because their features become entangled: their DAAM maps overlap heavily (one way to measure that overlap is sketched after this list). Co-hyponyms are words that share a parent category, or hypernym, without being synonyms, giving them a kind of “sibling” status within that group; “giraffe” and “zebra” are co-hyponyms under a hypernym such as “wild animal”.
  • Descriptive adjectives like “blue” attend too broadly across the whole image, suggesting objects are entangled with their surroundings. Changing the adjective modifies the entire scene.
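One simple way to quantify the co-hyponym overlap noted above is to binarize each word’s heat map and compute the intersection over union (IoU) of the resulting masks: a high IoU between, say, “giraffe” and “zebra” suggests their features are entangled. This is an illustrative sketch, not code from the paper, and the threshold value is an arbitrary assumption.

```python
import torch

def heat_map_iou(map_a, map_b, threshold=0.4):
    """Intersection-over-union of two per-word heat maps.

    map_a, map_b: 2-D tensors of per-pixel attribution scores in [0, 1]
    (e.g., the normalized maps for 'giraffe' and 'zebra').
    The 0.4 threshold is an illustrative choice, not from the paper.
    """
    mask_a = map_a >= threshold
    mask_b = map_b >= threshold
    intersection = (mask_a & mask_b).sum().item()
    union = (mask_a | mask_b).sum().item()
    return intersection / union if union else 0.0
```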

For business leaders, research like DAAM matters because it improves the explainability of AI systems. As generative models become more ubiquitous, understanding how they operate will help identify limitations and assess risks. Models that entangle features in this way may be more prone to bias or to producing unrealistic outputs.

DAAM also demonstrates how the attention mechanisms in AI models can be repurposed for interpretation without retraining the models from scratch, allowing transparent analysis without compromising generation performance.

Overall, DAAM represents an impactful step toward explainable AI in generative models. Demystifying these systems will be key as businesses increasingly look to utilize powerful generative AI capabilities in their products and processes. Interpretability helps ensure these technologies are trustworthy and dependable.

Sources:

Tang et al., “What the DAAM: Interpreting Stable Diffusion Using Cross Attention” (arXiv)

