Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Nov 17, 2023

Author: Vidit Goel (200050156) and Nikhil Manjrekar (200050088)

In this blog, we discuss “Imagen,” the text-to-image diffusion model introduced in the paper Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding by Google Research. Imagen is striking both for its degree of photorealism and for its depth of language understanding. In the world of multimodal learning, where the spotlight is on text-to-image synthesis, Imagen stands out as a significant step forward: it combines large transformer language models with diffusion models, offering a practical method for creating detailed images from written descriptions.

More on this model can be found here!

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (arxiv.org)

Background on Diffusion Models

Diffusion models are generative models that start from an input x⁰ and systematically add Gaussian noise at each timestep t until reaching a final state, denoted xᵀ, which is essentially pure noise. The idea draws inspiration from non-equilibrium thermodynamics, where states evolve through diffusion processes toward homogeneity over time.

Diffusion models learn to reverse this process, attempting to recover the original x⁰ from xᵀ (where, in this case, x⁰ is an image).
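
For intuition, here is a minimal sketch (not the paper’s code) of the forward noising step in PyTorch: the clean image x⁰ is mixed with Gaussian noise according to a noise schedule, and by the final step xᵀ is essentially pure noise. The model is then trained to predict and remove that noise.

```python
import torch

def forward_noise(x0, t, alpha_bar):
    """Sample x_t from q(x_t | x_0) by mixing the clean image with Gaussian noise.

    x0        : clean image tensor, values roughly in [-1, 1]
    t         : integer timestep index
    alpha_bar : 1-D tensor of cumulative products of (1 - beta_t) over steps
    """
    eps = torch.randn_like(x0)                   # Gaussian noise
    a = alpha_bar[t]
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * eps  # closed-form forward step
    return xt, eps

# Example: a linear beta schedule over T = 1000 steps (a common choice)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.rand(1, 3, 64, 64) * 2 - 1            # toy "image" in [-1, 1]
xT, _ = forward_noise(x0, T - 1, alpha_bar)      # nearly pure noise by the last step
```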

Because the denoising process operates directly on the image at every step, diffusion models maintain a tighter link between the data and the model’s predictions than non-diffusion-based text-to-image generators. The practical consequence is output that is generally more photorealistic.

Previous Work

Over the past few years, researchers have been trying to turn text into images. In the early days, though, it was a struggle to realistically blend different textual concepts into one picture. OpenAI stepped up with DALL-E, a game-changer that could smoothly weave together various unrelated concepts into a single image, generating it token by token. Just under a year later, OpenAI shook things up again, shifting gears to diffusion models with GLIDE. According to human evaluators, GLIDE outshone other methods in both realism and caption matching across different scenarios, solidifying the dominance of diffusion models in turning text into pictures.

Later, DALL-E 2 took things a step further: it first predicts an image embedding from the given text prompt and then decodes that embedding into an image. While there were other notable advancements during this time, we’ve focused on three major breakthroughs that lay the groundwork for Imagen.

Imagen

Architecture

  1. Pretrained Text Encoders: Imagen leverages large pretrained text encoders (BERT, T5, CLIP) for text-to-image synthesis, diverging from conventional models whose text encoders are trained on paired image-text data. Freezing the encoder weights offers computational advantages, and scaling up the text encoder is found to significantly enhance text-to-image generation quality. Human evaluations favor T5-XXL over CLIP in both image-text alignment and fidelity (a sketch of what a frozen text encoder looks like in practice follows this list).
  2. Diffusion Models and Classifier-Free Guidance: Imagen adopts diffusion models with classifier-free guidance and introduces dynamic thresholding, which actively prevents pixel saturation at high guidance weights and yields superior photorealism. The model thereby improves image-text alignment while avoiding the quality degradation associated with prior methods (a sketch of guidance and dynamic thresholding also appears after this list).
  3. Robust Cascaded Diffusion Models: Imagen’s architecture consists of a base 64 × 64 text-to-image model and two text-conditional super-resolution diffusion models. Noise conditioning augmentation enhances image fidelity, and the Efficient U-Net variant improves memory efficiency. Because the super-resolution models are conditioned on the augmentation noise level, the cascade generates high-quality images across resolutions.
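
To make item 1 concrete, here is a hedged sketch of what “frozen pretrained text encoder” means in practice, using the Hugging Face transformers library. Imagen itself uses T5-XXL; the small “t5-small” checkpoint stands in here purely so the snippet runs on modest hardware, and this is not the authors’ code.

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Illustration only: Imagen uses T5-XXL; "t5-small" is substituted here.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")
encoder.eval()

# The encoder is frozen: no gradients flow into it during diffusion training.
for p in encoder.parameters():
    p.requires_grad = False

prompt = "A brain riding a rocketship heading towards the moon."
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)

# These per-token embeddings are what the diffusion U-Net cross-attends to.
```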

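Item 2’s two ideas fit in a few lines as well. The following is a minimal sketch, not Imagen’s actual implementation: classifier-free guidance mixes the conditional and unconditional noise predictions, and dynamic thresholding (as described in the paper) rescales the predicted image so that large guidance weights do not saturate pixels.

```python
import torch

def guided_epsilon(eps_cond, eps_uncond, w):
    """Classifier-free guidance: push the conditional prediction away from the
    unconditional one by guidance weight w (w = 1 recovers plain conditioning)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def dynamic_threshold(x0_hat, percentile=0.995):
    """Dynamic thresholding: at each sampling step, pick s as a high percentile
    of |x0_hat| per image, clip to [-s, s], and rescale by s so pixels stay in
    [-1, 1] even at large guidance weights."""
    b = x0_hat.shape[0]
    s = torch.quantile(x0_hat.abs().reshape(b, -1), percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(b, 1, 1, 1)  # never shrink below [-1, 1]
    return torch.clamp(x0_hat, -s, s) / s

# Toy usage with random tensors standing in for real model outputs
eps_c, eps_u = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
eps = guided_epsilon(eps_c, eps_u, w=7.0)      # large w exaggerates conditioning
x0_hat = torch.randn(2, 3, 64, 64) * 3         # pretend prediction with saturated pixels
x0_hat = dynamic_threshold(x0_hat)
```
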
DrawBench: A Comprehensive Evaluation Benchmark

To overcome COCO’s limitations, Imagen introduces DrawBench, offering diverse prompts for a more insightful evaluation of text-to-image models. With 11 categories probing various capabilities, including color rendering and complex interactions, DrawBench facilitates a direct comparison of models. Human raters consistently prefer Imagen over others in side-by-side comparisons, emphasizing both sample quality and image-text alignment.

Results and Performance Metrics

Imagen’s performance is quantified through a state-of-the-art zero-shot FID-30K score of 7.27 on COCO, surpassing previous models such as GLIDE and DALL-E 2. The DrawBench results emphasize Imagen’s superiority in human preferences during pairwise comparisons with DALL-E 2, GLIDE, Latent Diffusion, and CLIP-guided VQ-GAN models. These comparisons, focusing on caption alignment and fidelity, show a notable margin over the counterparts. The strength of this preference in DrawBench evaluations underscores Imagen’s advances in generating high-quality and contextually accurate images, as perceived by human raters.
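
For readers unfamiliar with the metric, FID compares the statistics of Inception features extracted from real and generated images; lower is better. Below is a minimal sketch of the computation with toy stand-in features, not the exact evaluation code used in the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians fitted to image features."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):      # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Toy usage: in practice mu/sigma come from 2048-D Inception-v3 features
# over tens of thousands of real and generated images.
rng = np.random.default_rng(0)
feats_real = rng.normal(size=(1000, 16))       # stand-in features, 16-D for speed
feats_fake = rng.normal(size=(1000, 16)) + 0.1
score = fid(feats_real.mean(0), np.cov(feats_real, rowvar=False),
            feats_fake.mean(0), np.cov(feats_fake, rowvar=False))
```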

Conversely, results from the COCO validation set, a standard benchmark in text-to-image models, do not exhibit significant distinctions between different models. The authors briefly touch upon these results, suggesting a more comparable performance among models in this setting. Notably, the limited capability of Imagen to generate photorealistic people on the COCO dataset is an intriguing observation. However, the paper lacks qualitative examples illustrating the extent of this limitation in Imagen’s people generation. This observation hints at potential nuances and challenges in handling certain image categories, urging further exploration in future research.

Discussion

Subjectivity in Evaluation Metrics

The paper asserts Imagen’s exceptional photorealism and language understanding in text-to-image synthesis, but the reliance on human raters introduces subjectivity. The discontinuous nature of the metrics, in which raters must pick a single “most photorealistic” image, raises interpretational challenges. There is a need for continuous evaluation methods, possibly incorporating difficulty-based weighting, for greater reliability.

Challenges in Capturing Complexity

As the field advances and models become more impressive and creative, the evaluation methods used in the paper become less reliable. The chosen metrics, primarily fidelity and caption alignment, potentially favor Imagen’s strengths, underscoring the need for a broader and more diverse set of evaluation criteria.

DrawBench as a Benchmark

While the release of DrawBench is a genuine contribution to text-to-image research, the benchmark comprises only around 200 text prompts across 11 categories, which is small compared to datasets like COCO. The field of text-to-image synthesis is dynamic, so benchmarks will need ongoing adaptation or expansion. Potential bias in DrawBench’s construction, including the absence of prompts involving people, also raises questions about its alignment with diverse real-world scenarios.

Conclusion

To sum up, the authors have made big strides in text-to-image synthesis with their model ‘Imagen.’ Despite not being released to the public for ethical reasons, the model incorporates appealing techniques like off-the-shelf frozen text encoders and an efficient U-Net architecture. We enjoyed reading the paper and find the contributions exciting. However, the authors might be overselling Imagen and DrawBench. It’ll be interesting to see more thorough evaluations in future publications or from select researchers with access to Imagen. Looking forward to seeing how the field evolves!
