DALL-E 3 — More Advanced Than Ever Before

Abe Bellini
3 min read · Dec 19, 2023

Despite the remarkable strides in text-to-image generative models, one issue persisted: models struggled to reflect the words in a caption, their ordering, and their intended meaning in the images they generated.

This challenge, known as prompt following, pushed researchers to look for solutions.

Several works, including Rassin et al. (2022), Saharia et al. (2022), and Yu et al. (2022b), shed light on the limitations of prompt following in models like DALL-E 2.

Whether the diagnosis was a lack of constraints tying each word to a single meaning, or the proposed fix was conditioning on pre-trained language models or scaling up autoregressive image generators, the problem persisted.

Caption Improvement
In response to these challenges, the creators of DALL-E 3 proposed a unique approach: caption improvement.

Their hypothesis centered on the notion that existing models suffered from the poor quality of the text-image pairings in their training datasets.

To combat this, they introduced a bespoke image captioner designed to produce detailed and accurate descriptions of images. These enhanced captions were then used to retrain text-to-image models.
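DALL-E 3's captioner itself is not public, but the recaptioning step is easy to picture: run every training image through a captioning model and pair the image with the generated description instead of its original alt text. The sketch below illustrates that idea with an off-the-shelf open captioner (BLIP via Hugging Face Transformers); the model choice, file paths, and output format are illustrative assumptions, not details of OpenAI's pipeline.

```python
# Sketch: "recaption" a folder of training images with an off-the-shelf
# captioner and store (image, synthetic_caption) pairs for later training.
# NOTE: DALL-E 3 uses a bespoke, proprietary captioner; BLIP here is only
# a stand-in to illustrate the recaptioning step. Paths are placeholders.
import json
from pathlib import Path

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

def caption_image(path: Path) -> str:
    """Produce a synthetic caption for a single image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=60)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Write the recaptioned dataset as JSONL: one {"image", "caption"} record per line.
with open("recaptioned_dataset.jsonl", "w") as f:
    for image_path in sorted(Path("images/").glob("*.jpg")):
        record = {"image": str(image_path), "caption": caption_image(image_path)}
        f.write(json.dumps(record) + "\n")
```

The resulting file plays the role of the improved caption dataset on which a text-to-image model would then be retrained.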

Synthetic Captions for Enhanced Training
The concept of training on synthetic data is not…

