Fine-tuning DALL·E Mini (Craiyon) to Generate Blogpost Images

Nudging text-to-image models to illustrate abstract titles.

Julia Turc
9 min read · Aug 5, 2022

Text-to-image models like DALL·E are largely seen as off-the-shelf tools that require no further fine-tuning and can be controlled solely via text prompts. However, existing models were trained to illustrate concrete entities explicitly mentioned in the prompt. How can we nudge them to come up with their own visual metaphors given nothing more than a potentially abstract blogpost title?

Images generated by Craiyon (formerly DALL·E Mini) when prompted with article titles from The Startup publication. These entries were cherry-picked to emphasize the frequency of teal-background flat illustrations.

Illustrating Abstract Concepts

As an occasional Medium writer, I find myself agonizing over what sort of illustration to choose for my (admittedly dry) articles about natural language processing. There is nothing inherently visual in “Trends in Model Pre-training for Natural Language Understanding”. Most of the time I just admit defeat, type a trivial word like “text” into the Unsplash search box, and settle for one of the top 10 results. Judging by how often I see the image below on TowardsDataScience, I assume most writers do the same:

Photo by Patrick Tomasso on Unsplash — Overused on TowardsDataScience in articles about NLP.

Text-to-image models provide a glimmer of hope for my problem. While it is unreasonable to expect non-visual keywords like “pre-training” and “language” to match anything inspiring in a stock photo database, one can hope that the gazillion connections across the hundreds of millions or billions of parameters hidden inside a machine learning model will surface some sort of smart visual metaphor. So I prompted several text-to-image models (Midjourney, Craiyon and DALL·E 2) with the title of one of my articles, “Trends in Model Pre-training for Natural Language Understanding”:

Image generated by Midjourney for the prompt “Trends in Model Pre-training for Natural Language Understanding”
Image generated by Craiyon (Mega) for the prompt “Trends in Model Pre-training for Natural Language Understanding”
Image generated by DALL·E 2 for the prompt “Trends in Model Pre-training for Natural Language Understanding”

The results are… thrice disappointing. Any connections that I try to make between my prompt and the generated images seem far-fetched. Is Midjourney latching onto the synonymy between trends and patterns? Is Craiyon showing data scientists analyzing trends? DALL·E 2 in particular seems to have given up on any attempt to be witty.

The model would have to do something almost against its nature: to find a correlation between the title and a visual entity that is statistically unlikely enough to spark poetic joy — the best metaphors are, by definition, unexpected.

Admittedly, my expectations are unrealistic. These models were trained on <caption, image> pairs where the caption is, most of the time, explicitly descriptive of the contents of the image. Illustrating abstract concepts requires many skills in addition to being able to depict a scene. First, the model needs to understand the user intent (in my case, I’m looking for a witty metaphor instead of a literal depiction). Second, the model would have to do something almost against its nature: to find a correlation between my title and a visual entity that is statistically unlikely enough to spark poetic joy — the best metaphors are, by definition, unexpected.

I personally find it daunting to express these requirements solely through a text prompt. Even when users do have a concrete visual scene in mind, writing effective prompts is a non-trivial endeavor, as current models are stochastic black boxes that seem to respond particularly well to certain magic keywords. For instance, adding modifiers like “trending on artstation” or “extreme detail” can have dramatic effects on the quality of the generated image. How much work would I have to put in to figure out the magic incantation that turns a dry and abstract blogpost title into an engaging illustration?

Maybe it’s time to fall back on a transfer learning technique that is now out of fashion: fine-tuning. In other words, we can take a large model that was pre-trained with a proxy objective on a massive amount of data, and adjust its weights on a smaller dataset that resembles the end task more closely.

Collecting Data from 128k Medium Articles

Since my end task is to produce illustrations for my Medium blogposts based on their titles, Medium itself is the best source of training data. Publications like TowardsDataScience, The Startup or Better Humans gather articles about technology, career growth and self-improvement, topics that are inherently difficult to visualize. We collected metadata and images from articles spanning three years (2019 to 2021 inclusive) across 8 popular publications. You can find our dataset on Kaggle (succinctlyai/medium-data) and our scraper on GitHub (succinctly-ai/medium-scraping).
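If you want to poke around the data yourself, here is a minimal sketch for downloading and inspecting it. It assumes the Kaggle CLI is installed and authenticated; the CSV file name and column names are hypothetical placeholders rather than the dataset’s actual schema, so check the Kaggle page for the real field names.

```python
# Sketch: download the Medium metadata dump and peek at <title, image> pairs.
# The CSV file name and column names below are hypothetical placeholders;
# check the dataset page for the actual schema.
import subprocess
import pandas as pd

subprocess.run(
    ["kaggle", "datasets", "download",
     "-d", "succinctlyai/medium-data",
     "--unzip", "-p", "medium-data"],
    check=True,
)

df = pd.read_csv("medium-data/articles.csv")   # hypothetical file name
pairs = df[["title", "image_url"]].dropna()    # hypothetical column names
print(f"{len(pairs)} usable <title, image> pairs")
print(pairs.head())
```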

Fine-tuning DALL·E Mini (Craiyon)

DALL·E Mini (rebranded to Craiyon) is an open-source replica of DALL·E, the original text-conditioned image generation model that OpenAI published in January 2021 (and which remains closed to the public to this day). Note that this is a precursor to the more recent DALL·E 2 model announced in April 2022, which was recently made available to its first million users. While the two versions carry the same name, they are structurally very different: the former uses a Transformer architecture end-to-end, while the latter chains together a CLIP text encoder, a prior, and a diffusion-based decoder.

A Note on Quality

Regardless of the specifics, it is important to keep in mind that Craiyon does not use the state-of-the-art architecture for text-to-image models. In addition to emulating an older generation of text-to-image models, it does so at a smaller scale: compared to DALL·E, it is roughly 30x smaller (400M parameters vs 12B), and was trained on roughly 8x fewer images (30M vs 250M). This is part of the reason why it cannot handle certain categories of images very well (for instance, human faces). This is just a disclaimer to justify the relatively modest quality of Craiyon’s generated images in comparison with the jaw-dropping photorealistic creations of DALL·E 2. Nonetheless, Craiyon is an extremely useful artifact by virtue of being open-sourced: anybody can try out the demo or download the model and experiment with it, which is exactly what we did.
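To give a flavor of what “experimenting with it” looks like, here is a rough single-device inference sketch, loosely adapted from the official dalle-mini inference notebook (no CLIP re-ranking, default sampling settings). Treat the model references and exact function signatures as assumptions, since they may change across library versions.

```python
# Rough single-device inference sketch for DALL·E Mini; loosely adapted from
# the official dalle-mini inference notebook, so exact APIs may differ across
# library versions.
import jax
import jax.numpy as jnp
from dalle_mini import DalleBart, DalleBartProcessor
from vqgan_jax.modeling_flax_vqgan import VQModel

DALLE_MODEL = "dalle-mini/dalle-mini/mini-1:v0"       # the Mini checkpoint
VQGAN_REPO = "dalle-mini/vqgan_imagenet_f16_16384"    # image tokenizer / detokenizer

# The BART-like seq2seq model maps text tokens to discrete image tokens...
model, params = DalleBart.from_pretrained(DALLE_MODEL, _do_init=False)
# ...and the VQGAN turns those image tokens back into pixels.
vqgan, vqgan_params = VQModel.from_pretrained(VQGAN_REPO, _do_init=False)
processor = DalleBartProcessor.from_pretrained(DALLE_MODEL)

prompt = ["Trends in Model Pre-training for Natural Language Understanding"]
tokenized = processor(prompt)

# Sample a sequence of image tokens conditioned on the title.
encoded = model.generate(
    **tokenized,
    prng_key=jax.random.PRNGKey(0),
    params=params,
    condition_scale=10.0,  # classifier-free guidance strength
)

# Drop the BOS token and decode the image tokens into a 256x256 RGB image.
pixels = vqgan.decode_code(encoded.sequences[..., 1:], params=vqgan_params)
image = jnp.clip(pixels, 0.0, 1.0)[0]  # shape (256, 256, 3), values in [0, 1]
```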

Before Fine-tuning

First, we visualized at scale how Craiyon handles abstract blogpost titles out of the box. We randomly sampled 100 blogposts from The Startup publication and prompted the model with their titles. A quick visual inspection of the generated images shows a clear trend: the model often produces generic flat illustrations with a teal background, as shown below. While the table below cherry-picks the titles that elicit this particular behavior, such outputs are certainly not uncommon across the 100 randomly chosen data points.

Another note: Craiyon comes in two sizes, Mini and Mega. Our experiments use the former, since it is easier to work with from an engineering perspective (Mini fits on a single-host TPU). All the illustrations that follow use the Mini version. In contrast, the demo on craiyon.com runs inference on the Mega version.

Images generated by Craiyon (formerly DALL·E Mini) when prompted with article titles from The Startup publication. These entries were cherry-picked to emphasize the frequency of teal-background flat illustrations.

A reasonable hypothesis is that the prevalence of this type of illustration is an artifact of the training set. To get some insight into this curious behavior, we used CLIP (OpenAI’s contrastive image-text encoder) to find similar images and their captions in one of Craiyon’s training sets, Conceptual 12M. This revealed a series of photos from Shutterstock (like the one below) whose style closely matches the model’s generations. It is likely that the blogpost titles fall into the same language register as the captions of the corporate-themed illustrations in the training set, and that this strong association drives most generated images towards the same visual space. In other words, the titles land in a small, monotonous semantic space with little nuance or diversity.

Image from the Conceptual 12M dataset used to train DALL·E Mini. Its caption in the dataset is “Stress at work concept flat illustration. Stressed out women in suit with glasses, in office at the desk. Modern design for web banners, web sites, printed materials, infographics. Flat vector.”
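As a reference point, this kind of CLIP-based nearest-neighbor lookup can be sketched with the Hugging Face CLIP implementation roughly as follows. The file paths are placeholders, and the brute-force comparison is only illustrative; a real pass over Conceptual 12M would need batching and an approximate-nearest-neighbor index.

```python
# Sketch: find training images whose CLIP embeddings are closest to a Craiyon
# generation. File paths are placeholders, and the brute-force comparison
# would need batching plus an ANN index (e.g. FAISS) at Conceptual 12M scale.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)  # unit norm -> dot product = cosine similarity

query = embed(["craiyon_teal_illustration.png"])         # one generated image (placeholder path)
corpus = embed(["cc12m/00001.jpg", "cc12m/00002.jpg"])    # training images (placeholder paths)

scores = (query @ corpus.T).squeeze(0)        # cosine similarity to each training image
top = scores.topk(k=min(5, scores.numel()))   # indices of the closest training images
print(top.indices.tolist(), top.values.tolist())
```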

The Fine-tuning Procedure

We fine-tuned the open-sourced Craiyon model (the Mini size) on 128k <blogpost title, header image> pairs from our dataset on Kaggle (succinctlyai/medium-data), using a TPU v3-8. We kept most of the hyperparameters listed in the original technical report, but reduced the batch size significantly (16 per device, with a single gradient accumulation step). You can try out our interactive demo at succinctly/dalle-mini-vs-finetuned-medium.
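For intuition about what the fine-tuning step actually optimizes, here is a toy, self-contained sketch of the objective. A tiny stand-in network replaces the real DalleBart (the actual run used dalle-mini’s training script), and the vocabulary sizes are approximate: each title is paired with the 256 VQGAN tokens of its header image, and the loss is plain cross-entropy over those tokens.

```python
# Toy sketch of the fine-tuning objective with a tiny stand-in network instead
# of the real DalleBart. Vocabulary sizes are approximate; the data is random
# and only illustrates the shapes involved.
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax

TEXT_VOCAB, IMAGE_VOCAB, IMAGE_LEN = 50_000, 16_384, 256  # ~BART text vocab, VQGAN codebook, tokens per image

class TinyStandIn(nn.Module):
    """Drastically simplified stand-in for the BART-like text-to-image-token model."""
    @nn.compact
    def __call__(self, title_tokens):
        x = nn.Embed(TEXT_VOCAB, 128)(title_tokens).mean(axis=1)   # crude title encoding
        x = nn.Dense(IMAGE_LEN * 64)(jax.nn.relu(nn.Dense(256)(x)))
        x = x.reshape(x.shape[0], IMAGE_LEN, 64)
        return nn.Dense(IMAGE_VOCAB)(x)                            # logits for every image-token position

model = TinyStandIn()
rng = jax.random.PRNGKey(0)
titles = jax.random.randint(rng, (16, 32), 0, TEXT_VOCAB)                # a batch of 16 tokenized titles
image_tokens = jax.random.randint(rng, (16, IMAGE_LEN), 0, IMAGE_VOCAB)  # VQGAN codes of the header images

params = model.init(rng, titles)
tx = optax.adamw(learning_rate=1e-4)  # illustrative hyperparameters
opt_state = tx.init(params)

@jax.jit
def train_step(params, opt_state, titles, image_tokens):
    def loss_fn(p):
        logits = model.apply(p, titles)
        return optax.softmax_cross_entropy_with_integer_labels(logits, image_tokens).mean()
    loss, grads = jax.value_and_grad(loss_fn)(params)
    updates, opt_state = tx.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss

params, opt_state, loss = train_step(params, opt_state, titles, image_tokens)
print("toy loss:", float(loss))
```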

Here are the results for the same set of titles from The Startup publication:

Images generated by our model (Craiyon fine-tuned on Medium articles) when prompted with the same titles as above (articles from The Startup publication). These prompts were not part of the training set. In contrast with Craiyon, the generated images are more diverse.

While the quality of these illustrations is still lacking (they inherit Craiyon’s original limitations), there is much more diversity across the generated images. The tiny teal-themed visual space has expanded to capture more nuance from the title. For instance, the 5th row (“Upgrade Your Local Docker-Compose Development with Dozzle”) is no longer populated with standard corporate suits, but rather with something that alludes to Docker’s blue whale logo. More generally, the model breaks out of the flat-illustration niche and makes more attempts at photographic imagery.

This experiment suggests that Craiyon’s initial training set (30M images) is not large enough to exhaustively cover the entire semantic space. By fine-tuning it on a niche of interest, we are able to fill some of the gaps and add more nuance. It is hard to predict whether the same holds true for DALL·E 2, whose training set is an order of magnitude larger (650M images). But it’s plausible that even the almighty DALL·E 2 could use some fine-tuning when the input space is significantly different from standard natural language descriptions (e.g., a combination of free text and serialized HTML code), or when the target output space is far enough from typical illustrations (e.g., app or website mockups).

Back to Abstract Concepts and Metaphors

To close the loop, let’s get back to the original title that started this quest: “Trends in Model Pre-training for Natural Language Understanding”. To make a fair comparison, we will look at illustrations generated by (a) Craiyon Mini out of the box, and (b) our Craiyon Mini fine-tuned on Medium articles:

Image generated by the out-of-the-box model (Craiyon Mini) when prompted with “Trends in Model Pre-training for Natural Language Understanding”
Images generated by our model (Craiyon fine-tuned on Medium articles) when prompted with “Trends in Model Pre-training for Natural Language Understanding”

The result is… well… fine. While we did manage to break out of the teal-background flat illustration prison, and the output of the fine-tuned model does allude to code and patterns, it isn’t a particularly witty metaphor either. But this is evidence that further work (more training data, or a different approach like soft prompting) could yield even better results.

The broader takeaway is that fine-tuning might not be completely obsolete just yet; together with prompt engineering, it remains a useful tool to have in the machine learning toolbox, even in the context of massive pre-trained text-to-image models.
