How we trained YandexART to create images that people like

Sergey Kastryulin
Yandex
Apr 27, 2024

My name is Sergey Kastryulin, and I’m part of the Yandex Research team. My work focuses on computer vision and generative neural network research. In this article, I’ll tell you about the approaches behind YandexART — an image generation technology that creates images and animations from textual descriptions. YandexART powers the Shedevrum app and is available in Yandex Ads and Yandex Market. I’ll explain how we improved the efficiency of this neural network and evaluated the quality of its outputs. This article will interest experts and those looking to test the API in the cloud and incorporate image generation modules into their services and web applications.

A brief overview of YandexART’s evolution

The first experiments with image generation models at Yandex Research began two years ago. Back then, our department had accumulated some expertise in diffusion models, and the industry saw inspiring examples of what such generative technologies could do: first Imagen and DALL-E 2, then Midjourney, Stable Diffusion, and others. Over these two years, we’ve gone through several stages in our journey.

  1. The initial experiments began with a latent model, a precursor of Stable Diffusion. As our initial data sample, we used a dataset created from user queries in a search engine. Essentially, we had a set of raw “text-image” pairs taken from the internet. However, this dataset had its flaws: the data was “noisy” and low-quality, with images and texts that were poorly matched. As a result, the model trained on it underperformed across most metrics.
  2. Then, we switched to open datasets right around when LAION emerged. Looking back, we realize that this dataset was far from perfect. But at the time, it was the only option, and with LAION, we achieved initial results that we were pleased with regarding generation quality.
  3. Having gained some experience, we switched to our dataset-building pipeline. From then on, open datasets only comprised a small part of our entire dataset.
  4. Moving forward, we started some serious experiments with cascaded diffusion.

A couple of years of experimentation helped us define what constitutes a high-quality dataset for generating beautiful images. In the process, we also established our criteria for this “beauty.” But beyond that, we were also constantly addressing the challenge of resource optimization. Diffusion models are often trained on large datasets, a process that demands powerful GPUs and a significant amount of time. Our focus has been continually exploring the balance between data quality and quantity during the pre-training stage, allowing for scalability through training on small sets of high-quality images.

Shedevrum is currently powered by YandexART 1.2, essentially the seventh iteration of the model (based on our internal numbering). We also use this neural network in many Yandex services and tasks:

  • In Yandex Business, it helps you choose a ready-made image from the neural photo stock or generate a new picture in one click.
  • In Yandex Browser, it helps you generate images when speaking with Alice, the virtual assistant.
  • In Yandex Market, we used the Outpainting mechanism to test the background generation feature for product cards.
This is how Outpainting works in YandexART
  • One major e-commerce network is currently evaluating the YandexART API for designing gift cards in a closed test.
  • We also use generated images in other ML tasks, such as data augmentation.

Now let’s take a look at what’s under the hood of our image generation system. I’ll share our most exciting breakthroughs, how and why the training data had to be heavily filtered, and how we worked on optimizing the model’s performance to meet our output requirements.

Finding the right approach to architecture

In terms of architecture, we had two approaches to choose from, and we tried both:

  • Latent model. This model uses a pre-trained variational autoencoder (VAE), which has an internal representation with a spatial resolution (64×64, for example) and several channels (usually four or 16). You can generate a latent code in this space and decode it into a high-resolution image. Stable Diffusion XL and DALL-E 3 are examples of networks that use this paradigm.
  • Cascaded diffusion. Here, several models work sequentially: the pixel Text-to-Image model generates a low-resolution image, and subsequent models in the cascade upsample it to the desired size. This is how Imagen from Google and DALL-E 2 work.

The first approach is quite common, but image quality here is limited by the original performance of the VAE, which is often low. The VAE can be trained further to improve quality, but this is not always effective, for two reasons. First, by using only one upscaling model, we are forced to train it for high upscaling factors (8x upscaling is typical today). This means the model must reconstruct 64 pixels from each latent code, which is challenging. Second, the task has to be performed by a model with a relatively small number of parameters; otherwise, working with large activation maps in the layers close to the target resolution causes GPU memory issues.

Furthermore, the latent approach often leads to the common issue of oversaturation: images may appear excessively bright, almost overexposed, and can look unnatural. You also have to deal with color balance issues.

Multichannel autoencoders are good at reconstructing images, but training diffusion on them is more challenging.

In a cascaded approach, the decoder function is performed by Super Resolution models, which operate independently of the primary pixel-space generative model. This autonomy allows for the natural interchangeability of the super-resolution blocks. You can train each model independently, breaking the high-level upscaling task into several intermediate pieces. This provides flexibility in allocating computing resources and simplifies the experimentation process. In addition, each cascade stage solves a specific task, so you can pick models and datasets that best fit each stage.

Thus, we settled on a cascaded approach and, through trial and error, arrived at a cascade of three elements:

  • First, we generate a 64×64-pixel image from the prompt. To do this, we apply the GEN64 model, which follows the U-Net architecture and is conditioned on the text input via the cross-attention mechanism.
  • Then, we use the SR256 model to upscale the image to 256×256. This step is also powered by a U-Net-like architecture conditioned on text: the 64×64 image from the first stage and the prompt together serve as the condition for generation. This allows us to take the user’s text into account when upscaling, adding details, refining the image, or correcting errors from the first stage.
  • In the third stage, we upscale again to 1024×1024. Here, we use the Efficient U-Net architecture, which no longer takes text into account and has fewer parameters in the layers operating at the highest resolutions. At this stage, we no longer need to generate large-scale structure, which reduces the computational complexity and the memory cost per batch.

Like in Imagen, we use pre-trained text encoders for conditioning on text. For this purpose, we use a 1.3B-parameter model based on the BERT-xlarge architecture, which we call i2t (image-to-text) within our team.

This is our equivalent of CLIP (Contrastive Language-Image Pre-Training) — a deep learning model designed to understand the relationship between images and text. Initially, this model was created for search purposes and was developed over many years.
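To make the flow of the cascade more concrete, here is a minimal inference sketch. The objects i2t_encoder, gen64, sr256, and sr1024, and their sample methods, are hypothetical placeholders for the stages described above, not a public API.

```python
import torch

@torch.no_grad()
def generate(prompt: str) -> torch.Tensor:
    """Hypothetical sketch of three-stage cascaded generation."""
    # 1) Encode the prompt with the pretrained i2t text encoder (BERT-xlarge, 1.3B parameters)
    text_emb = i2t_encoder(prompt)

    # 2) GEN64: text-conditioned U-Net diffusion in 64x64 pixel space
    img_64 = gen64.sample(text_emb, shape=(3, 64, 64))

    # 3) SR256: diffusion upsampler conditioned on both the 64x64 image and the text
    img_256 = sr256.sample(text_emb, low_res=img_64, shape=(3, 256, 256))

    # 4) SR1024: Efficient U-Net upsampler without text conditioning
    img_1024 = sr1024.sample(low_res=img_256, shape=(3, 1024, 1024))
    return img_1024
```

The key design point is that each stage can be trained, evaluated, and swapped out independently.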

Data preparation is the most crucial factor affecting training and determining the neural network’s performance quality. So, let me give you a step-by-step summary of our approaches to datasets, training itself, and evaluating data quality in terms of aesthetics, image-text relevance, and absence of artifacts. Let’s start with a large dataset combining publicly available and proprietary datasets used for pre-training.

Data curation strategy and quality assessment

At first, we had a massive dataset of almost a trillion “image-text” pairs, roughly equivalent to a dump of the entire visible web. However, our initial experiments revealed that training models even on a billion low-quality samples was ineffective. To improve the quality of our dataset, we had to curate the best pairs. But to do this, we needed to understand what we consider “best” in the first place.

Experimentally, we deduced that image quality, text quality, and text-to-image relevance should be evaluated separately. We designed a multi-stage filtration process for the source data. For filtering, we pre-computed a set of predictors for each pair: classifiers trained on different features of images and texts, each associated with some quality characteristic of the sample.

Image filtering

  1. First, we used our classifiers to remove inappropriate, indecent, or otherwise questionable images from the dataset. Removing unwanted content is the most reliable safeguard for future generations: if the model never sees such content, it cannot generate it later.
  2. Then, we used the SAC (Simulacra Aesthetic Captions) dataset for rough filtering. An ensemble of image-based predictors with pre-trained weights, fine-tuned on SAC, predicted the overall attractiveness of each image to humans. We tested these predictions against independent human assessments during our experiments and generally found a good match. Based on this criterion, we retained only the top third of the images, as most pairs below this threshold appeared unattractive or ambiguous to the evaluators.
  3. In the final set, we only included images with dimensions in the range of [512, 10240] pixels and an aspect ratio in the range of [0.5, 2]. As confirmed by preliminary experiments, such images were the most useful for model training, and this simplified the data preparation pipeline.
  4. Next, we used the classifiers that focused more on the technical quality of the image. We had several parameters: noise level, blurriness, presence of watermarks, checkered background, and degree of compression.
  5. We understood the importance of aesthetic appeal from the first experiments, so we used aesthetic classifiers trained on the public AVA and TAD66k datasets. Threshold values for filtering were picked manually, maintaining the balance of quality and size of the dataset.

In the example below, you can see why we needed several classifiers trained on different datasets right away:

For example, one of the aesthetic classifiers “loves” nature and doesn’t classify non-nature images as “beautiful.” If we relied only on its predictions, the model wouldn’t learn to generate anything other than nature, and we wouldn’t want that

  6. We also wanted our model to generate complex, highly detailed scenes. To select for this, we used an Image Complexity classifier that we trained on the IC9600 dataset.
  7. We also explicitly monitored the monotony of the background. To do this, we divided the dataset into two parts: one with monotonic backgrounds and the other with more “interesting” backgrounds. After experimenting, we decided that only 10% of our training data would contain monotonic backgrounds. We wanted the model to generate interesting and diverse backgrounds first and foremost, while still preserving the ability to create images with a monotonic background.
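Putting the image-side criteria above together, a much-simplified filter might look like the sketch below. All field names and threshold values are illustrative, not the ones we actually used; in practice each predictor is a separate trained classifier and the thresholds were tuned by hand.

```python
# Illustrative thresholds; the real values were picked manually to balance
# dataset quality against dataset size.
NSFW_MAX = 0.1
AVA_MIN = 5.0
TAD66K_MIN = 5.0
COMPLEXITY_MIN = 0.4

def passes_image_filters(s: dict) -> bool:
    w, h = s["width"], s["height"]
    if not (512 <= min(w, h) and max(w, h) <= 10240):   # size range [512, 10240]
        return False
    if not (0.5 <= w / h <= 2.0):                       # aspect ratio range [0.5, 2]
        return False
    if s["nsfw_score"] > NSFW_MAX:                      # unwanted-content classifiers
        return False
    if s["sac_percentile"] < 2 / 3:                     # keep the top third by SAC-based aesthetics
        return False
    if s["watermark"] > 0.5 or s["blur"] > 0.5 or s["compression"] > 0.5:
        return False                                    # technical-quality classifiers
    if s["ava_score"] < AVA_MIN or s["tad66k_score"] < TAD66K_MIN:
        return False                                    # aesthetic classifiers (AVA, TAD66k)
    return s["ic9600_complexity"] >= COMPLEXITY_MIN     # image-complexity classifier
```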

So far, we’ve covered just the visual selection criteria. We continue to filter down further.

Text filtering. Here, things were, if anything, even worse. The original dataset was derived from texts found alongside images on the internet. As a result, only partially relevant texts could end up in the dataset, along with irrelevant technical information and hashtags. At the same time, we needed the texts to resemble the search queries that bring users to the image.

First, we focused on English texts, using a proprietary language classifier to identify the language of the text. Then, we manually annotated a random sample of around 4.8K texts, adding either a cleaned-up version of the original text or an empty label (which meant the text was unsuitable for training). We then filtered out the lines with empty labels and used the cleaned-up texts to form a training dataset. Finally, we fine-tuned a small language model with 180M parameters on this dataset and used its predictions as a text quality factor. All pairs containing non-English or incorrect text (as recognized by the classifier) were removed from the dataset.
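As a rough sketch of this stage, assuming a language-identification model and a fine-tuned quality classifier are available (both checkpoint names below are placeholders, not the proprietary models we used):

```python
from transformers import pipeline

# Placeholder checkpoints standing in for the proprietary language classifier
# and the 180M-parameter caption-quality model fine-tuned on ~4.8K annotated texts.
lang_id = pipeline("text-classification", model="placeholder/lang-id")
caption_quality = pipeline("text-classification", model="placeholder/caption-quality-180m")

def keep_caption(text: str, min_score: float = 0.5) -> bool:
    # Drop non-English texts
    if lang_id(text, truncation=True)[0]["label"] != "en":
        return False
    # Drop texts the quality classifier considers unsuitable for training
    pred = caption_quality(text, truncation=True)[0]
    return pred["label"] == "good" and pred["score"] >= min_score
```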

Combined approach. By this point, we had a relatively low-quality set of 2.3 billion image-text pairs that required further filtering. We manually annotated 66K pairs to refine the dataset further, reflecting the images’ visual appeal and the texts’ relevance. We used a rating system on a scale of one to three: good, okay, and bad. Unlike the frequently used Likert scale, this simplified scale allowed us to find the right balance between informativeness and diversity of ratings from real people.

After that, we trained a CatBoost model on 56 factors, including six variations of the CLIP Score, 38 text-only factors, and 12 image-only factors. We ended up with a classifier that evaluates how suitable a data sample is for training. We called it the Sample Fidelity Classifier (SFC).
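A sketch of what training such a classifier could look like with CatBoost; the hyperparameters, variable names, and the final cut-off below are illustrative, not the ones from our setup:

```python
import numpy as np
from catboost import CatBoostClassifier

# X_train: (66_000, 56) matrix of precomputed factors
#   (6 CLIP-score variants + 38 text-only + 12 image-only factors per pair).
# y_train: human labels on the three-point scale ("good", "okay", "bad").
sfc = CatBoostClassifier(loss_function="MultiClass", iterations=1000, depth=6)
sfc.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=100)

# Score every candidate pair and keep those above a manually chosen threshold.
good_idx = list(sfc.classes_).index("good")
scores = sfc.predict_proba(X_all)[:, good_idx]
keep_mask = scores >= np.quantile(scores, 0.85)  # illustrative cut-off
```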

We sorted all the pairs from the previous filtration stage according to the SFC model’s predictions and selected the best ones. Ultimately, we set the threshold values so that the final dataset included 300 million images with non-monotonic backgrounds and 30 million images with monotonic backgrounds.

We arrived at this data preparation scheme through trial and error and numerous experiments. After that, we started pre-training models in the cascade.

Model Training

Our architecture has three models in the cascade:

  • GEN64. The primary generation model with 2.3 billion parameters (the 2.3B model).
  • SR256. The first upsampling model with 700 million parameters.
  • SR1024. The second upsampling model, also with 700 million parameters.

We used one dataset to train GEN64 and SR256 and a different one for SR1024. The SR1024 dataset was filtered with a slightly different set of classifiers because we wanted to pay more attention to the technical quality of the images: the absence of noise, blurring, and compression artifacts.

Further training aimed to enhance the model’s specific characteristics, such as adherence to the prompt or the aesthetic quality of generated images. Here, we also needed two datasets: one for the GEN64 and SR256 models and another for SR1024.

Preparing the dataset for fine-tuning

We leveraged ML models and help from assessors to improve the dataset.

  • First, we filtered the initial dataset using a set of classifiers and reduced its volume to several hundred thousand pairs. We tried to pick images from different categories: nature, goods, interiors, cars, food, etc.
  • We then asked the assessors to filter out defective images where the object’s size, the position of the limbs, or the expression on a person’s or animal’s face seemed unnatural.
  • After that, we also asked the assessors to rephrase the descriptions: remove unnecessary words and provide detailed descriptions of objects, their characteristics, actions, interactions between them, their background and surroundings, and the style of the image.

The resulting 50K pairs were of extremely high quality, especially in terms of image-text relevance. We used these pairs for Supervised Fine-Tuning, which significantly improved the model’s adherence to the prompt. But we still felt that this wasn’t enough.

Numerous studies show that fine-tuning diffusion models on a cleaned-up curated dataset significantly improves the quality of generated images. However, other methods are needed to enhance the images further. In the final stage, we used Reinforcement Learning (RL) to improve aesthetic quality and reduce defects in image generation.

How can we encourage the neural network to generate more aesthetically pleasing images?

In the Reinforcement Learning formulation, we adopted the DDPO approach and used a PPO loss with ε = 0.5. This is notable because ε is typically set to 0.1 to mitigate the noise in the models’ “reward” values.

However, we discovered that using ε = 0.5 can considerably accelerate fine-tuning on image-related tasks, speeding up the overall process.
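For reference, here is a minimal sketch of the PPO-style clipped objective used in DDPO-like setups. This is the generic formulation, not our exact implementation; the function name is illustrative.

```python
import torch

def ddpo_clipped_loss(logp_new, logp_old, advantages, eps=0.5):
    # Importance ratio between the current policy and the one used for sampling
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Standard PPO clipping; eps=0.5 is the wider clip range discussed above
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```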

We used three reward models:

  1. Image-text relevance (Relevance), measured with OpenCLIP ViT-G/14.
  2. “Absence of defects” (Consistency).
  3. “Beauty” (Aesthetics).

The second and third are our own reward models trained on user preferences. We used manual data annotation here, calculated the loss independently for each “reward,” and then took a weighted average. Our goal was to enhance aesthetic quality and reduce defects in the generated images without compromising relevance. Experimentally, we found that a weighting that scales the reward values of the models to the same level meets this requirement.
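Conceptually, the combination could look like the sketch below, reusing ddpo_clipped_loss from the previous snippet; the weights are assumed to be chosen so that the three reward signals land on roughly the same scale.

```python
def combined_rl_loss(logp_new, logp_old, advantages: dict, weights: dict, eps: float = 0.5):
    # advantages: per-reward advantage tensors, e.g. keys "relevance", "consistency", "aesthetics"
    # weights: per-reward scaling factors that bring the rewards to a comparable level
    total = 0.0
    for name, adv in advantages.items():
        total = total + weights[name] * ddpo_clipped_loss(logp_new, logp_old, adv, eps)
    return total / sum(weights.values())
```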

The left-hand plot illustrates how the rewards for beauty (Aesthetics) and the absence of defects (Consistency) increase during fine-tuning, while Relevance remains almost unchanged. The right-hand plot shows that the model also becomes more favorable from a user preference standpoint

Moreover, we could trace a dependency between the assessors’ ratings and the change in reward metrics during the reinforcement learning stage. In other words, the assessors’ ratings confirmed that using Reinforcement Learning in the final stage of training improved the overall quality of the generations.

How to evaluate generated image quality

After all the training stages, we must ensure we’ve achieved the desired result and can generate beautiful images. To do this, we need clear and reliable quality criteria. We measured quality using commonly accepted automatic metrics (FID/CLIP score) and our manual annotation tool.

Automatic ratings. The FID and CLIP score metrics are often used to measure both the intermediate (during model training) and final quality of generations. Previous studies (here and here) demonstrate that these metrics correlate poorly with human judgment, and we’ve confirmed this observation in practice. For example, beyond a certain level of quality, FID was practically unable to distinguish between moderately good and very good models. The graph below shows that a model that wins against the baseline only 20% of the time has a slightly lower (nominally better) FID score than models with approximately 50% wins.
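For completeness, FID itself is easy to compute with off-the-shelf tooling such as torchmetrics; the image tensors below are placeholders.

```python
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
# real_images and generated_images are uint8 tensors of shape (N, 3, H, W)
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(fid.compute())
```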

That’s why we considered real-life ratings to be the leading quality criterion. We also needed to establish a strict procedure that would allow us to obtain intuitively understandable, interpretable, and statistically significant results.

Human evaluation. For the final assessments, we sought the participation of non-expert individuals from a crowdsourcing platform. All candidates underwent preliminary training, and only individuals scoring at least 80% on a test with 20 pre-prepared tasks were selected to provide assessments.

All participants were required to compare side-by-side images generated by different neural networks from two prompt sets:

  • DrawBench consists of 200 prompts and has already become the de facto standard for evaluating models that create images based on text.
  • Our own YaBasket-300 set contains prompts divided into the “Common Sense” and “Products” categories, complementing publicly available benchmarks with scenarios that are important from a practical standpoint.
The contents of the YaBasket prompt set: the ratio of top-level categories (a); the Products section is further divided into eight almost equally sized subcategories (b)

We showed each participant image pairs generated by different models based on a single prompt. The participants didn’t know which models they were comparing. After that, they were asked to choose one of the images based on three evaluation criteria in order of importance:

  1. Presence of defects: for example, distortion of objects, limbs, or faces.
  2. Image-to-text relevance.
  3. Aesthetics: the choice and balance of colors, how appealing the background is, and so on.

If the images were identical across these three criteria, the participant marked them as equivalent in quality. Three participants labeled each image pair, and the image with the most votes would receive a point.
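A tiny sketch of how such side-by-side votes can be aggregated into per-model points (hypothetical code, not our annotation tooling):

```python
from collections import Counter

def tally(votes_per_pair):
    """votes_per_pair: list of 3-annotator vote lists; each vote is 'A', 'B', or 'equal'."""
    points = Counter()
    for votes in votes_per_pair:
        winner, _ = Counter(votes).most_common(1)[0]  # majority vote per image pair
        points[winner] += 1
    return points

# Example: three prompts compared between model A and model B
print(tally([["A", "A", "B"], ["B", "B", "B"], ["equal", "A", "A"]]))
# Counter({'A': 2, 'B': 1})
```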

Examples of such pairs

According to the assessors, the YandexART model surpassed Stable Diffusion XL (SDXL) in 77% of cases and Kandinsky v3 in 72%. It generated images of almost equivalent quality to Midjourney v5.

Results for YaBasket-300, October–November 2023:

Results for DrawBench, October–November 2023:

While developing the YandexART 1.2 model, we extensively researched the GEN64 stage and the primary dataset for the model’s training. For example, we checked whether it was possible to achieve the same quality using a model with fewer parameters but a longer training process. We compared the training efficiency of models of different sizes and investigated how the dataset size affects the final quality. The answers to all these questions make little sense if the quality of the pre-trained model is weakly related to the quality of fine-tuning, so we also studied this relationship separately. You can read about these and other experiments in our article on arXiv.

Our work on developing generative technologies at Yandex is ongoing. Currently, the YandexART team is actively developing a new generation of models that are even better at solving user tasks. For example, we recently decided to take another look at latent diffusion. In addition to the drawbacks described above, it also has significant advantages. Such models can work faster because they don’t need computationally expensive diffusion upscaling. They’re also easier to adapt to related tasks such as video generation and image blending.

We also continue to explore ways to improve datasets, use synthetic texts, and select only beautiful images. We experiment with multiple text encoders, model architectures, and sizes. One of the results of our experiments is the beta version of the YandexART 1.3 model, which is already available to the general public in the Shedevrum app and will soon be available in the Foundation Models service.

Another important point is that Yandex Cloud users have had access to testing the YandexART API since April 9. You can test the API in the Foundation Models service, which offers several machine learning models, including YandexGPT for text generation and embeddings for semantic search tasks.
