Kandinsky 3.0 — a new model for generating images from text

anvilarth
11 min read · Dec 15, 2023


Without a sense of modernity, an artist will remain unrecognised.
Mikhail Prishvin

Last year, we introduced Kandinsky 2.0, the first multilingual diffusion model for text-to-image generation capable of generating images from Russian-language prompts. It was followed by new versions, Kandinsky 2.1 and Kandinsky 2.2, which differed significantly from version 2.0 in quality and features and were major milestones for our team on the way to better generation quality.

But infinity is not the limit, and there is always room to grow. The number of scientific papers and engineering solutions related to image and text generation keeps increasing, and new generation problem statements keep emerging: video, 3D objects and even 4D generation. Generative learning is taking up more and more of the information space. Recently, our team introduced the Deforum-Kandinsky approach for generating animated videos on top of the base text-to-image model, as well as an approach for creating zoom-in / zoom-out videos based on the Kandinsky inpainting model. In parallel with this release, we are also introducing Russia's first end-to-end text-to-video generation model, Kandinsky Video.

Despite all this, generating images from text still poses serious challenges to researchers. Each year the bar rises: today images are expected to be unprecedentedly realistic and to satisfy the most fastidious and inventive users.

So, one year after the release of our first diffusion model, we present a new version of our text-to-image model: Kandinsky 3.0! It is the result of a long stretch of work that our team carried out in parallel with developing Kandinsky 2.1 and 2.2. We ran many experiments on the choice of architecture and did extensive work with the data to improve text understanding and generation quality while making the architecture itself simpler and more concise. We have also made the model more "domestic": it navigates the Russian cultural context much better.

In this article, we will briefly describe the key points of the new architecture, its training procedure and our data strategy, and, of course, demonstrate the capabilities of the model with example generations.

Model architecture and training

Kandinsky 3.0 is a diffusion model for text-to-image generation, like all models of the Kandinsky 2.X family. The goal of training such a model is to learn to reconstruct a real image that has been corrupted with noise during the forward diffusion process. In Kandinsky 3.0 we moved away from the two-stage generation scheme used in the previous versions. Let me remind you how two-stage generation works:

  • Diffusion Mapping (Image Prior): generates a latent image vector, the one obtained by encoding an image with the visual part of the CLIP model, taking as input the latent text vector produced by the text part of CLIP.
  • Decoder (U-Net): generates an image (to be more precise, its latent representation) from the CLIP image vector.

In Kandinsky 3.0, image generation is conditioned directly on the encoded text tokens. This approach simplifies training, since now only one part of the model (namely the decoder) needs to be trained. It also greatly improves text understanding, because we can now use a powerful language model trained on a large corpus of high-quality text instead of the CLIP text encoder, which was trained on rather primitive captions that are very different from natural language.
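To make the conditioning scheme concrete, here is a minimal sketch (not the authors' training code) of encoding a prompt with a frozen language-model encoder and passing its hidden states to a denoising U-Net through cross-attention; the "google/flan-ul2" checkpoint name and the commented U-Net call are illustrative assumptions.

```python
# Minimal sketch (not the authors' training code): condition the denoiser on
# frozen language-model embeddings instead of CLIP text embeddings.
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Flan-UL2 follows the T5 encoder-decoder architecture; only the encoder is used.
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
text_encoder = T5EncoderModel.from_pretrained("google/flan-ul2").eval()
for p in text_encoder.parameters():
    p.requires_grad_(False)  # the text encoder stays completely frozen

prompt = ["a bear in a Russian national hat with a balalaika"]
tokens = tokenizer(prompt, padding=True, return_tensors="pt")
with torch.no_grad():
    text_states = text_encoder(**tokens).last_hidden_state  # (batch, seq_len, dim)

# The U-Net attends to `text_states` through cross-attention in every block
# while predicting the noise added to the image latent, roughly:
#   noise_pred = unet(noisy_latent, timestep, encoder_hidden_states=text_states)
```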

In addition to updating the text-encoding approach, we carried out a very large study of the U-Net architecture responsible for removing noise from the image. The main dilemma was which type of layer should hold the bulk of the network parameters: transformer layers or convolutional layers. When trained on large amounts of data, transformers perform better on images, yet the U-Nets of almost all diffusion models are predominantly convolutional. To resolve this dilemma, we analysed different architectures and took note of the following models:

  • ResNet-18 / ResNet-50: an architecture well known to everyone, but with one noteworthy detail. The convolutional blocks of the small version differ from those of the large version by the presence of a bottleneck that compresses the number of channels before the tensor is processed by a 3×3 convolution. This reduces the number of parameters and therefore allows a deeper network, which in practice gives better training results.

  • CoAtNet: an architecture that combines convolutional and attention blocks. Its main idea is that at the early stages the image should be processed by local convolutions, while its already compressed representation is handled by transformer layers that provide global interaction between image elements.

  • MaxViT: an architecture built almost entirely of transformer blocks, adapted to images by reducing the quadratic complexity of self-attention.

The idea of drawing on classification models was inspired by the fact that many good architectural solutions are taken from models that score highly on the ImageNet benchmark. However, our experiments showed that this quality transfers ambiguously: MaxViT, the best of these architectures on classification, does not perform nearly as well on generation once turned into a U-Net. Having investigated all the architectures above, we settled on the ResNet-50 block as the basic U-Net block, supplementing it with an additional convolutional layer with a 3×3 kernel, an idea borrowed from the BigGAN paper.
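For illustration, here is a minimal sketch of a BigGAN-deep-style residual block with a ResNet-50 bottleneck and the extra 3×3 convolution described above; the normalisation, activation and compression factor are my own assumptions, not the exact Kandinsky 3.0 block.

```python
# Minimal sketch (assumptions, not the exact Kandinsky 3.0 block): a
# BigGAN-deep-style residual block with a bottleneck and an extra 3x3 conv.
import torch
from torch import nn

class BottleneckResBlock(nn.Module):
    def __init__(self, channels: int, compression: int = 4):
        super().__init__()
        hidden = channels // compression  # bottleneck: compress the channels
        self.body = nn.Sequential(
            nn.GroupNorm(32, channels), nn.SiLU(),
            nn.Conv2d(channels, hidden, kernel_size=1),            # 1x1 "squeeze"
            nn.GroupNorm(32, hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.GroupNorm(32, hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),   # the extra 3x3 conv
            nn.GroupNorm(32, hidden), nn.SiLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),            # 1x1 "expand"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # residual connection

x = torch.randn(1, 256, 32, 32)
print(BottleneckResBlock(256)(x).shape)  # torch.Size([1, 256, 32, 32])
```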

In the end, the Kandinsky 3.0 architecture consists of three main parts:

  • Flan-UL2, a language model with an encoder-decoder architecture. For text encoding we take only the encoder, which accounts for half the parameters of the whole model. In addition to pre-training on a text corpus, this version was also fine-tuned in SFT style on a large corpus of language tasks. Our experiments showed that SFT significantly improves image generation. The language model was kept completely frozen while the rest of the model was trained.
  • A U-Net with the architecture illustrated below, consisting predominantly of BigGAN-deep blocks. This makes the architecture twice as deep as diffusion models built on conventional BigGAN blocks while keeping the same number of parameters.
  • As the autoencoder, we used Sber-MoVQGAN, which proved itself well in the previous versions.
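For those who want to try the model, a minimal inference sketch is below; it assumes the public diffusers integration and the kandinsky-community/kandinsky-3 checkpoint on the Hugging Face Hub, so check the official repository for the exact, up-to-date usage.

```python
# Minimal inference sketch, assuming the public diffusers integration and the
# "kandinsky-community/kandinsky-3" checkpoint on the Hugging Face Hub.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3",
    variant="fp16",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a beautiful fairy-tale desert, in the sky a wave of sand merges with the milky way",
    num_inference_steps=25,
).images[0]
image.save("kandinsky3_sample.png")
```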
Comparison with competitors

Data

Training used a large number of text-image pairs collected from the internet. The data were passed through numerous filters: image aesthetics, image-text matching, duplicates, resolution and aspect ratio. Compared to Kandinsky 2.2, we extended the datasets, enriched them with new data, added Russian entities, and added images whose descriptions were generated by state-of-the-art multimodal models.

The training process was divided into several stages, which allowed us to use more training data, as well as to generate images of different sizes.

  • 256 × 256: 1.1 billion text-image pairs, batch size 20, 600k steps, 100 A100 GPUs
  • 384 × 384: 768 million text-image pairs, batch size 10, 500k steps, 100 A100 GPUs
  • 512 × 512: 450 million text-image pairs, batch size 10, 400k steps, 100 A100 GPUs
  • 768 × 768: 224 million text-image pairs, batch size 4, 250k steps, 416 A100 GPUs
  • Mixed resolution, 768 ≤ width × height ≤ 1024: 280 million text-image pairs, batch size 1, 350k steps, 416 A100 GPUs
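As a quick back-of-the-envelope check (my own arithmetic, under the assumption that the listed batch size is per GPU), the number of samples seen at each stage can be compared with the number of unique pairs:

```python
# Back-of-the-envelope check (my own arithmetic, assuming the listed batch
# size is per GPU): samples seen per stage = batch_size * num_gpus * steps.
stages = [
    ("256x256", 20, 100, 600_000, 1.1e9),
    ("384x384", 10, 100, 500_000, 768e6),
    ("512x512", 10, 100, 400_000, 450e6),
    ("768x768", 4, 416, 250_000, 224e6),
    ("mixed",   1, 416, 350_000, 280e6),
]
for name, bs, gpus, steps, pairs in stages:
    seen = bs * gpus * steps
    print(f"{name:>8}: {seen / 1e9:.2f}B samples seen vs {pairs / 1e9:.2f}B pairs "
          f"(~{seen / pairs:.1f} passes over the stage's data)")
```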

Generation examples

A beautiful landscape outdoors scene in the crochet knitting art style, drawing in style by Alfons Mucha
Car, mustang, movie, person, poster, car cover, person, in the style of alessandro gottardo, gold and cyan, gerald harvey jones, reflections, highly detailed illustrations, industrial urban scenes
beautiful fairy-tale desert, in the sky a wave of sand merges with the milky way, stars, cosmism, digital art, 8k
Abstract painting composed in yellow and red, black and white and green shades, in the style of red and orange, abstract figurative master, igbo art, frenzied action painting, Australian native, kangaroo, cactus frayed, angura kei
white background image and Daz3d style inflatable Kitty cat sweating doll, simplified Kitty cat image, ultra high definition image, transparent/ semi transparent medium, 8k, c4d, oc, blende
a yellow house at the edge of the danish fjord, in the style of eiko ojala, ingrid baars, ad posters, mountainous vistas, george ault, realistic details, dark white and dark gray, 4k
dragon fruit head, upper body, realistic, illustration by Joshua Hoffine Norman Rockwell, scary, creepy, biohacking, futurism, Zaha Hadid style
purple flower sitting on top of a lush green field, inspired by Mike Winkelmann, cactus, cute c4d, sea punk, pink landscape, polished pristine waters, enchanting dream, dreams. instagram, desert oasis, cgsocciety, digital art, 3d rendering, 4k

Comparison results and generation examples

To compare models, we assembled a balanced set of 2,100 prompts across 21 categories and compared different Kandinsky 3.0 checkpoints to select the best one. For this, we conducted three side-by-side runs with 28 annotators. Once the best version of Kandinsky 3.0 was selected, a side-by-side comparison with Kandinsky 2.2 was conducted. Twelve people took part in that study and cast a total of 24,800 votes. For this, a bot was developed that showed one of the 2,100 image pairs at a time, and each person chose the better image based on two criteria:

  • relevance to the text,
  • the visual quality of the image.

Comparisons were made, both in terms of visual quality and text comprehension, in total across all categories and for each category separately.
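Purely as an illustration of how such votes can be aggregated into per-category win rates (the record structure and field names here are hypothetical, not the authors' annotation tooling):

```python
# Illustrative only: aggregating side-by-side votes into per-category win
# rates. The Vote record and its fields are hypothetical.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Vote:
    category: str  # one of the 21 prompt categories
    winner: str    # e.g. "kandinsky-3.0" or "kandinsky-2.2"

def win_rates(votes: list[Vote], model: str = "kandinsky-3.0") -> dict[str, float]:
    wins, totals = defaultdict(int), defaultdict(int)
    for v in votes:
        totals[v.category] += 1
        wins[v.category] += int(v.winner == model)
    return {cat: wins[cat] / totals[cat] for cat in totals}

votes = [Vote("animals", "kandinsky-3.0"), Vote("animals", "kandinsky-2.2"),
         Vote("food", "kandinsky-3.0")]
print(win_rates(votes))  # {'animals': 0.5, 'food': 1.0}
```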

Below are example generations from popular models compared to Kandinsky 3.0:

A beautiful girl
A highly detailed digital painting of a portal in a mystic forest with many beautiful trees. A person is standing in front of the portal.
A man with a beard
A 4K dslr photo of a hedgehog sitting in a small boat in the middle of a pond. It is wearing a Hawaiian shirt and a straw hat. It is reading a book. There are a few leaves in the background.
Barbie and Ken are shopping
Extravagant mouthwatering burger, loaded with all the fixings. Highlight layers and texture
A bear in an Russian national hat with a balalaika

Inpainting + Outpainting

Our team did separate work on inpainting/outpainting models for the Fusion Brain website, with whose help you can edit images: change particular objects and whole areas inside an image (the inpainting approach), or extend an image with new details, up to huge panoramas, with the outpainting approach. Inpainting is considerably harder than standard generation, because the model has to learn to generate not only from text but also from the image context.

To train the inpainting part of the model, we used the GLIDE approach, previously implemented both in the Kandinsky family and in the Stable Diffusion family of models: the input layer of the U-Net is modified so that it can additionally accept the image latent and the mask. The U-Net thus takes 9 channels as input: 4 for the original latent, 4 for the image latent, and one more channel for the mask. That is essentially the whole modification; the rest of the training does not differ from training a standard diffusion model.
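A minimal sketch of assembling such a 9-channel input, as my own illustration of the scheme above:

```python
# Minimal sketch (my own illustration) of the 9-channel inpainting input:
# 4 noisy-latent channels + 4 masked-image-latent channels + 1 mask channel.
import torch

B, H, W = 2, 64, 64
noisy_latent = torch.randn(B, 4, H, W)  # latent being denoised
image_latent = torch.randn(B, 4, H, W)  # latent of the known image
mask = torch.ones(B, 1, H, W)           # 1 = keep, 0 = region to repaint
mask[:, :, 16:48, 16:48] = 0            # e.g. a square hole in the middle

unet_input = torch.cat([noisy_latent, image_latent * mask, mask], dim=1)
print(unet_input.shape)                 # torch.Size([2, 9, 64, 64])
# The U-Net's first convolution is widened from 4 to 9 input channels.
```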

An important feature of the task is how the masks and the accompanying text are generated during training. Users either draw a mask with a brush or request a new image area via outpainting. To take this behaviour into account, during training we created special masks that mimic it: arbitrarily shaped brush-painted masks, object masks, and full-image filling.
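For example, a brush-like training mask can be simulated with a simple random walk of a thick brush tip; the sketch below is only an illustration of the idea, not the authors' exact procedure.

```python
# Illustrative only: a random brush-stroke mask drawn as a random walk of a
# thick brush tip over the mask canvas.
import math
import random
import numpy as np

def random_brush_mask(h: int = 512, w: int = 512, strokes: int = 6,
                      radius: int = 20) -> np.ndarray:
    mask = np.zeros((h, w), dtype=np.float32)  # 1 marks the region to repaint
    y, x = random.uniform(0, h - 1), random.uniform(0, w - 1)
    for _ in range(strokes):
        angle = random.uniform(0, 2 * math.pi)
        for _ in range(random.randint(20, 100)):
            y = float(np.clip(y + math.sin(angle), 0, h - 1))
            x = float(np.clip(x + math.cos(angle), 0, w - 1))
            y0, y1 = max(0, int(y) - radius), min(h, int(y) + radius)
            x0, x1 = max(0, int(x) - radius), min(w, int(x) + radius)
            mask[y0:y1, x0:x1] = 1.0  # stamp a square "brush" tip
    return mask

print(random_brush_mask().mean())  # fraction of masked pixels
```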

As a result, the model copes well both with replacing objects in an image and with extending it (see the examples below).

Inpainting examples

rocket
a big ship sailing in the river
a robot sitting on a bench

Outpainting examples

a futuristic cityscape at sunset with towering skyscrapers
a serene beach sunset with palm trees and gentle waves
a mystical forest with towering ancient trees and glowing mushrooms

Deforum

With the introduction of Kandinsky 3.0, we’ve also updated Deforum, a technology that allows us to generate animated videos through an image-to-image approach.

The main difficulty in adapting the framework to the new model was the difference in how noise is added during the diffusion process: Kandinsky 2.2 adds noise according to a linear schedule (top picture), while Kandinsky 3.0 uses a cosine schedule (bottom picture). Adapting to this difference required a lot of experiments.
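To show what the difference amounts to, here is a minimal sketch of the two schedules using the standard textbook formulas (the DDPM linear beta schedule and the cosine schedule of Nichol and Dhariwal); the exact values used in Kandinsky may differ.

```python
# Minimal sketch of the two noise schedules mentioned above, using standard
# textbook formulas rather than Kandinsky's exact configuration.
import math
import torch

T = 1000

# Linear schedule: betas grow linearly; alphas_cumprod is their running product.
betas_linear = torch.linspace(1e-4, 0.02, T)
ac_linear = torch.cumprod(1.0 - betas_linear, dim=0)

# Cosine schedule: alphas_cumprod follows a squared-cosine curve directly.
s = 0.008
t = torch.arange(T + 1) / T
f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
ac_cosine = f[1:] / f[0]

for step in (0, 250, 500, 750, 999):
    print(step, round(ac_linear[step].item(), 3), round(ac_cosine[step].item(), 3))
# The cosine schedule destroys information more gradually at early steps, which
# is why image-to-image settings tuned for the linear schedule needed re-tuning.
```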

Animation examples

Beautiful woman, dark hair, freckles, floral crown of large peonies and roses, beautiful gradient pink background, overhead lighting, professional photography, studio photography, 4k; Mode: “live”
extreme detail, 8k, ultra quality, masterpiece, depth of field, smooth lighting, illustration, very cute realistic cheburatka in a jacket with a tangerine in his hands, intricate sharp details, b dimensional lighting, incredibly detailed scale, incredibly detailed eyes, big ears, incredibly detailed close-up view, rainbow light, detailed clear coat, snow, winter city, fireworks, lights, sparklers, garlands, joy, laughter, smile, kindness, happiness
super beautiful winter forest

Conclusion and plans

We have introduced our new text-to-image generation architecture, Kandinsky 3.0. Compared to our previous models, it understands text and the Russian cultural context significantly better, and we will definitely continue to move in this direction. On the scientific side, our plans include creating yet another new generation model, which will have its own new word to say in the AI arena.
The field of artificial intelligence and generative learning opens up a wide space for further development and, who knows, maybe in the near future models like our Kandinsky will shape a new reality, one not much different from the present. What the implications of these changes will be for humans is difficult to judge without slipping into a lot of dubious speculation. As researchers, we would caution against both overly pessimistic and overly optimistic predictions. What we are sure of is that this development will in any case be very interesting and will require a change of perspective on many things around us. We, all of humanity, have yet to realise the full power of generative learning. Stay tuned so that you don't miss how the world changes, including through our efforts!

Authors and their contributions

The Kandinsky 3.0 model was developed by the Sber AI team in partnership with scientists from the AIRI Institute for Artificial Intelligence, using a combined dataset from Sber AI and SberDevices.
The team of authors: Vladimir Arkhipkin, Vyacheslav Vasiliev, Andrei Filatov, Anastasia Maltseva, Said Azizov, Arseniy Shakhmatov, Igor Pavlov, Mikhail Shoitov, Yulia Agafonova, Sergey Nesteruk, Anastasia Lysenko, Ilya Ryabov, Angelina Kuts, Sofia Kirilova, Sergey Markov, Andrey Kuznetsov and Denis Dimitrov.

Useful links
