A Gentle Introduction to Text-to-Image Generation Models

Juan Manuel Reyes
Published in Alphalabs · 9 min read · Apr 4, 2023

Image from Lexica

In recent years, artificial intelligence (AI) has made great advances in the field of image generation, reaching the point of producing hyper-realistic images that imitate artistic styles in unprecedented ways, sparking intense debate in the artistic community.

Before diving into current technologies, we will give a brief introduction to the history of image generation models.

The first image generation models were based on autoencoders, and this structure is still used in the complex neural networks of the most powerful image generation models today. Autoencoders consist of two parts: the encoder, which maps the input data to a code, and the decoder, which reconstructs the original data. Autoencoders trained on images of human faces were able to generate new, realistic faces.
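To make the encoder/decoder split concrete, here is a minimal autoencoder sketch in PyTorch; the layer sizes and the 64×64 grayscale input are illustrative assumptions, not taken from any particular face model:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: maps a flattened 64x64 grayscale image to a small code
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 64, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        # Decoder: reconstructs the image from the code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 64 * 64), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)
        reconstruction = self.decoder(code)
        return reconstruction.view(-1, 1, 64, 64)

# Training minimizes the reconstruction error between the input image
# and its reconstruction, e.g. with nn.MSELoss().
```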

After the autoencoder-based models came generative adversarial networks (GANs), first proposed by Ian Goodfellow in 2014. Yann LeCun, Chief AI Scientist at Meta, called them the most interesting idea of the last 10 years in machine learning. A GAN is made up of two neural networks: the generator, which is in charge of producing realistic images, and the discriminator, which tries to determine whether an image is real or fake. Training GANs is very time-consuming; even in 2017, it could take days to obtain a realistic face.
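To make the generator/discriminator interplay concrete, below is a minimal GAN training step in PyTorch; the architectures, 64×64 image size, and hyperparameters are simplified placeholders, not a production setup:

```python
import torch
import torch.nn as nn

latent_dim = 100
generator = nn.Sequential(           # noise -> fake image (flattened 64x64)
    nn.Linear(latent_dim, 512), nn.ReLU(),
    nn.Linear(512, 64 * 64), nn.Tanh(),
)
discriminator = nn.Sequential(       # image -> probability that it is real
    nn.Linear(64 * 64, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1), nn.Sigmoid(),
)
loss = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images):         # real_images: (batch, 64*64) in [-1, 1]
    batch = real_images.size(0)
    # 1) Train the discriminator to separate real from generated images
    fake = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = loss(discriminator(real_images), torch.ones(batch, 1)) + \
             loss(discriminator(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Train the generator to fool the discriminator
    fake = generator(torch.randn(batch, latent_dim))
    g_loss = loss(discriminator(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```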

Finally, in 2015, a scientific paper introduced diffusion models, inspired by non-equilibrium thermodynamics. These models work by systematically and slowly destroying the structure in a data distribution through an iterative forward diffusion process, and then learning a reverse diffusion process that restores that structure.
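The forward process has a convenient closed form: given a noise schedule, a noisy version of an image at any step t can be sampled directly. Here is a minimal sketch, assuming the linear schedule commonly used in the DDPM literature:

```python
import torch

T = 1000                                  # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def forward_diffuse(x0, t):
    """Sample x_t: a noisier version of the clean image x0 at step t."""
    noise = torch.randn_like(x0)
    a = alphas_bar[t].sqrt()
    s = (1.0 - alphas_bar[t]).sqrt()
    return a * x0 + s * noise             # x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps

# A reverse-diffusion model is trained to predict the added noise and
# undo this corruption step by step, recovering the data's structure.
```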

Today, we can find several tools to generate images through AI, but certain tools on the market stand out for their impressive ability to convert text into images. They can replicate the styles of artists such as Van Gogh, or generate an image extremely similar to one you could take with your own phone.

In this article, we will mainly discuss three tools: DALL·E 2, Midjourney, and Stable Diffusion. We will focus on Stable Diffusion, as it has some unique characteristics that we will address later.

Overview

DALL·E 2 is the model created by OpenAI. It is an AI program that generates images from a textual description of the desired image, effectively interpreting text entered in natural language and producing the corresponding image. This was achieved through the use of GPT-3, a transformer-based model with more than 175 billion parameters designed for natural language interpretation.

Midjourney is an independent research laboratory that explores new ways of thinking and seeks to expand the imaginative powers of the human species. To achieve this goal, it developed an image generation tool driven by artificial intelligence. Unlike others, this tool is characterized by its ability to adapt to different real-world artistic styles. It is ideal for creating environments, especially fantasy and science fiction settings, with spectacular lighting similar to video game concept art.

Stable Diffusion is another text-to-image generation tool, created by Stability AI. It uses a text encoder called CLIP ViT-L/14 to condition the model on the entered text. The model carries out image generation through a diffusion process at runtime: it begins with noise and gradually refines the image until it is completely noise-free and closely matches the description entered by the user.

These three tools are not the only ones on the market; others, such as Imagen from Google or eDiff-I from NVIDIA, also provide good-quality results, but these three have shown the most progress in the field of text-to-image generation. Although they all generate personalized images from entered text, each has its own characteristics that distinguish it from the others. These tools are very effective for creating visual products, turning creative processes that could take many hours of work into tasks done in just seconds, from devising an advertising campaign to creating a product logo or designing buildings.

DALL·E 2

DALL·E 2, as mentioned above, is OpenAI's text-based image generation tool, released in 2022 as the successor to the original DALL·E launched at the beginning of 2021. The tool rests on two main pillars: the Prior, which is responsible for converting the user's input into a representation of the image, and the Decoder, which is responsible for converting that representation into an actual image.

The images and texts used by DALL·E 2 come from another network called CLIP (Contrastive Language-Image Pre-training), also developed by OpenAI. CLIP is a neural network that returns the best description for an input image; that is, it does the opposite of DALL·E 2, which converts text into an image. The objective of CLIP is to identify what the visual and textual representations of the same object have in common.
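As a hands-on illustration of what CLIP does, the sketch below scores how well several captions match an image, using the publicly released ViT-L/14 checkpoint via the Hugging Face transformers library; the image path and captions are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")      # any local image
captions = ["a photo of a cat", "a photo of a dog", "a painting of a city"]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher score = the caption and the image are closer in CLIP's shared space
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```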

The general objective of DALL·E 2 is to train these two models. The first, the Prior, takes the textual description of the image provided by the user and produces a CLIP image embedding. The second, the Decoder, takes that CLIP image embedding and generates an image from it, which is then upscaled using convolutional neural networks to produce the final image.

This tool is capable of generating images that can be very realistic or closely imitate different artistic styles with a simple text input.
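For example, at the time of writing, DALL·E 2 could be called from Python through OpenAI's Images API. The sketch below assumes the 2023-era openai SDK (the interface may differ in later versions); the API key and prompt are placeholders:

```python
import openai

openai.api_key = "YOUR_API_KEY"     # placeholder; use your own key

response = openai.Image.create(
    prompt="a still life of sunflowers in the style of Van Gogh",
    n=1,                            # number of images to generate
    size="1024x1024",
)
print(response["data"][0]["url"])   # URL of the generated image
```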

Midjourney

This is another of the tools we currently have for generating images through text. This tool uses the power of artificial intelligence and machine learning to create incredible images from text-based instructions.

As mentioned at the beginning of the article, this tool produces very characteristic results and is especially good at generating fantasy environments. One of its greatest advantages over other tools is how easy it is to access and use: to generate an image with Midjourney, you only have to join the Midjourney Discord server or, alternatively, add the official Midjourney bot to your own Discord server, and enter the command “/imagine” followed by a description of what you want to generate. The bot takes care of the rest and quickly responds with an impressive image. By clicking on this link, you can access the platform and start using this incredible tool. If you want to add the bot to your own server, follow the steps indicated in this link.

Midjourney does have a drawback regarding access and privacy: because the tool is used through a public Discord server, all your creations are visible to the rest of the server's users, so the idea itself loses its privacy. To avoid this, you can purchase a paid subscription that provides private access to the bot. This does not change how the tool works; it only makes your creations confidential, existing only in your private chat.

Stable Diffusion

Stable Diffusion is an advanced text-to-image synthesis technique that uses latent diffusion models (LDMs) to create images from text. Diffusion models (DMs) are a type of generative model that take an input, such as an image, and gradually add noise until the image is unrecognizable. The model then learns to reverse this corruption, reconstructing the image to its original form and, in the process, learning how to generate images and other data.

Diffusion models have a problem: the most powerful ones require an immense amount of resources to train, consuming hundreds of GPU-days, which makes training them very expensive. To mitigate this cost, latent diffusion models operate in the latent space of powerful pre-trained autoencoders. This strategy reduces complexity while preserving detail, greatly improving the fidelity of the result. LDMs also add cross-attention layers to the model architecture, turning the model into a flexible generator for conditioning inputs such as text and enabling high-resolution, convolution-based synthesis.

In short, the Stable Diffusion tool builds a powerful text-to-image synthesis technique on latent diffusion models (LDMs), which are much more efficient to train while preserving detail and producing high-resolution images.
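Since Stable Diffusion's weights are publicly released, you can run the whole pipeline yourself. Here is a minimal sketch using the Hugging Face diffusers library, assuming a CUDA GPU and the runwayml/stable-diffusion-v1-5 checkpoint; the prompt is a placeholder:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly released Stable Diffusion checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")              # a GPU is strongly recommended

image = pipe(
    "a cabin in a snowy forest at dusk, concept art, dramatic lighting",
    num_inference_steps=50,         # reverse-diffusion denoising steps
    guidance_scale=7.5,             # how strongly to follow the prompt
).images[0]
image.save("cabin.png")
```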

A comparison between DALL·E 2, Midjourney, and Stable Diffusion

DALL·E 2 was trained on millions of stock images, making the images it generates more sophisticated and suitable for business use. It produces better results than Midjourney and Stable Diffusion when a scene contains more than two characters.

On the other hand, Midjourney is very good at generating images with an artistic style, producing results that are more similar to a painting or illustration than a photograph.

Stable Diffusion is an open-source model accessible to everyone and has a good understanding of contemporary artistic illustration, allowing it to create very detailed results. It is a great tool for generating complex illustrations from very descriptive text input, though it falls short when it comes to generating simpler images.

The images below show a comparison of what each tool generates with the same input prompt:

As you can see, these tools produce outstanding results; however, there is still much to learn about them. In this article, we have given a gentle introduction to what these tools are and what results we can obtain. In the following posts, we will do a deep dive into these tools and how we can experiment with our own pictures. If you are interested in the field of text-to-image or any machine learning topics, you can see more content on our website, LinkedIn, Twitter, or Instagram.
