How AI image generation works: A fun and beginner-friendly guide
AI image generation might seem magical, but behind the scenes, it's more like a determined child learning to draw, trying their best with every attempt. Let's dive in.
1. The AI takes in input from the user via the prompt text box.
For example, the user enters "A banana inside a snow globe".
2. The AI converts text into numbers
Computers only understand numbers, so the AI needs a way to convert all that text into numbers in some logical manner. It feeds the text prompt into a text encoder, which analyzes the words and converts them into numerical representations called "embeddings".
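To make this concrete, here's a minimal sketch of that conversion using the Hugging Face transformers library and OpenAI's CLIP text encoder, one common choice for this job (the specific model is an illustrative assumption, not necessarily what any particular image generator uses):

```python
# Minimal sketch: prompt -> token IDs -> embeddings, assuming the
# Hugging Face `transformers` library and OpenAI's CLIP text encoder.
from transformers import CLIPTokenizer, CLIPTextModel

model_name = "openai/clip-vit-base-patch32"  # illustrative model choice
tokenizer = CLIPTokenizer.from_pretrained(model_name)
encoder = CLIPTextModel.from_pretrained(model_name)

prompt = "A banana inside a snow globe"

# Split the prompt into tokens (sub-word pieces) and map them to IDs.
inputs = tokenizer(prompt, return_tensors="pt")

# The encoder turns each token into a vector of numbers: the embeddings.
embeddings = encoder(**inputs).last_hidden_state
print(embeddings.shape)  # (1, number_of_tokens, 512) for this model
```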
3. The AI gets to work starting with a noisy canvas
Just like how we humans start with a blank canvas, the AI starts with a "noisy canvas", which looks much like the static on an old television screen.
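You can conjure a noisy canvas yourself in a few lines. This sketch uses NumPy and Pillow; note that real diffusion models actually sample smooth Gaussian noise (often in a compressed latent space), but random RGB static conveys the same idea:

```python
import numpy as np
from PIL import Image

# A "noisy canvas": every pixel's R, G, B value is a random number from
# 0-255, which renders as static much like an old TV screen.
rng = np.random.default_rng(seed=42)
static = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
Image.fromarray(static).save("noisy_canvas.png")
```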
4. Next, the AI reflects on its training journey. But how is the AI trained?
4a. TRAINING: The AI is fed a massive dataset of images with alt texts from the internet.
4b. TRAINING: The AI then passes these images through an image encoder that converts them into embeddings.
It derives these embeddings from the pixels of the image. Each pixel consists of three color values: red, green, and blue (RGB), with each value ranging from 0 to 255.
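If you'd like to see those numbers for yourself, here's a tiny sketch using Pillow and NumPy (the file name is hypothetical):

```python
import numpy as np
from PIL import Image

# Load an image and view it as raw numbers: one (R, G, B) triple
# per pixel, each value between 0 and 255.
img = Image.open("banana.jpg").convert("RGB")  # hypothetical file
pixels = np.asarray(img)   # shape: (height, width, 3)
print(pixels[0, 0])        # the top-left pixel, e.g. a yellowish [231 198 52]
```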
4c. TRAINING: The alt texts are also passed through a text encoder that converts them into embeddings.
4d. TRAINING: The AI uses all this data to learn through a process called deep learning.
Just like how we make sense of the real world by making comparisons and forming relationships, the AI tries to find patterns between the images and their alt texts and creates variables based on them. In AI lingo, that process is referred to as deep learning.
Imagine an AI is tasked with separating 🍌 bananas from 🎈 balloons. At first, it decides to measure yellowness as the key distinguishing feature. But then it encounters a yellow balloon. Suddenly, yellowness alone isn't enough to differentiate between the two.
"No problem," the AI thinks, and introduces a new variable: roundness. Now it has a two-dimensional space where round balloons cluster in one area and the more elongated bananas in another. It seems like the problem is solved, right?
But then comes a twist: what about a round banana, or a balloon that's not perfectly round? Clearly, two variables aren't enough. The AI needs more information. So, it considers shininess, since balloons often have a glossy sheen while bananas are typically matte.
Now the AI has a three-dimensional space, where objects are classified based on their yellowness, roundness, and shininess.
But let's take it a step further. What if we don't just want to classify bananas and balloons? What if we want the AI to recognize everything, from apples to airplanes, from cats to hats? Features like yellowness, roundness, and shininess aren't enough to capture what makes each of these objects unique.
This is where deep learning comes in. Unlike traditional approaches that rely on humans to define features, deep learning algorithms learn these features on their own. As they process vast amounts of data, they automatically discover the variables that are most relevant, like fur texture for cats or brim curvature for hats.
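Deep learning discovers such features automatically, but the hand-crafted version from the story above is easy to sketch. Here's a toy classifier, using scikit-learn and entirely made-up feature scores, that separates bananas from balloons by yellowness, roundness, and shininess:

```python
from sklearn.linear_model import LogisticRegression

# Each object is scored 0-1 by hand on [yellowness, roundness, shininess].
# The numbers are invented purely to illustrate the idea.
X = [
    [0.9, 0.2, 0.1],  # banana
    [0.8, 0.3, 0.2],  # banana
    [0.7, 0.3, 0.1],  # banana
    [0.9, 0.9, 0.8],  # yellow balloon
    [0.1, 0.9, 0.9],  # red balloon
    [0.2, 0.8, 0.7],  # blue balloon
]
y = ["banana", "banana", "banana", "balloon", "balloon", "balloon"]

model = LogisticRegression().fit(X, y)

# A yellow, somewhat round, matte object: no single feature decides;
# the model weighs all three together.
print(model.predict([[0.8, 0.6, 0.1]]))
```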
4e. TRAINING: The AI uses these variables to build a multi-dimensional space called the Latent Space.
Latent Space is a multi-dimensional space with hundreds of axes, each defined by a different variable. Specific regions of this abstract mathematical space correspond to various properties like textures, colors, and compositions. The AI maps the text prompt to a specific point in this Latent Space based on what it has learned about associating words with images.
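A real latent space has far too many axes to picture, but the key intuition, that nearby points represent similar things, survives even in a made-up three-axis version:

```python
import numpy as np

def cosine_similarity(a, b):
    """How closely two directions in the space align (1.0 = identical)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# An invented 3-axis latent space: [yellowness, roundness, shininess].
banana  = np.array([0.9, 0.3, 0.1])
lemon   = np.array([0.9, 0.7, 0.3])
balloon = np.array([0.5, 0.9, 0.9])

print(cosine_similarity(banana, lemon))    # high: neighbors in the space
print(cosine_similarity(banana, balloon))  # lower: farther apart
```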
4f. TRAINING: The AI has no concept of shame. So, it freely makes mistakes and learns from them through a process called Diffusion.
The AI introduces random noise to the original images and then learns, step by step, how to adjust each pixel of the noisy version to reconstruct the original image. It's similar to how we might carefully study an image and attempt to replicate it through drawing or painting, refining the details with each stroke.
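That noising step can be written down in a few lines. This is a deliberately simplified sketch: the linear blend schedule below stands in for the carefully tuned schedules real diffusion models use:

```python
import numpy as np

def add_noise(x0, t, num_steps=1000):
    """Blend a clean image x0 toward pure static at step t.

    t = 0 returns the clean image; t = num_steps returns pure noise.
    (Simplified linear schedule, for illustration only.)
    """
    alpha_bar = 1.0 - t / num_steps
    noise = np.random.randn(*x0.shape)
    noisy = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return noisy, noise

# During training, the model sees (noisy, t) and must predict the exact
# noise that was added; its mistakes are the learning signal.
```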
5. Based on its training, the AI starts rearranging the pixels according to the user's prompt in a process called Reverse Diffusion.
The text embeddings (text that has been converted into numbers) guide the AI to a location in Latent Space corresponding to the prompt. Reverse Diffusion then translates that point into an image by iteratively removing noise.
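Reduced to its skeleton, Reverse Diffusion is just a loop. In this sketch, predict_noise is a hypothetical placeholder for the trained neural network, and the update rule is far cruder than the samplers real systems use:

```python
import numpy as np

def reverse_diffusion(predict_noise, text_embedding, shape, num_steps=50):
    """Start from static; repeatedly remove the noise the model predicts."""
    x = np.random.randn(*shape)  # the noisy canvas
    for t in reversed(range(num_steps)):
        # The network guesses what part of x is noise, given the prompt.
        noise_estimate = predict_noise(x, t, text_embedding)
        x = x - noise_estimate / num_steps  # peel away a little noise
    return x  # after the last step, x should resemble the prompt
```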
6. The AI generates one or more images, applies content credentials to each, and returns them to the user.
The images created by the AI match the original prompt but may not exist anywhere in the training data. And because the starting noise is random, identical prompts can yield different images.
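If you'd like to watch the whole pipeline run end to end, libraries such as Hugging Face diffusers wrap every step above into a few lines. This sketch assumes a multi-gigabyte model download and, realistically, a GPU; the model ID is illustrative and availability may vary:

```python
import torch
from diffusers import StableDiffusionPipeline

# Download a pretrained text-to-image model and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative model ID
    torch_dtype=torch.float16,
).to("cuda")

# Same prompt, different random noise -> a different snow globe each run.
image = pipe("A banana inside a snow globe").images[0]
image.save("banana_snow_globe.png")
```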
Ethical Considerations: Navigating the Future
While understanding how AI works is crucial, we must also recognize its ethical implications.
Firstly, humans are biased, and those biases often seep into the data used to train AI, leading to biased outcomes that may be unfair or unrepresentative.
Secondly, is it fair for AI to be trained on human-created content without the creators' consent or compensation, only to generate derivative works? These questions remain central to ongoing debates as we strive to balance innovation with fairness and accountability.
But like I always say, "the future is bright if we design it right" :D
Glossary
Deep Learning: Just as humans learn by recognizing patterns and making connections, AI learns through a similar process called deep learning.
Diffusion: Much like humans learn and improve through trial and error, AI refines its understanding through an iterative process called Diffusion, where it incrementally adds noise to data, such as an image, and learns to recover the original.
Reverse Diffusion: This is the reverse of Diffusion where the AI starts with a noisy canvas and iteratively refines it to generate the ideal output based on the input prompt.
Image embedding: The process of converting image data into numbers that capture the image's essential features in a way that AI models can process and understand.
Text embedding: The process of converting text data into numbers that capture the essential details of the text in a way that AI models can process and understand.
Latent Space: Latent space can be compared to a human's intuition or thought process, where we synthesize information based on multiple criteria to make sense of the world. For instance, when we think of something yellow, curvy, soft, and peelable, we instinctively identify it as a banana. Similarly, the latent space is where an AI stores its many variables, like color, shape, and peelability. This space allows the AI to identify patterns, make connections, and draw conclusions, enabling it to identify a banana based on what it looks like.
To learn more, check out
📄 What is AI Art? How Art Generators Work (2024)
🎥 How AI Image Generators Work (Stable Diffusion / Dall-E)
🎥 Explained simply: How does AI create art?
🎨 Huge thanks to Maitri Chauhan for the adorable illustrations; they add so much charm!