How AI image generation works: A fun and beginner-friendly guide šŸ˜Š

Jishnu Hari · Published in Bootcamp · 7 min read · Jan 14, 2025
A human stick figure looking at a blank canvas and an AI stick figure looking at a noisy canvas

AI image generation might seem magical, but behind the scenes, itā€™s more like a determined child learning to draw, trying their best with every attempt. Letā€™s dive in.

1. The AI takes input from the user via the prompt text box.

For example, the user enters ā€œA banana inside a snow globeā€.

A sample prompt that reads ā€œA banana inside a snowglobeā€

2. The AI converts text into numbers

Computers only understand numbers, and since AI runs on computers, it needs a way to convert all that text into numbers in some logical manner. So, the AI feeds the text prompt into a text encoder, which analyzes the words and converts them into numerical representations called ā€˜embeddingsā€™.

A string of 0ā€™s and 1's
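To make the ā€œtext in, numbers outā€ idea concrete, here is a toy sketch in Python. The `toy_embed` function is invented for illustration and works nothing like a real learned encoder (such as CLIPā€™s); it only shows the shape of the process:

```python
# A toy "text encoder": maps each word to a small list of numbers.
# Real encoders learn these vectors from data; here we derive them
# deterministically from the word itself, purely to illustrate the idea.
import hashlib

def toy_embed(word, dims=4):
    """Turn a word into a list of `dims` numbers between 0 and 1."""
    digest = hashlib.sha256(word.lower().encode()).digest()
    return [b / 255 for b in digest[:dims]]

prompt = "A banana inside a snow globe"
embeddings = {w: toy_embed(w) for w in prompt.split()}
print(embeddings["banana"])  # four numbers between 0 and 1
```

Real text encoders learn their vectors from huge datasets so that related words end up with similar numbers; this toy version has no such structure.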

3. The AI gets to work starting with a noisy canvas

Just as we humans start with a blank canvas, the AI starts with a ā€˜noisy canvasā€™, which looks much like the static on an old television screen.

A noisy/gaussian blur canvas
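A minimal sketch of such a noisy canvas, using Pythonā€™s standard `random` module. (Real models start from Gaussian noise in a compressed latent space, not raw grayscale pixels; this is just the TV-static picture in code.)

```python
# Build a tiny "noisy canvas": every pixel is a random grayscale
# value between 0 (black) and 255 (white), like TV static.
import random

random.seed(0)  # fixed seed so the example is reproducible
WIDTH, HEIGHT = 8, 8
canvas = [[random.randint(0, 255) for _ in range(WIDTH)]
          for _ in range(HEIGHT)]
print(canvas[0])  # the first row: eight random pixel values
```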

4. Next, the AI reflects on its training journey. But how is the AI trained?

4a. TRAINING: The AI is fed a massive dataset of images with alt texts from the internet.

Sketch of images with their respective alt-texts

4b. TRAINING: The AI then passes these images through an image encoder that converts them into embeddings

It derives these embeddings from the pixels of the image, each consisting of three color values: red, green, and blue (RGB). Each color value ranges from 0 to 255.

Sketch of the color wheel and how each color pixel in an image corresponds to a number
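The point that an image is already numbers can be shown in a few lines of Python. The `tiny_image` below and its flattening are invented for illustration:

```python
# An image is already numbers: each pixel holds red, green, and blue
# values from 0-255. A real image encoder would compress this grid
# into a short embedding; here we just show the raw numbers.
yellow = (255, 225, 53)            # roughly banana-yellow
tiny_image = [[yellow, yellow],
              [yellow, yellow]]    # a 2x2 all-yellow "image"

# Flatten the grid into one long list of numbers, the raw form
# an encoder ingests: 2 * 2 pixels * 3 channels = 12 numbers.
flat = [channel for row in tiny_image
        for pixel in row
        for channel in pixel]
print(flat)
```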

4c. TRAINING: The alt texts are also passed through a text encoder that converts them into embeddings.

Sketch of a series of numbers

4d. TRAINING: The AI uses all this data to learn through a process called deep learning.

Just as we make sense of the real world by making comparisons and forming relationships, the AI tries to find patterns between the images and their alt texts and creates variables based on them. In AI lingo, this process is referred to as deep learning.

Sketch that represents the concept of deep learning in a simple manner

Imagine an AI is tasked with separating šŸŒbananas from šŸŽˆballoons. At first, it decides to measure yellowness as the key distinguishing feature. But then it encounters a yellow balloon. Suddenly, yellowness alone isnā€™t enough to differentiate between the two.

No problem, the AI thinks. It introduces a new variable: roundness. Now it has a two-dimensional space where round balloons cluster in one area and the more elongated bananas sit in another. It seems like the problem is solved, right?

But then comes a twist: what about a round banana or a balloon thatā€™s not perfectly round? Clearly, two variables arenā€™t enough. The AI needs more information. So, it considers shininess ā€” since balloons often have a glossy sheen, while bananas are typically matte.

Now the AI has a three-dimensional space, where objects are classified based on their yellowness, roundness, and shininess.

Sketch of a graph with roundness on the y-axis and yellowness on the x-axis. A red balloon, a yellow balloon, and a banana are placed on the graph.
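The banana-vs-balloon story can be written as a tiny hand-built classifier. The thresholds below are made up for illustration; the point is that each object becomes a point in a (yellowness, roundness, shininess) space:

```python
# A hand-built classifier using the three features from the story.
# Each object is a point (yellowness, roundness, shininess), all 0-1.
# The thresholds are invented, not learned.
def classify(yellowness, roundness, shininess):
    if roundness > 0.8 and shininess > 0.5:
        return "balloon"          # round and glossy
    if yellowness > 0.6 and roundness < 0.5:
        return "banana"           # yellow and elongated
    return "unsure"

print(classify(0.9, 0.2, 0.1))    # elongated, matte, yellow -> banana
print(classify(0.9, 0.95, 0.8))   # round, shiny, yellow -> balloon
```

Deep learning flips this around: instead of a human hand-picking `yellowness` and writing thresholds, the network discovers its own features and boundaries from the training data.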

But letā€™s take it a step further. What if we donā€™t just want to classify bananas and balloons? What if we want the AI to recognize everything ā€” from apples to airplanes, from cats to hats? Features like yellowness, roundness, and shininess arenā€™t enough to capture what makes each of these objects unique.

This is where deep learning comes in. Unlike traditional approaches that rely on humans to define features, deep learning algorithms learn these features on their own. As they process vast amounts of data, they automatically discover the variables ā€” like fur texture for cats or brim curvature for hats ā€” that are most relevant.

A sketch to represent the process of deep learning.

4e. TRAINING: The AI uses these variables to build a multi-dimensional space called the Latent Space.

Latent Space is an abstract mathematical space with hundreds of axes, each defined by a learned variable. Specific regions in this space correspond to properties like textures, colors, and compositions. The AI maps the text prompt to a specific point in the Latent Space based on what it has learned about associating words with images.

A sketch that represents a multi-dimensional Latent Space
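As a toy model of latent space, imagine a few concepts pinned at invented coordinates along just three axes (yellowness, roundness, shininess); a promptā€™s embedding then ā€œlandsā€ nearest the concept it describes:

```python
# Toy latent space: concepts as points along three invented axes.
# Real latent spaces have hundreds of learned axes, not three.
import math

latent_points = {
    "banana":  (0.9, 0.2, 0.1),
    "balloon": (0.5, 0.95, 0.8),
    "apple":   (0.1, 0.8, 0.4),
}

def nearest(point):
    """Return the stored concept closest to `point` (Euclidean distance)."""
    return min(latent_points,
               key=lambda name: math.dist(point, latent_points[name]))

print(nearest((0.85, 0.3, 0.0)))  # lands in banana's region
```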

4f. TRAINING: AI has no concept of shame. So, it freely makes mistakes and learns from them through a process called Diffusion.

AI introduces random noise to original images and then learns, step by step, how to adjust each pixel in the noisy version to reconstruct the original image. Itā€™s similar to how we might carefully study an image and attempt to replicate it perfectly through drawing or painting, refining the details with each stroke.
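A miniature sketch of this forward-diffusion idea in Python, on a four-pixel grayscale ā€œimageā€ (real models add carefully scheduled Gaussian noise in many small steps; the uniform jitter here is a stand-in):

```python
# Forward diffusion in miniature: repeatedly nudge each pixel by
# random noise until the original image is unrecognizable. A model
# trains by learning to undo each of these small steps.
import random

random.seed(42)  # fixed seed so the example is reproducible

def add_noise(pixels, strength=30):
    """One noising step: jitter each grayscale pixel, clamped to 0-255."""
    return [max(0, min(255, p + random.randint(-strength, strength)))
            for p in pixels]

image = [200, 200, 200, 200]  # a tiny 4-pixel "image"
noisy = image
for step in range(10):        # ten noising steps
    noisy = add_noise(noisy)
print(noisy)
```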

5. Based on its training, the AI starts rearranging the pixels according to the userā€™s prompt in a process called Reverse Diffusion.

The text embeddings (text that has been converted into numbers) guide the AI to a location in Latent Space corresponding to the prompt. Reverse Diffusion then translates that point into an image by iteratively removing noise.

A sketch of a pixelated banana
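A heavily simplified sketch of reverse diffusion: start from static and nudge every pixel toward a target that the prompt selected. Real models predict and subtract noise at each step; the `target` color and the linear 10% nudge here are invented for illustration:

```python
# Reverse diffusion in miniature: start from pure noise and, at each
# step, move every channel a little toward the "target" the prompt
# points to. This linear nudge is a stand-in for real denoising.
import random

random.seed(7)
target = [255, 225, 53]  # banana-yellow, the point the prompt selected
canvas = [float(random.randint(0, 255)) for _ in target]  # pure noise

for step in range(50):
    # Close 10% of the remaining gap to the target on each step.
    canvas = [c + 0.1 * (t - c) for c, t in zip(canvas, target)]

canvas = [round(c) for c in canvas]
print(canvas)  # very close to the target after 50 steps
```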

6. The AI generates one or more images, applies content credentials to each, and returns them to the user.

The images created by the AI match the original prompt but may not exist anywhere in the training data. And because of the slight randomness in the process, identical prompts can yield different images.

Sketch of 4 different iterations of a banana inside a globe by the AI
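That ā€œsame prompt, different imagesā€ behavior comes from the random starting noise. A toy sketch with a hypothetical `generate` function (the prompt is accepted but only the seed-driven noise is modeled):

```python
# Same prompt, different images: each run starts from different noise.
import random

def generate(prompt, seed):
    """Hypothetical generator: a real one would steer the output with
    the prompt; here only the seed-driven starting noise is modeled."""
    rng = random.Random(seed)  # a different seed gives different noise
    return [rng.randint(0, 255) for _ in range(4)]  # a 4-pixel "image"

a = generate("A banana inside a snow globe", seed=1)
b = generate("A banana inside a snow globe", seed=2)
print(a, b)  # identical prompt, different results
```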

Ethical Considerations: Navigating the Future

While understanding how AI works is crucial, we must also recognize its ethical implications.

Firstly, humans are biased, and those biases often seep into the data used to train AI, leading to biased outcomes that may be unfair or unrepresentative.

Secondly, is it fair for AI to be trained on human-created content without the creatorsā€™ consent or compensation, only to generate derivative works? These questions remain central to ongoing debates as we strive to balance innovation with fairness and accountability.

But like I always say, ā€œthe future is bright if we design it rightā€ :D

Glossary

Deep Learning: A process by which AI learns patterns and relationships directly from data, much as humans learn by recognizing patterns and making connections.

Diffusion: Much as humans learn and improve through trial and error, AI refines its understanding through an iterative process called Diffusion, where it incrementally adds noise to data, such as an image, and learns to recover the original.

Reverse Diffusion: This is the reverse of Diffusion where the AI starts with a noisy canvas and iteratively refines it to generate the ideal output based on the input prompt.

Image embedding: The process of converting image data into numbers that capture the imageā€™s essential features in a way that AI models can process and understand.

Text embedding: The process of converting text data into numbers that capture the essential details of the text in a way that AI models can process and understand.

Latent Space: Latent space can be compared to a humanā€™s intuition or thought process, where we synthesize information from multiple criteria to make sense of the world. For instance, when we think of something yellow, curvy, soft, and peelable, we instinctively identify it as a banana. Similarly, an AIā€™s latent space is where it stores many such variables, like color, shape, and peelability. This space allows the AI to identify patterns, make connections, and draw conclusions, enabling it to identify a banana based on what it looks like.

To learn more, check out
šŸ”— What is AI Art? How Art Generators Work (2024)
šŸŽ„ How AI Image Generators Work (Stable Diffusion / Dall-E)
šŸŽ„ Explained simply: How does AI create art?

šŸŽØ Huge thanks to Maitri Chauhan for the adorable illustrations ā€” they add so much charm!

šŸ™‹šŸ½ā€ā™‚ļø Letā€™s be friends! Connect with me on X and LinkedIn.

šŸ¤– Also check out my new FREE icon pack on Figma. Brownie points for finding the AI in there


Published in Bootcamp

From idea to product, one lesson at a time. Bootcamp is a collection of resources and opinion pieces about UX, UI, and Product. To submit your story: https://tinyurl.com/bootspub1


Written by Jishnu Hari

Product designer based in Seattle. Uncontrollably curious about humankind and the mind.
