Understanding Image Generator for a Newbie (Part 2)

Pranav Wankhedkar
4 min read · Jul 10, 2024


There are many ways to generate an image. As the diagram shows, this model uses a “Diffusion model”, meaning this particular algorithm relies on diffusion to generate images. What is diffusion, though? Let’s start with the dictionary meaning.

According to the Cambridge Dictionary, “diffusion” means “the action of spreading in many directions” and “the process of spreading through or into a surrounding substance by mixing with it”. The second part of the definition hints at how diffusion can be used to generate images.

Let’s consider an example before diving into how the model works. Imagine you’re in college and uncertain about your future education or career; you enrolled because your friend did. We can describe this uncertainty as “noise” in your mind. It’s hazy; you can vaguely see something, like a road ahead, but not much else. As you progress, you start to gain clarity about who you are and what you truly want.

So, from noise, we progress to a clearer picture by connecting the dots. Similarly, the diffusion model converts noise into an image by connecting these dots. Let’s explore how it works.

First Step — Text Encoding (as we can see in the figure)

Process:

  • You start with a text prompt, like “an astronaut riding a horse.”
  • The text encoder takes this prompt and transforms it into a format that the computer can understand, like turning words into numbers that represent the meaning of the phrase.

(Think of it as a translator: it converts your language into a language the computer understands — in this case, lists of numbers that capture the meaning of the words.)

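The text-encoding step can be sketched in a few lines of Python. This is only a toy stand-in — real systems use a trained encoder such as CLIP, and `encode_text` here is a made-up name for illustration. The point is simply “words in, numbers out”, with the same word always mapping to the same vector:

```python
import random

# Toy text encoder: maps each word of a prompt to a fixed-length
# vector of numbers. Real systems use a trained neural network;
# this only illustrates the idea "words in, numbers out".
def encode_text(prompt, dim=4):
    vectors = []
    for word in prompt.lower().split():
        rng = random.Random(word)  # same word -> same vector
        vectors.append([round(rng.uniform(-1, 1), 3) for _ in range(dim)])
    return vectors

embedding = encode_text("an astronaut riding a horse")
print(len(embedding))  # 5 (one vector per word)
```

A trained encoder differs in that nearby meanings get nearby vectors, which is what lets the diffusion model be guided by the prompt.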

Second Step — Initial Noise Generation

Process:

  • A Random Noise Generator (RNG, which you can see in the figure after the text encoder) creates an initial 64x64 patch of random noise (recall the example we saw earlier). This is just a bunch of pixels with random colors.

Note — We won’t get into the architecture of the RNG or how the noise is generated, as it is not the topic of discussion, but if you want to know more, go here.

(Imagine starting with a blank canvas but instead of being white, it’s filled with random static, like what you see on a TV with no signal.)

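Generating that 64x64 patch of static can be sketched like this (a minimal stand-in using Python’s standard library; real models draw the noise from a Gaussian distribution, which `random.gauss` mimics here):

```python
import random

# Create a 64x64 patch of random noise, like TV static.
# Each value stands in for one pixel (or latent) value.
def make_noise(size=64, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(size)] for _ in range(size)]

noise = make_noise()
print(len(noise), len(noise[0]))  # 64 64
```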

Third Step — Diffusion Model Iterations

  • The diffusion model takes the encoded text and the noise patch.
  • It then runs 50 iterations (the diagram shows loop*50, meaning this particular model refines the image 50 times), gradually turning the noise into an image that matches the text prompt.
  • In each iteration, the image gets clearer and more detailed.

(Think of the diffusion model as an artist. The artist looks at the static-filled canvas and, using the text prompt as a guide, starts to paint over it. With each stroke (or iteration), the picture becomes clearer and more recognizable.)

  • Iteration 1: Starting from random noise
  • Iteration 10: Starting to see some shapes
  • Iteration 30: More details and shapes appear
  • Iteration 50: A clearer and more defined image
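The shape of that 50-step loop can be sketched with a toy “denoiser”. This is not how a real diffusion model works internally — there, a neural network predicts and subtracts noise at each step — but it shows the pattern: start from noise and nudge it a little closer to the answer on every iteration (the target values here are invented for illustration):

```python
import random

# Illustrative refinement loop: at each of 50 steps, move the noisy
# values 10% of the way toward a "clean" target. A real diffusion
# model replaces this nudge with a learned noise-prediction network.
def refine(noisy, target, steps=50):
    x = list(noisy)
    for _ in range(steps):
        x = [xi + 0.1 * (ti - xi) for xi, ti in zip(x, target)]
    return x

rng = random.Random(1)
noisy = [rng.gauss(0, 1) for _ in range(8)]   # iteration 0: pure noise
target = [0.5] * 8                            # pretend "clean" pixels
result = refine(noisy, target)
```

After 50 steps the remaining deviation shrinks by a factor of 0.9^50 (about 0.5%), which is why the early iterations look like vague shapes and the last ones look like a finished image.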

Fourth Step — Decoding the Final Image

Process:

  • The final refined patch is sent to the decoder.
  • The decoder translates the refined patch into a high-resolution, detailed image.

(The decoder is like the final touch-up artist who adds the finishing details to make the image look realistic and polished.)

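The decoding step can be caricatured as “small patch in, big image out”. A real decoder is a trained network that reconstructs fine detail; this sketch just repeats each value to upscale a 64x64 patch to 512x512, purely to show the change in resolution (`decode` and the scale factor are illustrative assumptions, not part of any real model):

```python
# Toy "decoder": upscale a 64x64 patch to 512x512 by nearest-neighbour
# repetition. A real decoder is a trained network that adds detail;
# this only illustrates the resolution jump.
def decode(patch, scale=8):
    out = []
    for row in patch:
        big_row = [v for v in row for _ in range(scale)]
        out.extend([big_row] * scale)
    return out

patch = [[0.0] * 64 for _ in range(64)]
image = decode(patch)
print(len(image), len(image[0]))  # 512 512
```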

Full Process Recap

  1. Text Encoding:
  • Translate “an astronaut riding a horse” into a format the computer understands.
  2. Initial Noise Generation:
  • Create a random noise canvas.
  3. Diffusion Model Iterations:
  • Gradually refine the noise into an image over 50 steps, guided by the text.
  4. Decoding:
  • Final touch-ups to produce a high-resolution, detailed image.

By the end of this process, the computer has taken a simple text description and turned it into a detailed and contextually accurate image. This combination of text encoding, noise generation, iterative refinement, and decoding makes it possible to generate images from textual descriptions, much like how an artist creates a painting from an idea.
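Putting the four steps together, the whole pipeline fits in a few lines. Everything here is a toy stand-in (the function names and numbers are invented for this sketch; a real pipeline wires a trained text encoder, noise sampler, denoising network, and decoder together in the same order):

```python
import random

# End-to-end sketch of the four steps with toy stand-ins.
def encode_text(prompt):
    # words -> numbers (a real system uses a trained encoder)
    return [random.Random(w).uniform(-1, 1) for w in prompt.split()]

def make_noise(size, seed=0):
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(size)]

def refine(x, guidance, steps=50):
    # 50 refinement steps, "guided" by the encoded text
    target = sum(guidance) / len(guidance)
    for _ in range(steps):
        x = [xi + 0.1 * (target - xi) for xi in x]
    return x

def decode(latent, scale=8):
    # small latent -> larger output
    return [v for v in latent for _ in range(scale)]

prompt = "an astronaut riding a horse"
latent = refine(make_noise(64), encode_text(prompt))
image = decode(latent)
print(len(image))  # 512
```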
