GANs: How the magic happens

Mindscope Academy
5 min read · Nov 13, 2023

--

In the realm of artificial intelligence, two groundbreaking technologies have been making waves: Generative Adversarial Networks (GANs) and Multimodal AI (see last week’s blog for Multimodal AI). While GANs are primarily known for their ability to generate realistic images, Multimodal AI excels at analyzing and interpreting them. As you might expect, the two are related: they share underlying principles that make them complementary. In this blog, we’ll delve into the intricacies of GANs and explore how they work, and how the magic happens.

What are GANs?

Generative Adversarial Networks are a class of artificial neural networks that consist of two main components: a Generator and a Discriminator. The Generator creates synthetic data (like images), while the Discriminator evaluates the generated data, distinguishing between real and fake samples. The two networks are trained in tandem, with the Generator striving to produce data so convincing that the Discriminator can’t tell it apart from real data. This process is why they’re named “Adversarial”.

How GANs Work

Let me try to describe the process in a way that everyone can understand, because if you already know what a ReLU function is, you surely don’t need to read this blog.

Initialization

We start with the Initialization process. To get started, we need two things: Random Noise and a Generator Network.

Random Noise: This is essentially an array of random numbers that serves as the initial input to the Generator network. Think of it as a bag of marbles. Each marble has a random number inside it. These marbles are the “Random Noise” that kick-starts the whole process.
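If you prefer to see the bag of marbles as code, here is a minimal sketch in Python (the noise size of 100 is just a common illustrative choice, not a requirement):

```python
import numpy as np

# A "bag of marbles": 100 random numbers drawn from a standard normal
# distribution, which is the usual starting point for a GAN's Generator.
rng = np.random.default_rng(seed=0)
noise = rng.standard_normal(100)

print(noise.shape)  # (100,)
```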

Generator Network: This is a neural network trained to transform the random noise into a synthetic image. Initially, the Generator is not very good at this task, but it improves over time, and we’ll see how. The Generator is like an Art Machine: it attempts to create images using all the marbles. You know those pictures built up from many smaller images? It’s something similar to that.
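To make the Art Machine concrete, here is a toy, untrained sketch. All layer sizes are hypothetical and real Generators are much deeper; this one just maps 100 noise numbers to a 28×28 grayscale “image”:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 100 noise numbers in, one 28x28 grayscale image out.
W1 = rng.standard_normal((100, 128)) * 0.02      # hidden layer weights
W2 = rng.standard_normal((128, 28 * 28)) * 0.02  # output layer weights

def generator(z):
    h = np.maximum(0.0, z @ W1)  # ReLU hidden layer
    img = np.tanh(h @ W2)        # Tanh squashes pixels into [-1, 1]
    return img.reshape(28, 28)

fake_image = generator(rng.standard_normal(100))
print(fake_image.shape)  # (28, 28)
```

Untrained weights like these produce exactly the kind of noisy blob we’ll meet again in the Output step.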

Forward Pass

After Initialization comes the Forward Pass. This is the process of making the art. How does the generator create the images?

Transformation: The noise vector (read: marbles) is fed into the Generator, where it undergoes a series of transformations. These transformations are mathematical operations defined by the neural network’s architecture and parameters. In other words, the Generator starts doing some wizardry: changing the numbers around, adding them up, and doing all sorts of calculations. Magic, for most of us.

Activation Functions: After all the wizardry in the transformation phase, guess what? More wizardry. As the noise passes through each layer of the Generator, it undergoes various activation functions with funny names like ReLU (Rectified Linear Unit) or Tanh (Hyperbolic Tangent). All this wizardry adds non-linearity to the transformation process, which is to say that they make sure the numbers inside the marbles turn into something really cool and not just a boring old scribble.
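The two funny-named spells are actually one-liners. Here is what ReLU and Tanh do to a number:

```python
import numpy as np

def relu(x):
    # ReLU: negative numbers become 0, positive numbers pass through unchanged
    return np.maximum(0.0, x)

def tanh(x):
    # Tanh: squashes any number into the range (-1, 1)
    return np.tanh(x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))      # [0. 0. 3.]
print(tanh(0.0))    # 0.0
```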

Upsampling: The generator, or art machine for you and me, takes the tiny doodle it started with and makes it bigger and better, step by step, until it’s a full-sized picture. It uses techniques like transposed convolutions to upsample the noise vector, or to put it simply, it uses wizardry.
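Real GANs learn their upsampling with transposed convolutions, but the simplest possible version of “make the doodle bigger” is nearest-neighbour upsampling, which just repeats each pixel:

```python
import numpy as np

def upsample_nearest(img, factor=2):
    # Repeat each pixel along both axes, e.g. a 4x4 doodle becomes 8x8.
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

doodle = np.arange(16.0).reshape(4, 4)  # a tiny 4x4 "doodle"
bigger = upsample_nearest(doodle)
print(bigger.shape)  # (8, 8)
```

A transposed convolution does the same job but with learned weights, so the network can decide *how* to fill in the new pixels instead of just copying.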

After all the spells are cast, we get, as you would expect, an Output.

Synthetic Image: At the end of the forward pass, the Generator outputs a synthetic image. Initially, this image is likely to be of poor quality and not resemble any meaningful form. You will be tempted to look at the blob (read: synthetic image) and think that the art machine is broken or the spells didn’t work, but that’s not the case; we simply haven’t gone through the training and feedback phase yet.

Remember the “Adversarial” in GANs?

Training and Feedback

During training, the Discriminator gets to see plenty of real images, so it develops a good grasp of what the real thing looks like. Now it’s time to evaluate how good a job the art machine is doing and provide feedback so it can improve.

Discriminator Evaluation: The synthetic image is then passed to the Discriminator network (the Judge, for us normal people with normal lives) along with real images. The Discriminator’s (read: Judge’s) job is to classify which images are real and which are generated.
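A toy Judge can be as simple as a logistic regression over the pixels: it squashes a weighted sum into a score between 0 (“fake”) and 1 (“real”). The weights here are random and untrained, purely for illustration:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
w = rng.standard_normal(28 * 28) * 0.01  # the Judge's weights (toy, untrained)

def discriminator(img):
    # Score in (0, 1): close to 1 means "looks real", close to 0 means "fake"
    return sigmoid(img.ravel() @ w)

score = discriminator(rng.standard_normal((28, 28)))
print(0.0 < score < 1.0)  # True
```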

Backpropagation: Based on the Judge’s 🙂 feedback, the Generator adjusts its parameters using a technique called backpropagation. This involves computing the gradient of the loss function with respect to each parameter and updating it in a direction that minimizes the loss. Just ignore that last sentence: The art machine listens to the Judge and learns from its mistakes. It tweaks some knobs and dials inside itself to get better at drawing.
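The knob-tweaking part is gradient descent. Here it is with a single knob and a made-up loss (how far the knob is from 3), just to show the update rule; real GANs do this for millions of knobs at once:

```python
# One "knob" (parameter) tuned by gradient descent, the rule behind
# backpropagation. Toy loss: (theta - 3)^2, so its gradient is 2*(theta - 3).
theta = 0.0
lr = 0.1  # learning rate: how big each tweak is

for _ in range(100):
    grad = 2 * (theta - 3.0)  # gradient of the loss w.r.t. the knob
    theta -= lr * grad        # tweak the knob downhill

print(round(theta, 3))  # 3.0
```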

Iteration

Just do it again, and again, and again, and again; to you and me it seems to go on ad infinitum.

Loop: The steps above are repeated millions of times (or more) until the art machine (aka Generator) becomes proficient at generating realistic images from random noise.
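Putting all the pieces together, here is the whole loop for the tiniest GAN I can think of: the “images” are single numbers near 3, the art machine is a*z + b, and the Judge is a one-weight logistic classifier. Every size, learning rate, and step count here is an illustrative assumption; real GANs use deep networks and an autograd library instead of these hand-derived gradients:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)

# Toy setup: "real data" are numbers near 3; the Generator is a*z + b.
a, b = 1.0, 0.0   # Generator knobs
w, c = 0.1, 0.0   # Discriminator (Judge) knobs
lr, batch = 0.05, 64

for step in range(2000):
    z = rng.standard_normal(batch)
    real = 3.0 + 0.5 * rng.standard_normal(batch)
    fake = a * z + b

    # --- Judge update: push real scores toward 1, fake scores toward 0 ---
    ds_real = sigmoid(w * real + c) - 1.0  # BCE gradient at real logits
    ds_fake = sigmoid(w * fake + c)        # BCE gradient at fake logits
    w -= lr * np.mean(ds_real * real + ds_fake * fake)
    c -= lr * np.mean(ds_real + ds_fake)

    # --- Art machine update: try to make the Judge say "real" ---
    ds = sigmoid(w * fake + c) - 1.0       # non-saturating generator loss
    dfake = ds * w
    a -= lr * np.mean(dfake * z)
    b -= lr * np.mean(dfake)

print(round(b, 2))
```

If all goes well, b (the mean of the generated numbers) drifts from 0 toward 3, though toy GANs like this one love to oscillate rather than settle neatly.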

The Masterpiece

The last phase of all this wizardry, for which there really is no better name, is Convergence. This is when the final output is generated. After sufficient training and feedback, the Generator becomes capable of transforming random noise into realistic images that are almost indistinguishable from real ones, at least to the Discriminator’s eyes.

You’re the ultimate Judge, so only you can say whether what the art machine generates is so good and realistic that it can be called Magic. But remember, it’s not just the art machine that is learning; the Judge is learning too.

I’m sure you knew the image on this blog was generated by a GAN. Is it Magic?

Challenges and Considerations

  • Ethical Concerns: The ability of GANs to generate realistic images raises ethical questions, especially when used to create deepfakes.
  • Computational Costs: Both GANs and Multimodal AI require significant computational resources, making them expensive to train and deploy.
  • Data Privacy: The use of real-world data for training poses privacy risks that must be carefully managed.

Conclusion

GANs and Multimodal AI are two powerful technologies that have revolutionized image generation and analysis. While they may seem unrelated, their underlying principles make them complementary, offering exciting possibilities for future research and applications. By understanding the capabilities and limitations of each, we can better harness their potential to create more intelligent and versatile AI systems.

So, the next time you marvel at a hyper-realistic image generated by a GAN or read an accurate description provided by a Multimodal AI system, remember that these two technologies are not just standalone marvels but part of a broader ecosystem that’s pushing the boundaries of what’s possible in AI.


Mindscope Academy

An online learning academy focused on Online Education, AI technology and tools, Cryptocurrency and Blockchain.