Generating Images: An Introduction to GANs by Building One

Abhinav Menon · Published in Visionary Hub · 31 min read · Jan 9, 2022

They say that imitation is the finest form of flattery…

… and the same applies to computers.

What we’ll cover today

Introduction to this Article

  • Rundown of AI
  • Rundown of GANs

Overview of GANs

  • History
  • Practical metaphor

GAN Use Cases

  • Style GAN
  • Cycle GAN
  • Other GANs
  • Ethics of GANs

Introduction to our GAN Model

  • Architecture
  • Datasets
  • Libraries

Building a GAN

  • Discriminator
  • Generator
  • Combining models
  • Getting Our CIFAR Dataset Ready
  • Getting fake samples
  • Training and summarization
  • Control panel
  • Once you’re done: Running your model

Final Thoughts

  • Best practices to consider
  • Issues in GANs
  • Takeaways
  • Conclusion
  • References

Introduction: Creation is Hard, so is Imitation

It’s no secret that computers, at their simplest, are ignorant. They have no sense of context, no intent and no intelligence. But that doesn’t really matter, because they are surprisingly capable: their sheer resources give them the ability to improve their skills at an aggressive rate given the proper parameters.

A rundown of AI

It’s just that, until now, the ability to quantify abstract data and explain the relationships between different factors has been difficult. Since this is the crux of many of the “intelligent” tasks we want computers to do, it is an area that sees innovation at a rapid rate. Then, using those factors to create something brand new is yet another challenge that must be faced. Ultimately, imitation implies that you aren’t making a carbon copy but instead taking inspiration and key relationships (like ideas, colours, etc.) and using those to create a similar but distinctly unique piece of work.

Pretty close, but not the same! | Image source

Outlining these relationships is where the power of AI comes in. Artificial intelligence is not an all-encompassing, data-hungry computer-brain mixture but a framework to quantify abstract concepts and guide decision making based on the information it’s given.

However, that also brings up a question about what exactly intelligence is. Is a human smart because we are able to create brand new things, or because we have the tools that allow us to create new things? Ultimately, some might say, most of the “novel” or “unique” things we create are based on our relationships with nature (like math or science) or on other people’s works and our personal experiences. This is specifically important because it is the same principle that AI operates under, meaning there is a way to claim that artificial intelligence is a manifestation of what we do, just using different standards.

Rundown of GANs: A really ‘smart’ computer

So we briefly touched on what artificial intelligence is and how computer models aggressively learn, but what does that actually mean in terms of creating a brand new image (which is ultimately what this article is about)? To illustrate this, we can compare our model to a kid and a teacher at school.

The way we are going to achieve image generation is through the power of feedback. The simplest rundown is that the student will create something, which will then be evaluated by the teacher, who gives feedback with a simple yes or no. This might sound like poor feedback, but following this process many times yields really impressive results.

What I just outlined is the basis of a Generative Adversarial Network: a zero-sum game that is a big win for the artificial intelligence field.

GANs: Generative Adversarial Networks

History

GANs are a pretty new concept. Originally proposed by Ian Goodfellow in 2014 with the paper Generative Adversarial Networks, a concept that had barely been considered in the past took off like a rocket.

Year by year the technology improved considerably, to the point where it became so sophisticated that the images no longer had any obvious indication of being fabricated.

Evolution of GANs, both the clarity and sizes of images have changed significantly | Image source
Which ones are fake? Answers below… | Image source

A practical metaphor

If we go back to the example of a school child: at the beginning, the child will, as expected, have no context for what it is meant to do, so it will probably make some random scribbles and submit these to the teacher. The teacher, who will have previously reviewed real works, can compare the student’s work against those real images and come to the likely conclusion that the student’s work is not the real thing. It’s important to note that the teacher and student learn at the same time, meaning that although the teacher benefits from seeing real images every round, the tools required to distinguish between those and the student’s submission are pretty rudimentary to begin with. Also keep in mind that the real works and the student’s works are labelled in such a way that the teacher doesn’t know whether a submission is from the dataset of real images or from the student.

P.S.: If you picked cats 1, 3, 7 and 8 as fakes, you’re right.

P.P.S.: If you picked 2, 4, 5, 6 and 9, you’re also right; they’re all fake 😉

Over time, both the student and the teacher get better at their jobs. This happens through lots and lots of repetition. Repetition means that at some point during the process described above, something of great importance will occur: the teacher will validate one of the student’s submissions. Obviously, to us this will still look like mindless gibberish, but the teacher validating the student’s submission (meaning the teacher thinks it’s part of the dataset of real items) indicates to the student that they have done something right that they should repeat in the next round. However, after evaluating all the submissions, when the teacher finds out that they incorrectly assessed the student’s work, the teacher will also understand that next time they should look out for that kind of thing.

Slowly, over time, this results in the student understanding the characteristics that please the teacher. If this whole process were about creating images of cats, the student would come to understand that every submission must have two eyes, two ears and the distinct facial features that every cat possesses.

What I described above is the process that training a GAN requires. The student is the generator while the teacher is the discriminator. When they work together they create the model that we consider to be a GAN. And this isn’t far off from what humans do too. When asked to draw a cat, there are specific features that we consider, specific relationships between elements (like the size of a cat’s eye relative to its whole face) and specific colours we attribute to it. These don’t just come from our imagination but are instead a product of the many experiences where we have seen an image of a cat or interacted with one in real life. The only differentiating factor is that our ability to pick up on these relationships is significantly more streamlined and sophisticated.

The GAN Use Case

GANs are no doubt impressive. But that also raises the question: what purpose do they serve? Companies can’t just be pouring huge quantities of resources into them for nothing…

To answer this question, we are going to look at two of the most sophisticated areas for implementation.

P.S.: if you want to see GANs in action, go to thispersondoesnotexist.com

Image generation with StyleGAN

Let’s start with the most straightforward and probably most well known GAN use case: image generation. Anyone getting into GANs usually starts off with image generation. The beauty of this area is how versatile the subset is. It can be anything from small image generation (something we’ll cover in quite a lot of depth in the following parts of this article) to ultra-realistic, high-resolution images. For this specific instance we will look at StyleGAN.

StyleGAN | Image source

StyleGAN is a highly sophisticated GAN from NVIDIA designed for human face synthesis. The first iteration of the technology was so impressive that it was followed up with StyleGAN2, which is what thispersondoesnotexist.com is based on. The code is open for anyone to run, and the approach became so successful that in 2021 there was even a StyleGAN3.

StyleGAN’s success can be summarized by five unconventional approaches and the sophistication of the generator’s architecture.

5 steps for Style [GAN]

  1. Progressive Image Synthesis: To get the quality of the image so high while also retaining a considerable size (1024x1024 pixels), the GAN was designed to slowly increase its load. StyleGAN started its journey by generating 4x4 images (really small), but by starting so simple, it was able to master this size very thoroughly. Once StyleGAN mastered a size, it would double the length and width and restart the process. This meant it went from generating 4x4 images to generating 8x8, then 16x16 and so on until 1024x1024. More on that: https://arxiv.org/abs/1710.10196
  2. Different Sampling Procedure: Instead of upsampling with transpose convolutional layers (something we’ll talk about later on), it relies on resize-based upsampling, the major difference being that the upsampling layers are bilinear rather than plain nearest-neighbour layers. See more: https://arxiv.org/abs/1812.04948
  3. Generating a Style Vector: StyleGAN takes a sample from latent space and passes it through an 8-layer mapping network to produce a style vector. This is incorporated into the sampling procedure as a bias that affects the outputted image.
  4. Avoids Random Inputs: Traditional GANs use samples from latent space (noise) as the basis for the generated image. StyleGAN instead uses a standard 4x4x512 input that stays constant and is manipulated later on to compensate for the missing randomness and provide image diversity.
  5. Extra Noise: Gaussian noise is added to all parts of the image. The noise is random but results in an added level of fine detail by the end.

Results

This technology has resulted in ThisPersonDoesNotExist.com, an example made by Uber engineer Philip Wang in 2019 to illustrate its capabilities. It has also led to things like NVIDIA’s text-to-image synthesis site. Although built on a different GAN type (GauGAN), it is a combination of natural language processing and GANs where a user can input text and get an output of what the model thinks that text looks like.

NVIDIA Text-To-Image | Image source

Image substitution with GANs

Image substitution (aka image translation) is a unique process where the GAN is required to not only generate an image but also determine what goes in a specific area based on the surrounding context. Basically, instead of creating a brand new piece of work, it creates a modified version of an already existing image that meets certain parameters. This has a lot of cool applications regarding the manipulation of images, which we will talk about. One popularized example of an image translation GAN is CycleGAN.

A couple examples of CycleGAN | Image source

CycleGAN is an image translation model where the objective is to create a modified image. With this being said, one of the limitations that needed to be addressed at the time of CycleGAN was the requirement of a paired dataset. Like most models, GANs in general require a large training dataset, often tens of thousands to millions of instances. Creating a paired dataset in this example would mean having a before and after image for each of those instances. Since in simple terms the goal is to start with something (before) and change it into something else (after), traditionally the GAN would need a huge dataset full of instances where this process has already occurred so it can learn from them.

CycleGAN’s unpaired image translation approach was the solution to this. Instead of trying to modify the architecture of a traditional GAN (one generator and one discriminator), the approach was to add another generator and discriminator. This architecture allows for the simultaneous training of two pairs of generators and discriminators.

The architecture

  1. In CycleGAN the first generator takes images from a pool as its input and then outputs an image that goes into the second pool. The second generator takes images from the first generator as input and generates a new set of images which then go back into the first pool, repeating the cycle. In the midst of this the discriminators work to evaluate the plausibility of the generated images. All of this is a pretty creative workaround but still doesn’t suffice in terms of translating between two different kinds of images.
  2. To take this to the next level, the whole GAN is equipped with a concept of cycle consistency. Cycle consistency in this context suggests that if A → B then B → A. For CycleGAN that means that if generator one samples from pool one and creates an image as an output, then generator two must be able to take that output as its input and recover what generator one sampled to begin with. If you want to think of it another way, this is the same logic that translators aim for: if a statement translated from English to Mandarin is then translated back to English, the result should stay the same.
  3. This concept is achieved using a special cycle-consistency loss function that finds the difference between the input image of the first generator and the output image of the second generator, often using the L1 norm of the pixel values, and adjusts the two generators slightly so they become more and more accurate (see the sketch after this list).
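
To make that loss term concrete, here is a minimal sketch of a cycle-consistency penalty, assuming TensorFlow; the function and argument names are illustrative, and the weighting factor of 10 is a commonly used value from the CycleGAN paper rather than a requirement.

```python
import tensorflow as tf

def cycle_consistency_loss(real_image, cycled_image, weight=10.0):
    # cycled_image is the result of passing real_image through both generators
    # (A -> B -> back to A); the L1 (mean absolute) difference penalizes the
    # pair of generators whenever the round trip fails to recover the original.
    return weight * tf.reduce_mean(tf.abs(real_image - cycled_image))
```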

Results

The results of this process are nothing short of amazing. Some examples include style transfer, which lets you render an image in the style of a famous artist; object transfer, which changes one object into another closely related object (apples to oranges, zebras to horses); colour transfer, turning black-and-white images into colour images; photo enhancement; and much more.

Another major breakthrough for CycleGAN-type architecture came with the implementation of the “magic eraser” (spoiler: it wasn’t magic) feature on Google Pixel devices. This allowed Pixel 6 users to remove any object they wanted from a photo they captured, and the gap would then be filled with what the computer thought should be there.

Google’s Magic Eraser likely uses GANs | Image source

Other GANs to look into

  • MuseGAN: Develops short music samples. Site
  • BigGAN: High-tech image generation GAN similar to StyleGAN but for other objects (like flowers, burgers, etc.). Site
  • Pix2Pix: Sketch-to-image GAN that turns your drawings into realistic images. Site
  • This ____ does not exist websites to go to: Art, Cat, Horse, Chemical
  • Aforementioned NVIDIA Text-to-Image GAN. Site

BigGAN is super impressive! | Image source

Ethics of GANs

Whenever we are faced with technology of this nature, technology that can significantly change the way we as humans interact with the real world, we must take great caution. Here are two dilemmas we do not have solutions for yet:

Misinformation and disinformation are already huge problems in current society, and the easier and more indistinguishable doctored content is from real content, the bigger the problem. There are already phenomena like deepfakes that can take someone’s face and superimpose it onto another person, meaning that whatever that person does will be misattributed unjustly. Things like this are inevitable with the sophistication we have reached even today (any photo from thispersondoesnotexist can be a fake profile picture for a malicious account, without any traceability on the internet since the generated images are promptly discarded), and there are no clear-cut solutions.

There’s also the big dispute over how this will impact real people. For example, a (plausible, in the current situation) breakthrough in music generation could be detrimental to instrumental artists, and so on. Furthermore, there is the question of who owns what the GAN creates. Is it the work of the programmer of the GAN, or the work of the people who created the dataset that the GAN was trained on?

Without either of these parties the final product would not be possible, so in turn is it owned by both of them? But would that be as unreasonable as saying that paint manufacturers should have ownership over paintings made with their paint? Or maybe neither of them, because it’s the creation of a machine…

These questions don’t have answers, but with changing technology will come changing standards. For now, it’s on to the next part, where you will get to apply what you know with a guided GAN creation.

Introduction to Small Images with GANs

Congratulations on getting here. Now that we have a brief introduction to GANs, who made them and who’s working on them, what they can do and why this is impactful on society, it’s time to look at the HOW of GANs by building one ourselves.

Like we discussed above, the most straightforward approach to learning about GANs comes with image generation. Every GAN (including this one) requires a GAN architecture, a data set and libraries to execute the code.

GAN architecture

A GAN architecture is the process our model must follow during training to reach its goal. This is the approach to training that we often look at for reference in sophisticated GANs like StyleGAN and CycleGAN.

The architecture is also the point of reference we most often scrutinize and develop on to increase the performance and abilities of the GAN.

We will be creating a small image generator which will need a couple of different parts, mainly:

  • CNN generator
  • CNN discriminator
  • A function that combines the generator and discriminator
  • A dataset that we can load and select images from
  • A function to sample points in latent space
  • A way to evaluate what our GAN is doing that is meaningful to us
  • And a set of instructions to successfully train our GAN

Dataset

In this specific instance we are going to work with the CIFAR-10 dataset, a collection of 60,000 32x32 colour images. It is based on the Tiny Images database, which was sorted and labelled by Alex Krizhevsky. This is an extremely valuable type of dataset because of its easy manipulability, simplicity and consistency, which all help simple models. It lets us grasp the valuable parts of image classification and generation while not having to deal with preprocessing and data sorting / cleaning, which is not really relevant for us.

The CIFAR-10 dataset is split into 10 classes of images, those being: Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship and Truck.

Examples from the CIFAR-10 Dataset | Image source

The class that an instance of this dataset belongs to is not something the GAN needs to know, but it will help us identify what our generated images are after completion.

An important thing to consider is that these are colour images, meaning there are three channels (red, green and blue), each with its own 32x32 array, which together make up one colour image.

Really tiny, only 4 pixels in a 2x2 orientation

To illustrate this, imagine a really tiny Microsoft logo made of only four pixels. Quantifying this would look like the following.

We can see that the 2x2 image is represented by its RGB values by creating three separate arrays where pixel values (from 0 to 255) are stored in their respective places.

Now imagine this but for a 32x32 image, where 3 separate arrays of 1,024 pixels each have to be stored, and you get one instance in the CIFAR-10 database.
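
As a hedged illustration of that layout (the exact pixel values here are made up), a four-pixel colour image can be written as three 2x2 channel arrays stacked into one 2x2x3 array, exactly the way a CIFAR-10 image is three 32x32 arrays:

```python
import numpy as np

# Illustrative 2x2 "logo": top-left red, top-right green,
# bottom-left blue, bottom-right yellow
red_channel   = np.array([[255,   0],
                          [  0, 255]])
green_channel = np.array([[  0, 255],
                          [  0, 255]])
blue_channel  = np.array([[  0,   0],
                          [255,   0]])

image = np.stack([red_channel, green_channel, blue_channel], axis=-1)
print(image.shape)  # (2, 2, 3); a CIFAR-10 image is (32, 32, 3) in the same way
```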

Libraries

Since this is a simple GAN, we will only be employing a few well known libraries for some key tasks. In this project we will be using Keras (+TensorFlow), NumPy and Matplotlib.

Keras: Keras is a high-level API for TensorFlow that is specifically focused on machine learning (and deep learning) and really helps with productivity and simplicity. Their official resources state, “Being able to go from idea to result as fast as possible is key to doing good research.”

TensorFlow: TensorFlow is an open-source machine learning library designed by Google which excels at building models and provides flexibility for machine learning applications.

NumPy: NumPy (Numerical Python) is an important library for working with arrays (specifically multidimensional / n-dimensional ones). This will be especially useful when manipulating images.

Matplotlib: Matplotlib is a library dedicated to visualizations. It can make images [something very useful for image generation ;) ] and graphs. This will mostly be used at the last stage to turn our model’s efforts into a tangible product.

Building a CIFAR-10 GAN

We will be building a GAN using Python. Ideally you would do this in Google Colaboratory (and in an .ipynb file), because its GPUs are usually much faster than running locally.

Libraries

Here we are just importing all the necessary modules from the libraries we discussed earlier. Although we will go into more depth about the purpose of each function later.
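
The original import cell isn’t reproduced here, but based on the functions discussed throughout the article, a reasonable set of imports looks something like the following sketch; add or remove pieces to match your own code.

```python
from numpy import ones, zeros
from numpy.random import randn, randint
from matplotlib import pyplot
from tensorflow.keras.datasets.cifar10 import load_data
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import (Dense, Reshape, Flatten, Dropout,
                                     Conv2D, Conv2DTranspose, LeakyReLU)
from tensorflow.keras.optimizers import Adam
```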

Discriminator

The discriminator is defined as a sequential CNN. If we print the process our CNN is following by calling model.summary(), we get a layer-by-layer listing of the output shapes, which is a useful sanity check (a hedged sketch of the whole model is shown below).
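
Below is a minimal sketch of a discriminator consistent with the description in this section: a 32x32x3 input, strided Conv2D downsampling to a 4x4x256 feature map, LeakyReLU activations, Flatten, Dropout(0.4), a single sigmoid output, and binary cross-entropy with Adam. The exact filter counts, the LeakyReLU slope of 0.2 (the prose describes 0.02) and the Adam settings are assumptions, not necessarily the author’s exact values.

```python
def define_discriminator(in_shape=(32, 32, 3)):
    model = Sequential()
    # downsample 32x32 -> 16x16
    model.add(Conv2D(128, (3, 3), strides=(2, 2), padding='same', input_shape=in_shape))
    model.add(LeakyReLU(0.2))
    # downsample 16x16 -> 8x8
    model.add(Conv2D(128, (3, 3), strides=(2, 2), padding='same'))
    model.add(LeakyReLU(0.2))
    # downsample 8x8 -> 4x4, leaving a 4x4x256 feature map
    model.add(Conv2D(256, (3, 3), strides=(2, 2), padding='same'))
    model.add(LeakyReLU(0.2))
    # classify: flatten to 4,096 values, regularize, then one sigmoid output
    model.add(Flatten())
    model.add(Dropout(0.4))
    model.add(Dense(1, activation='sigmoid'))
    # compile with binary cross-entropy, Adam and an accuracy metric
    opt = Adam(learning_rate=0.0002, beta_1=0.5)
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    return model

model = define_discriminator()
model.summary()   # prints each layer and its output shape
```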

Downsampling:

In our discriminator, our goal is to turn a 32x32x3 input into a binary output (true or false). To do this we are going to downsample our image using the keras.layers.Conv2D() function and create a feature map as the output. The layers are applied in the order they are added to the model, thanks to the Sequential() API.

The four arguments that we use throughout all our Conv2D() calls are: filters, kernel size, strides and padding.

The purpose of the filters is to identify features within the incoming image. In our function, the integer that is the first argument indicates the number of convolutional filters in our layer. So if the argument is the integer 128, we know there are 128 convolutional filters whose goal is to identify features in an image. Each of these filters starts off with random values (this can be changed with an additional kernel_initializer argument, although that’s not necessary here) which slowly get changed through the training process to minimize the loss the model faces. This is an automatic process that comes with time and is not something that we can specify or force onto the filters.

Here is an example of 49 convolutional filters that look for specific relationships | Image source

Next we have kernel size. Kernel size indicates the dimensions, and in turn the area, that should be covered and downsampled (this is known as the convolutional window). If the numbers in the brackets look like (3,3), then an area of nine pixels in a 3x3 window is combined into a single value (a weighted sum using the filter’s learned weights) that gets added, at a position relative to the input, to the feature map.

Two types of pooling-based downsampling: the left takes the maximum value of the window, the right takes the average. Our strided convolutions shrink the image in the same spirit, but combine the window using learned weights | Image source

The next argument strides, indicates how many pixels the filter should move before doing the downsampling again. In our case since the argument is an integer value of 2, the downsampling will occur after the filter moves two pixels over.

The final argument is padding. This relates to the shape of the whole downsampling process. Since in all the functions the type of padding we use is called ‘same’, we will look at this type. The purpose of ‘same’ padding is to keep the feature map the same shape as the input (before any striding is applied). To achieve this, the input is surrounded with zeros, represented by the white space in the figure below. Although not used in this GAN, ‘same’ padding is similar to a constant-padding option, which allows specific values in the padding instead of just zeros.
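
If you want to sanity-check how ‘same’ padding and a stride of 2 interact, a quick illustrative shape check looks like this:

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))            # one fake 32x32 RGB image
layer = tf.keras.layers.Conv2D(128, (3, 3), strides=(2, 2), padding='same')
print(layer(x).shape)   # (1, 16, 16, 128): zero padding keeps the window valid
                        # at the edges, and the stride of 2 halves height and width
```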

Visual representation of padding where the blank areas are filled with zeros | Image source

LeakyReLU

A LeakyReLU, short for Leaky Rectified Linear Unit, is a type of activation function that is important after every manipulation in the layers. Its purpose is to solve the vanishing gradient problem that causes challenges during training. Every part of our convolutional neural network requires its weights to get updated for it to continue training. It does this through stochastic gradient descent. However, in some circumstances the gradient becomes so insignificant that the weights do not get changed and the model does not get trained. One proposed solution to this problem was the Rectified Linear Unit (ReLU).

ReLUs are a special kind of function that dictates that if x < 0 then output = 0, but if x > 0 then output = x. The issue with this, however, is that the gradient descent we use to update our model sees a gradient of 0 for those zeroed outputs, meaning that any neuron that keeps producing negative values (especially big negative values) cannot improve and train itself, causing the neuron to “die” and become useless, even detrimental. In other words, using a ReLU only changes the type of problem, producing an unstable model. The LeakyReLU solves the vanishing gradient problem and the ReLU’s limitation by dictating that if x > 0 then output = x, but if x < 0 then output = zx. The float within the LeakyReLU function defines the coefficient z, which means in our case if x < 0 then output = 0.02x.
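
As a quick illustration of that piecewise definition (using NumPy, with a slope of 0.2 as an example value):

```python
import numpy as np

def leaky_relu(x, z=0.2):
    # output = x when x > 0, z * x when x < 0 (a plain ReLU would output 0 instead)
    return np.where(x > 0, x, z * x)

print(leaky_relu(np.array([-5.0, -1.0, 0.0, 2.0])))   # [-1.  -0.2  0.   2. ]
```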

ReLU v LeakyReLU Comparison. Visually it’ll make sense where the “leak” occurs | Image source

Flatten + Dropout

After this process is repeated a couple of times we have the Flatten() function. Since our model is sequential, our 4x4x256 feature map will be the input to Flatten. Flatten changes the shape of our array so that everything is in one dimension. This in turn means that our 4x4x256 n-dimensional array turns into a 4,096-element one-dimensional object. We can also validate this by referencing the model.summary() output.

Visual Representation of keras.layers.Flatten() | Image source

This object now goes into the Dropout() function. Dropout is a type of regularization aimed at tackling the problem of overfitting. Overfitting, in a quick description, is the model learning irrelevant relationships in the training data (i.e. noise, etc.) in such great depth that new data that isn’t from the training set cannot be evaluated properly. Dropout’s technique is to train the model to treat individual neurons within the network as unreliable. It does this by briefly “turning off” neurons and their relevant connections, and in most cases this helps combat overfitting.

The float within the function (in our case 0.4) indicates the probability of a neuron getting the Dropout function applied to it. This approach was originally suggested by a research paper whose authors also found that, as a rule of thumb, 0.5 is good for hidden layers and 0.1 is good for input layers, although this was based on results from the MNIST dataset.

Brilliantly simple diagram to convey Dropout() | Image source

Dense + Compiling

Okay, so we now have a 4,096-element, one-dimensional array… How are we going to turn this into a binary value of true or false? This is where the Dense() function comes in. Dense is a fully connected layer, most commonly used to change the dimensions and shape of what it receives into the output shape we want. In our Dense function, the first argument is the integer 1, which specifies the size of our output. Since our output is simply a true/false, this fits the stipulations perfectly.

This is followed by an activation, which in this case is ‘sigmoid’. The sigmoid function has the ability to change Dense’s output into something useful, specifically a number between 0 and 1.

Sigmoid function visually | Image source

Finally we compile all of this into our model (ending the Sequential). We start off by evaluating our model with the loss function, which maps decisions to their related costs. This will help our discriminator get much better over time. Our loss function is ‘binary_crossentropy’ (BCE), which means that the loss function is evaluating how good our model is at predicting the true-ness or false-ness of the image. This works on top of the sigmoid output, and a greater loss indicates a worse model.

Our model is trying to gauge the probability that an image is real/fake. Similarly, this BCE is trying to find the likelihood of a dot being green based off its location | Source

This is followed by an Adam optimizer, which helps update the weights as an extension of stochastic gradient descent, and a metrics value that will be useful for evaluating our discriminator’s predictions later on.

Generator

The generator of this model has to take random points from latent space (random Gaussian-distributed values) and turn them into a 32x32x3 image. In order to do this, we are going to start off with a smaller, rudimentary image (in our case 4x4).

Foundations

A recommended practice when starting off with such a small set of dimensions is to provide the model with multiple versions of the smaller image, meaning there are many parallel feature maps. This allows for more diversity and more interpretations, which helps our model. In our case we start with 256 low-resolution 4x4 feature maps, which works out to 256 × 4 × 4 = 4,096 activations.

This gives us the input for our Sequential model, which goes into our Dense function as the size of the output (which we already talked about earlier), along with our latent space input (we will come back to this later), which assigns values to each of the 4,096 activations.

This goes through a LeakyReLU (which we talked about earlier) and then gets reshaped into the 4x4x256 block of feature maps that we wanted.

Upsampling

Just like we outlined, our next step will be to use our CNN to upsample the image to eventually reach the 32x32x3 target. To upsample we will be using the Conv2DTranspose() function. This is just like its downsampling counterpart, except each convolution takes the 4x4 window of pixels specified by the kernel_size and upsamples it. The results are then stored in the output feature map at a position relative to the input.

A transposed convolution with a 2x2 kernel and a stride of 1. When values need to be determined that do not exist, multiplication is done between the nearest existing neighbours, or in the case of a corner, the nearest corner is squared | Image source

This is followed by a LeakyReLU, and then a final convolutional layer turns our 32x32x128 feature map into a 32x32x3 RGB image which we can then return.
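
Here is a minimal sketch of a generator matching this description: a Dense layer expands the latent point into 4x4x256 feature maps, three Conv2DTranspose layers with 4x4 kernels and a stride of 2 upsample to 8x8, 16x16 and then 32x32, and a final Conv2D collapses the 32x32x128 feature maps into a 32x32x3 image. The filter counts, the LeakyReLU slope and the tanh output activation are assumptions, consistent with images scaled to the -1 to 1 range.

```python
def define_generator(latent_dim):
    model = Sequential()
    # foundation: enough nodes for 256 low-resolution 4x4 feature maps
    n_nodes = 256 * 4 * 4
    model.add(Dense(n_nodes, input_dim=latent_dim))
    model.add(LeakyReLU(0.2))
    model.add(Reshape((4, 4, 256)))
    # upsample 4x4 -> 8x8
    model.add(Conv2DTranspose(128, (4, 4), strides=(2, 2), padding='same'))
    model.add(LeakyReLU(0.2))
    # upsample 8x8 -> 16x16
    model.add(Conv2DTranspose(128, (4, 4), strides=(2, 2), padding='same'))
    model.add(LeakyReLU(0.2))
    # upsample 16x16 -> 32x32
    model.add(Conv2DTranspose(128, (4, 4), strides=(2, 2), padding='same'))
    model.add(LeakyReLU(0.2))
    # collapse the 32x32x128 feature maps into a 32x32x3 image in [-1, 1]
    model.add(Conv2D(3, (3, 3), activation='tanh', padding='same'))
    return model
```

Note that the generator is not compiled on its own; it is only ever trained through the combined model described next.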

Combining the Two Parts

Here we are basically creating one model that contains both the generator and discriminator and then compiling it. The reason we mark the discriminator as non-trainable is that, when this combined model is updated, we only want the generator’s weights to change; letting those updates touch the discriminator would hurt it. This setting, however, does not stop the discriminator’s layers from being trained when we call the discriminator directly, which is exactly what we rely on when updating the discriminator on its own batches.
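
A hedged sketch of that combined model, under the same assumptions as before (the optimizer settings are a common choice, not necessarily the author’s):

```python
def define_gan(generator, discriminator):
    # freeze the discriminator so only the generator's weights are updated
    # when this combined model is trained
    discriminator.trainable = False
    model = Sequential()
    model.add(generator)
    model.add(discriminator)
    opt = Adam(learning_rate=0.0002, beta_1=0.5)
    model.compile(loss='binary_crossentropy', optimizer=opt)
    return model
```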

Getting Our CIFAR Dataset Ready

Here we are loading real samples. When we call CIFAR-10’s load_data() function we get back a training set (50,000 images) with their respective labels and a test set (10,000 images) with their respective labels. In this case we only need the images, not the labels, and the test set is not relevant.

After this, since we need to turn the values from 0 to 255 (the value representing the intensity of the R, G, or B channel) into values between -1 and 1, we will first turn the integers into floats and then scale them down.
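
A sketch of that loading-and-scaling step (the function name is my own shorthand; the article’s code may differ):

```python
def load_real_samples():
    # load the CIFAR-10 training images; labels and the test set are ignored
    (trainX, _), (_, _) = load_data()
    # convert from unsigned ints to floats, then scale [0, 255] -> [-1, 1]
    X = trainX.astype('float32')
    X = (X - 127.5) / 127.5
    return X
```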

We can follow this up by defining a function to pick out real images. When given the dataset and the number of samples needed, it has to provide that number (the integer determined by n_samples) of random instances from the CIFAR-10 dataset and return them with the label 1, to indicate they are real and not generated by the generator. The randint() function gives us random integers in the range from 0 to the dataset’s size (in this case 50,000).
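
A sketch of that sampling function, using the generate_dataset_instances() name that the training section refers to; the exact body is an assumption based on the description.

```python
def generate_dataset_instances(dataset, n_samples):
    # pick n_samples random indices from the 50,000 training images
    ix = randint(0, dataset.shape[0], n_samples)
    X = dataset[ix]
    # label them 1 to mark them as real
    y = ones((n_samples, 1))
    return X, y
```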

Generating Fake Samples

To begin generation, our generator needs a foundation to build on. Since we want our generator to be making the image, a random set of values (noise) from our latent space (latent_dim) will be needed. The number of random values generated by our randn() function is determined by the number of samples we need (n_samples); they are then reshaped for our neural network and returned.
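
A sketch of generate_latent_points(), following that description:

```python
def generate_latent_points(latent_dim, n_samples):
    # draw latent_dim Gaussian values per sample, then reshape into a batch
    x_input = randn(latent_dim * n_samples)
    return x_input.reshape(n_samples, latent_dim)
```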

Here we see the fake samples actually getting generated. First we call generate_latent_points(), which we just covered, and its output goes into our generator, which generates the images. These get sent off with a label of zero, meaning that they are fake and by extension not from the dataset.
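
A sketch of that step (the name generate_fake_samples() is my own shorthand for the function described here):

```python
def generate_fake_samples(generator, latent_dim, n_samples):
    # sample latent points and let the generator turn them into images
    x_input = generate_latent_points(latent_dim, n_samples)
    X = generator.predict(x_input)
    # label them 0 to mark them as fake
    y = zeros((n_samples, 1))
    return X, y
```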

This function will be used during training to plot generated training images in a visual format. Since all our values are between -1 and 1, we need to turn them into values between 0 and 1 for pyplot. After plotting all the images, we give the plot a file name with the appropriate epoch number (plus one, because the epochs start at zero) and then save it. The current configuration is for Google Colaboratory, but if you are doing this locally or elsewhere, change the directory accordingly.
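
A sketch of the plotting helper as described; the grid size and file name pattern are assumptions, so point the path at your own Colab or local directory.

```python
def save_plot(examples, epoch, n=7):
    # rescale images from [-1, 1] back to [0, 1] for pyplot
    examples = (examples + 1) / 2.0
    for i in range(n * n):
        pyplot.subplot(n, n, 1 + i)
        pyplot.axis('off')
        pyplot.imshow(examples[i])
    # epoch + 1 because epochs are counted from zero
    filename = 'generated_plot_e%03d.png' % (epoch + 1)
    pyplot.savefig(filename)
    pyplot.close()
```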

Training and Summarization

Okay, so here is the training function (a hedged sketch follows at the end of this description). This may look overwhelming, but it is the fundamental basis that allows the GAN to perform. You will see that the function takes a lot of arguments; this is by design, so that the model can be adapted, changed and reused really easily.

To figure out how many batches there are per epoch, we divide the number of instances available by the specified batch size. For the CIFAR-10 dataset the answer is 390 (rounding down, because we can’t use more than the 50,000 images) when dictating 128 instances as our n_batch variable. This means 390 batches will be used each epoch.

We will be using a GAN best practice where we separate the real and fake images. This also means the discriminator gets updated twice per batch. We start off by calling our generate_dataset_instances() function, which brings back instances from the dataset and a label marking them as real. These go into our discriminator via the train_on_batch() function. This means that the batch will only be used once (not multiple times like fit()) and the discriminator will get updated with the results. The loss of this is then stored to be printed.

The same process occurs with the generated images, which are also passed through train_on_batch() and used to update the discriminator. With this done, we take the whole GAN model and do another train_on_batch() where the input data is samples from latent space and the target is 1 (the label for a real image). This updates the generator through the combined GAN model, and the loss is stored for printing.

We will print the loss statements once every certain number of batches, determined by the batchloss variable. After a whole epoch is complete, we save the epoch results by calling save_plot() as dictated by our settings. Furthermore, once every certain number of epochs (determined by the int represented by summarization), summarize_performance() will be called, the details of which we will talk about next.
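
Putting the pieces together, here is a condensed sketch of the training loop described above. It keeps the arguments this article mentions (batchloss, summarization), trains the discriminator on separate real and fake half-batches, then updates the generator through the combined model with labels flipped to 1; the weight-loading and image-saving-frequency options are simplified, so treat the details as assumptions rather than the author’s exact code.

```python
def train(g_model, d_model, gan_model, dataset, latent_dim,
          n_epochs=200, n_batch=128, batchloss=10, summarization=5):
    bat_per_epo = dataset.shape[0] // n_batch   # 390 batches per epoch for CIFAR-10
    half_batch = n_batch // 2
    for epoch in range(n_epochs):
        for batch in range(bat_per_epo):
            # update the discriminator on a half-batch of real images
            X_real, y_real = generate_dataset_instances(dataset, half_batch)
            d_loss_real, _ = d_model.train_on_batch(X_real, y_real)
            # update the discriminator on a half-batch of fake images
            X_fake, y_fake = generate_fake_samples(g_model, latent_dim, half_batch)
            d_loss_fake, _ = d_model.train_on_batch(X_fake, y_fake)
            # update the generator via the combined model, labelling fakes as real (1)
            X_gan = generate_latent_points(latent_dim, n_batch)
            y_gan = ones((n_batch, 1))
            g_loss = gan_model.train_on_batch(X_gan, y_gan)
            # print losses every `batchloss` batches (0 means every batch)
            if batchloss == 0 or (batch + 1) % batchloss == 0:
                print('>%d, %d/%d, d_real=%.3f, d_fake=%.3f, g=%.3f' %
                      (epoch + 1, batch + 1, bat_per_epo,
                       d_loss_real, d_loss_fake, g_loss))
        # save a plot of generated images at the end of the epoch
        X_plot, _ = generate_fake_samples(g_model, latent_dim, 49)
        save_plot(X_plot, epoch)
        # periodically evaluate the discriminator and save the models
        if (epoch + 1) % summarization == 0:
            summarize_performance(epoch, g_model, d_model, dataset, latent_dim)
```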

Here we follow similar steps to our train function, except instead of train_on_batch() we are evaluating our model’s loss. This means the models are not updated at this step, and the metrics (like the accuracy metric we added when we created the discriminator) are used instead. These come back as decimals, but we will just turn them into percentages for ease.

After we have our performance summarized, we save the weights of our model. Make sure to specify your directory if you aren’t on Google Colaboratory. Saving weights is useful if your model stops training unexpectedly, the program crashes or the model fails. In all these cases the model can be reverted to an earlier instance and retrained from there. Finally, we save the generator itself, which is the final product of the GAN. This will be what we use to generate more images after training is done.
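
A sketch of summarize_performance() along those lines; the file names and the sample count are placeholders, so adjust the paths for Colab or your local machine.

```python
def summarize_performance(epoch, g_model, d_model, dataset, latent_dim, n_samples=150):
    # evaluate (not train) the discriminator on real images
    X_real, y_real = generate_dataset_instances(dataset, n_samples)
    _, acc_real = d_model.evaluate(X_real, y_real, verbose=0)
    # evaluate it on freshly generated images
    X_fake, y_fake = generate_fake_samples(g_model, latent_dim, n_samples)
    _, acc_fake = d_model.evaluate(X_fake, y_fake, verbose=0)
    print('>Accuracy real: %.0f%%, fake: %.0f%%' % (acc_real * 100, acc_fake * 100))
    # save the weights (the "params" files) so training can be resumed later
    g_model.save_weights('generator_params_%03d.h5' % (epoch + 1))
    d_model.save_weights('discriminator_params_%03d.h5' % (epoch + 1))
    # save the generator itself: the final product we will generate images with
    g_model.save('cifar_generator_%03d.h5' % (epoch + 1))
```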

Control Panel

The control panel gives you some amount of flexibility with what you do without messing with all the previous code. We have the size of the latent space which is arbitrarily set to 100. This can be changed for different results although making it too small could result in mode collapse.

Summarization indicates once every how many epochs you want to summarize your performance and save your model. In this case, since the int is 5, it will do so every 5 epochs.

Batchloss dictates how many batches you want to go through before getting a printed batch loss. I set it to 10, and any number is acceptable, although when you are diagnosing problems you might want this lower. Set it to over 390 to never get a report and to 0 to get one every batch.

N_epochs means how many epochs you want to run this for; in many cases more is better. N_batch dictates how many instances you want per batch. Changing this will also change the number of batches, so be prepared. Save_freq refers to how often you want your images saved during training. In our case, since it’s 0, we will get one every epoch.

The load_weights variable depends on whether you want to resume training from a previous spot. When set to true, this will also require the epochs variable to be consistent with the number on your “params” files. If you are starting from scratch, just set load_weights to false and epochs to 0.

The final lines are just going through everything we already covered and bundling everything up into variables for the train() function.
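
Here is a hedged reconstruction of what that control panel might look like with the settings described above; the variable names mirror the prose, and the weight file names must match whatever your summarize_performance() saved.

```python
# control panel
latent_dim = 100      # size of the latent space
summarization = 5     # summarize performance and save models every 5 epochs
batchloss = 10        # print batch losses every 10 batches (0 = every batch)
n_epochs = 200        # how many epochs to train for
n_batch = 128         # instances per batch -> 390 batches per epoch on CIFAR-10
save_freq = 0         # 0 = save a plot of generated images every epoch
                      # (handled inside the full train(); omitted from the sketch above)
load_weights = False  # True to resume training from saved "params" files
epochs = 0            # epoch number on your params files when resuming

# build everything and train
discriminator = define_discriminator()
generator = define_generator(latent_dim)
gan_model = define_gan(generator, discriminator)
dataset = load_real_samples()

if load_weights:
    generator.load_weights('generator_params_%03d.h5' % epochs)
    discriminator.load_weights('discriminator_params_%03d.h5' % epochs)

train(generator, discriminator, gan_model, dataset, latent_dim,
      n_epochs=n_epochs, n_batch=n_batch,
      batchloss=batchloss, summarization=summarization)
```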

Once You’re Done

Once you’re done training, start by patting yourself on the back before generating new images! This isn’t easy, even at such a small scale!

But if you want to see the fruits of your labour, run the code below in a new cell (if you are in Google Colaboratory) or a separate file.

Everything here is stuff that we have covered. All we are doing here is using our generator to create images. The number variable decides how many images you want in your plot (remember that this number gets squared so 10 is really 100 images). The only other thing you should consider is the file name. Make sure that matches up with the epoch that you want and everything is splendid!
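
A sketch of that generation cell, under the assumptions used throughout; the saved model’s file name and the latent size of 100 must match your own training run.

```python
from numpy.random import randn
from matplotlib import pyplot
from tensorflow.keras.models import load_model

def generate_latent_points(latent_dim, n_samples):
    x_input = randn(latent_dim * n_samples)
    return x_input.reshape(n_samples, latent_dim)

# load the saved generator; match the epoch number to the model you want
model = load_model('cifar_generator_200.h5')

number = 10                                   # plot will be number x number images
latent_points = generate_latent_points(100, number * number)
X = model.predict(latent_points)
X = (X + 1) / 2.0                             # rescale from [-1, 1] to [0, 1]

for i in range(number * number):
    pyplot.subplot(number, number, 1 + i)
    pyplot.axis('off')
    pyplot.imshow(X[i])
pyplot.show()
```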

See the full code in my GitHub

Want to see my results?

I’ll be posting a video soon where I share some of my stories and some further steps that I took with this project! Stay tuned (and follow me for updates)!

Huge thanks to the following articles for making this project possible!

This project is largely based off of: CIFAR-10 GAN from Machine Learning Mastery

Additional support from the following: CIFAR-10 ACGAN from King of Knights, MNIST CGAN from Machine Learning Mastery, MNIST ACGAN from Machine Learning Mastery

Final Thoughts…

This article obviously cannot tell you about every single thing related to GANs. It’s time to experiment, combine different tutorials and broaden your understanding. This is part of my learning journey and a way to account for my understanding. As you move forward in this field, consider these best practices from this presentation at a GAN conference and this paper:

Best practices with GANs

  1. Upsample and downsample using strided convolutions. This is what we did in our models and helps the performance of the generator and discriminator.
  2. Use LeakyReLU. We talked about the benefits of this above when we used it in our models. Highly recommend especially over just a regular ReLU.
  3. Use the BatchNormalization() function. It helps our GAN by re-scaling and re-centring the activations flowing through the network, although even experts aren’t entirely sure why it helps as much as it does. We didn’t use this in our GAN, but it is something to keep in mind.
  4. Scale images. Scaling our input images to something like -1 to 1 helps our model by keeping consistency between the labels it’s outputting, the generated images, and the real / fake inputs. This is something we did do in our GAN.
  5. Train batches based on their labels. When training the discriminator, one common practice was to input both the real and fake samples at the same time. It is now known to perform better when the real and fake samples are trained separately in their own batches. We did this in our train() function.
  6. Reduce discriminator label reliance. By occasionally switching real and fake instances’ class labels (i.e. labelling a generator’s output as real (1) even when it’s not), we can improve the robustness of the discriminator.
  7. Use Gaussian-distributed latent space. We did this with generate_latent_points(), and it helps increase diversity in the images the generator is creating.
  8. Use the Adam Optimizer. We discussed previously how the Adam optimizer benefits the continued updating of our model’s weights.

The two issues to look out for in GANs

GANs are obviously not simple in any way, and that also means there is a whole host of problems to consider, of which the two most important to keep a lookout for are mode collapse and non-convergence.

Mode collapse

Mode collapse is a severe problem where your GAN stops playing by the spirit of the GAN game and sets its objective to solely fooling the discriminator. In this description we have personified our generator, but what is really happening is that the generator is cycling between a select few “modes” to trick the discriminator into validating it. Usually this involves the generator outputting a certain kind of image that the discriminator validates. After finding out it validated incorrectly, the discriminator now knows that what the generator sent is the wrong type of image and therefore that the other modes are the correct version. This prompts the generator to choose one of the other modes as its next output, and the discriminator restarts the cycle by now determining that mode as incorrect and the other ones as correct.

Take the example of a generator that is built to make cats and dogs. If our generator submits a cat, the discriminator will now dictate that all cats must be fake and therefore all dogs must be real. Based on this logic, our generator will create an image of a dog, which reverses the fooled discriminator’s predictions: it will now say all dogs are fake and all cats are real. In this case our two modes are the cat and the dog, and when printed out as images we will see an oscillating pattern. The tell-tale sign that your GAN has suffered mode collapse is poor diversity among the images, which will be very evident. This is very difficult to resolve and it’ll take a bit of searching to find an answer.

Loss of a mode collapsed GAN and the discriminator's accuracy | Image Source

Non-Convergence

Non-convergence is the situation where your GAN doesn’t seem to “converge” from its basic noise stage. Converging is often a difficult process that comes with the iterations, but seeing noise after several epochs, especially without proper colours or shapes, is a big tell-tale sign. Make sure this is not the same as your images simply not looking like your dataset; that is just a matter of training and your GAN working through the situation. If your images don’t look quite right, that is not a sign of non-convergence. However, if there is a lot of noise, distorted images and so on, that is non-convergence, and reviewing your architecture is essential.

This is an instance from my GAN experiments where, after 180 epochs, my model is suffering from both mode collapse (look for similar patterns) and non-convergence. This is supposed to be a CIFAR-10 ACGAN.
Here is an example of a healthy GAN while training, based on the tutorial we did today, at 180 epochs. This is the CIFAR-10 GAN we built today.

10 Takeaways to Consider

  1. GANs in a nutshell are a feedback loop where two components compete to outsmart each other.
  2. Repetition is key, with slow progress and random chance, great things can happen (I’ve found this true in the real world too)
  3. GANs are a young area that is ripe for innovation, yet their pure science is slowly finding more and more uses in the real world
  4. GAN architecture is what really matters; architecture = accomplishments
  5. Be meaningful with what you do, just with any other technology GANs ethics have a huge impact on society
  6. The discriminator must turn an image into a true or false output, and it can do so by using CNNs
  7. The generator must turn random noise and multiple smaller feature maps into one bigger output. This can be done by upsampling
  8. Tweaking weights is the key to training; optimizers and training procedures all aim to do this one thing
  9. GANs are like building blocks. You can combine them and add new ones to make a completely new architecture. Some of those building blocks are best practices to keep in mind
  10. Always be aware of GAN problems. The major ones are mode collapse and non-convergence, and they don’t have a checkbox answer to resolve them

Conclusion

Overall, it’s clear how much depth and complexity GANs have. Even for a simple 32x32 GAN there are so many factors to consider. The breakthroughs made by StyleGAN and others are beautiful and something to marvel at. The basic concept is smart, taking right from our evolutionary and societal drive to compete and play games. And this was introduced not so long ago: in 2014 the concept was nonexistent, and now, just 8 years later, we are standing on the shoulders of giants.

Every function, every optimizer, every method has been proposed, scrutinized and carefully reviewed in papers (much longer than this article). And yet we can argue we still don’t know why GANs work the way they do. “GANs are a black box, full of mystery”: such a simple quote for such a complex topic. And in a way, every GAN is a unique piece mimicking nature and its randomness. I’ve heard that GANs are kind of like an alien, getting relationships and ideas pretty accurate but just not quite there. Yet the pursuit to make GANs as a whole just a little bit better than yesterday is admirable, and a journey that I hope you will join me on.

-Abhinav Menon

Hey! I am Abhinav a 14-year-old who is passionate about the environment, people and technology. If you enjoyed this article, give it a clap! You can find me on LinkedIn, I would love to connect! Stay creative!🦄
