Create Any Image With C# And A Generative Adversarial Network
In this article I’m going to build a specialized neural network architecture called a Generative Adversarial Network (GAN).
GANs are pretty weird. Here’s what they look like:
The Generator is a convolutional neural network (CNN) laid out in reverse.
A normal CNN reads in an image and outputs a list of class probabilities which usually indicate if the image contains a particular person, animal, or object.
But a reverse CNN does the opposite: we create a 1-dimensional class vector (basically just a list of numbers) and the network will convert this information to a fully realized machine-generated color image. And by tweaking the class vector, we can make the network generate any kind of image.
Of course these images might not be very good. The generator might try to generate a picture of a horse, but how would it know if the output image looks anything like a real horse?
This is where the Discriminator comes in.
The discriminator is a regular CNN that is trained to identify images of people, animals, objects, or landscapes. Its job is to look at the images created by the generator, compare them to a dataset of real images, and correctly identify every real and generated image.
We can now run the generator and the discriminator against each other. We have the generator create a stream of fake images and then feed these images into the discriminator. Then we ask the discriminator to classify all the fakes produced by the generator:
At first, this will be easy because the generator will not be very good at creating fakes. But after each training epoch, the generator gets a little better at producing fakes, and the discriminator becomes a little better at spotting the fakes.
We continue to train until the fakes produced by the generator have become so good that we humans can no longer tell the difference between the fakes and the real images.
This specific architecture is called a Generative Adversarial Network (GAN), and it’s a very cool and active area of research right now.
We can use GANs to create a wide range of computer-generated images:
- human faces
- computer game characters
- … and much more!
Four years ago, a group of machine learning researchers used a GAN to create pictures of human faces. Here’s what the state of the art looked like back then:
Not bad, right?
But that was four years ago. Check out what’s possible today:
Keep in mind that these people don’t exist anywhere on the planet. Their faces have been randomly constructed, one pixel at a time, by a GAN trained on a dataset of human faces.
How cool is that?
I’m going to build an app that sets up a generative adversarial network, trains it on a well-known image dataset, and uses it to generate images of frogs.
I’m going to use the famous CIFAR-10 dataset which contains 60,000 color images of 32x32 pixels. Here’s a sampling of images from each of its 10 classes:
I’ll use a subset of the dataset that contains 5,000 images of frog species. You can download the image subset here. Save the file as frog_pictures.zip and place it in the project folder I’m about to create below.
My challenge is to build an app that uses a GAN to generate better and better images of hypothetical frogs, until the images are so good that they look exactly like real frogs.
Let’s get started. I will create a new .NET Core application from scratch:
$ dotnet new console -o GanDemo
$ cd GanDemo
I will place the zip file with the frog images in this folder because the code I’m going to type next will expect it here.
Now I’ll install the following package:
$ dotnet add package CNTK.Gpu
The CNTK.Gpu package contains Microsoft’s Cognitive Toolkit, a library for training and running deep neural networks on your GPU. You’ll need an NVIDIA GPU and CUDA graphics drivers for this to work.
If you don’t have an NVIDIA GPU or suitable drivers, the library will fall back to the CPU instead. This will work, but training neural networks will take significantly longer.
CNTK is a low-level tensor library for building, training, and running deep neural networks. The code to build deep neural networks can get a bit verbose, so I’ve developed a little wrapper called CNTKUtil that will help me write code faster.
I’ll download the CNTKUtil files and place them in a new CNTKUtil folder at the same level as the GanDemo project folder.
Now I can create a project reference like this:
$ dotnet add reference ../CNTKUtil/CNTKUtil.csproj
Now I am ready to start writing code. I’ll edit the Program.cs file with Visual Studio Code and add the following code:
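Here is a sketch of what that setup code might look like. The name of the data file inside the zip archive and the exact signature of the DataUtil.LoadBinary helper are my assumptions; the wrapper you downloaded may differ slightly:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using CNTKUtil; // the helper wrapper referenced in this article

class Program
{
    static void Main(string[] args)
    {
        // extract the frog images from the zip archive (skip if already done)
        // NOTE: "frog_pictures.bin" is a hypothetical file name inside the archive
        if (!File.Exists("frog_pictures.bin"))
            ZipFile.ExtractToDirectory("frog_pictures.zip", ".");

        // load 5,000 frog images of 32x32 pixels with 3 color channels
        var trainingData = DataUtil.LoadBinary<float>("frog_pictures.bin", 5000, 3 * 32 * 32);

        // ... network setup and training code goes here ...
    }
}
```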
This code calls ZipFile.ExtractToDirectory to extract the dataset from the archive and store it in the project folder.
Then I use the DataUtil.LoadBinary method to load the frog images in memory. You can see from the arguments that I’m loading 5000 images of 32x32 pixels with 3 color channels.
Now I need to tell CNTK what shape the input data of the generator will have:
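A sketch of this step using the stock CNTK C# API (the variable names are mine):

```csharp
// the generator input: a 1-dimensional latent vector with 32 dimensions
const int latentDimensions = 32;
var generatorVar = CNTK.Variable.InputVariable(
    CNTK.NDShape.CreateNDShape(new[] { latentDimensions }),
    CNTK.DataType.Float,
    "generator_input");
```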
The input to the generator is a 1-dimensional tensor with a preset number of latent dimensions (32 in this case). By inputting numbers on these 32 input nodes, I trigger the generator to create a new image that will hopefully resemble a frog.
My next step is to design the generator. I’m going to use a reverse CNN with 7 layers that looks like this:
Here are some interesting facts about this network:
- The generator looks like a CNN in reverse. We start with a 1-dimensional input vector, the layers get progressively larger, and we end with a 32 by 32 pixel image at the output.
- There’s a layer called Reshape. It converts the 1-dimensional output of the dense layer into a 2-dimensional tensor which more closely resembles an image.
- There’s a layer called Transposed Convolution. This is a reverse version of the convolution layer that makes the image larger, not smaller.
- All the convolutional layers use the LeakyReLU activation function, not the regular ReLU function.
- The final convolutional layer uses the hyperbolic tangent (Tanh) activation, not the Sigmoid function.
You might be wondering how this specific architecture was discovered, and the answer is: by plain old trial and error. Many researchers have tried to create stable generators and they have come up with these guidelines.
Nobody really knows why this particular network architecture works well and all the alternatives are unstable. The field of machine learning is simply too young, and for now all we have are these field-tested rules of thumb.
I’ll add the following code to build the generator:
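Here is a sketch of that code, assuming CNTKUtil exposes fluent Dense, Reshape, Convolution2D, and ConvolutionTranspose extension methods. The exact signatures are my guess, and the 7x7 kernel in the final layer follows the classic Keras DCGAN recipe that this architecture resembles:

```csharp
// a helper for the LeakyReLU activation used throughout the generator
Func<CNTK.Variable, CNTK.Function> leakyRelu = v => CNTK.CNTKLib.LeakyReLU(v, 0.1);

// build the generator: a CNN in reverse that turns the latent vector into an image
var generator = generatorVar
    .Dense(16 * 16 * 128, leakyRelu)                // project the latent vector...
    .Reshape(new[] { 16, 16, 128 })                 // ...and reshape it to 16x16x128
    .Convolution2D(256, new[] { 5, 5 }, activation: leakyRelu)
    .ConvolutionTranspose(                          // upsample from 16x16 to 32x32
        new[] { 4, 4 }, 256,
        strides: new[] { 2, 2 },
        activation: leakyRelu)
    .Convolution2D(256, new[] { 5, 5 }, activation: leakyRelu)
    .Convolution2D(256, new[] { 5, 5 }, activation: leakyRelu)
    .Convolution2D(3, new[] { 7, 7 },               // final layer: a 32x32x3 color image
        activation: v => CNTK.CNTKLib.Tanh(v))
    .ToNetwork();
```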
This code calls Dense and Reshape to project the latent input into a tensor with a 16x16x128 shape. Then the code calls Convolution2D to add a convolution layer with a 5x5 kernel.
The magic happens in the ConvolutionTranspose call that sets up a reverse convolution layer. The 16x16x256 input tensor is blown up to 32x32x256 using a 4x4 kernel and a stride of 2.
The output flows through two more Convolution2D layers and then encounters the final Convolution2D layer that converts the tensor to 32x32x3, exactly the dimensions of a generated CIFAR-10 color image.
Note the use of the LeakyReLU activation function everywhere, except in the final convolution layer that uses Tanh.
Now I need to tell CNTK what shape the input data of the discriminator will have:
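This could look as follows with the stock CNTK C# API (again, the variable name is mine):

```csharp
// the discriminator input: a 32x32 color image with 3 channels
var discriminatorVar = CNTK.Variable.InputVariable(
    CNTK.NDShape.CreateNDShape(new[] { 32, 32, 3 }),
    CNTK.DataType.Float,
    "discriminator_input");
```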
Remember that the generator and the discriminator are mounted end-to-end, with the generated image feeding directly into the discriminator. So the input of the discriminator is the generated image, a tensor with a 32x32x3 shape.
Now I’m ready to build the discriminator. This is a regular convolutional neural network with 4 convolutional layers:
However, note that:
- There are no pooling layers in the network. Instead, the convolution layers use strided 4x4 kernels to reduce the size of the feature maps.
- All the convolutional layers again use the LeakyReLU activation function for extra stability.
This was again discovered by trial and error.
Neural networks that use strides are a lot more stable and robust than networks that rely on pooling. And we need that extra stability to get the GAN working.
The following code will build the discriminator:
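A sketch of the discriminator, under the same assumptions about the CNTKUtil extension methods; the filter counts and kernel sizes follow the common DCGAN recipe and may differ from the actual code:

```csharp
// a helper for the LeakyReLU activation used in every convolution layer
Func<CNTK.Variable, CNTK.Function> leakyRelu = v => CNTK.CNTKLib.LeakyReLU(v, 0.1);

// build the discriminator: 4 convolution layers with strides instead of pooling,
// followed by a dropout layer and a single-node sigmoid classifier
var discriminator = discriminatorVar
    .Convolution2D(128, new[] { 3, 3 }, activation: leakyRelu)
    .Convolution2D(128, new[] { 4, 4 }, strides: new[] { 2, 2 }, activation: leakyRelu)
    .Convolution2D(128, new[] { 4, 4 }, strides: new[] { 2, 2 }, activation: leakyRelu)
    .Convolution2D(128, new[] { 4, 4 }, strides: new[] { 2, 2 }, activation: leakyRelu)
    .Dropout(0.4)
    .Dense(1, v => CNTK.CNTKLib.Sigmoid(v))
    .ToNetwork();
```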
Note the four calls to Convolution2D to set up the convolution layers with LeakyReLU activation, and the Dropout and Dense calls to add a dropout layer and a classifier using Sigmoid activation.
Now I’m ready to assemble the GAN. This is very easy:
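With the generator and discriminator networks in hand, the assembly is a single call:

```csharp
// join the generator and discriminator end-to-end into a single GAN
var gan = Gan.CreateGan(generator, discriminator);
```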
The Gan helper class has a nice CreateGan method that will assemble a GAN from a generator and a discriminator by joining them together.
Now I need to tell CNTK the shape of the output of the GAN:
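A sketch of that declaration, assuming the NDShape constructor accepts a rank of zero for a 0-dimensional tensor:

```csharp
// the GAN output: a single value indicating fake (0) or real (1)
var labelVar = CNTK.Variable.InputVariable(
    new CNTK.NDShape(0), CNTK.DataType.Float, "label");
```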
Remember that a GAN is a generator and a discriminator laid end to end. The discriminator classifies all images into fakes and non-fakes and only has a single output node. So the GAN itself also has a single output node, and I can tell CNTK that the output is a 0-dimensional tensor (= a single node).
Now I need to set up the loss function to use to train the discriminator and the GAN. Since I’m basically classifying images into a single class (fake/non-fake), I can use binary cross entropy here:
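This uses CNTK's built-in BinaryCrossEntropy function, once for each network:

```csharp
// binary cross entropy loss for both the discriminator and the complete GAN
var discriminatorLoss = CNTK.CNTKLib.BinaryCrossEntropy(discriminator, labelVar);
var ganLoss = CNTK.CNTKLib.BinaryCrossEntropy(gan, labelVar);
```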
Next I need to decide which algorithm to use to train the discriminator and the GAN. There are many possible algorithms derived from Gradient Descent that I can use here.
For this article I’m going to use the AdaDeltaLearner. This learning algorithm works well for training GANs:
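A sketch of setting up the learners, assuming CNTKUtil wraps CNTK's CNTKLib.AdaDeltaLearner in a GetAdaDeltaLearner extension method (the helper name and its learning-rate argument are my guesses):

```csharp
// one AdaDelta learner for the discriminator and one for the complete GAN
var discriminatorLearner = discriminator.GetAdaDeltaLearner(1);
var ganLearner = gan.GetAdaDeltaLearner(1);
```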
I’m almost ready to train the GAN. My final step is to set up trainers for calculating the discriminator and GAN loss during each training epoch:
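Assuming GetTrainer takes a learner plus the loss and evaluation functions, this might look like:

```csharp
// set up trainers that track the discriminator loss and the GAN loss
var discriminatorTrainer = discriminator.GetTrainer(
    discriminatorLearner, discriminatorLoss, discriminatorLoss);
var ganTrainer = gan.GetTrainer(ganLearner, ganLoss, ganLoss);
```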
The two GetTrainer calls set up trainers that will track the discriminator and GAN loss during the training process.
Now I’m finally ready to start training!
Let’s set up an output folder to store the images:
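A minimal sketch (the folder name is my choice):

```csharp
// create the output folder for the generated images
var outputFolder = "images";
if (!Directory.Exists(outputFolder))
    Directory.CreateDirectory(outputFolder);
```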
And now I can start training:
I am training the GAN for 100,000 epochs using a batch size of 8 images.
Training a GAN is a 5-step process:
- Run the generator to get a list of fake images
- Combine real and fake images into a training batch
- Train the discriminator on this batch
- Combine real and fake images into a misleading batch where every image has been incorrectly labelled
- Train the GAN on the misleading batch to help the generator create better fakes
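The five steps above map onto a training loop skeleton like this:

```csharp
// train for 100,000 epochs with a batch size of 8 images
const int numEpochs = 100_000;
const int batchSize = 8;
for (var epoch = 0; epoch < numEpochs; epoch++)
{
    // steps 1-3: train the discriminator on a mix of real and fake images
    // steps 4-5: train the GAN on a misleading batch
}
```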
Let’s start with training the discriminator:
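Here is a sketch of the discriminator step inside the training loop. The helper signatures and the shape of the batch object are my assumptions, based on the method names described in this article:

```csharp
// step 1: use the generator to produce a batch of fake images
var generatedImages = Gan.GenerateImages(generator, batchSize, latentDimensions);

// steps 2-3: mix the fakes with real images, label everything correctly,
// and train the discriminator on the combined batch
var batch = Gan.GetTrainingBatch(
    discriminatorVar, generatedImages, trainingData, batchSize, epoch);
discriminatorTrainer.TrainBatch(
    new[] {
        (discriminatorVar, batch.featureBatch),
        (labelVar, batch.labelBatch)
    }, true);
```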
The Gan.GenerateImages method uses the generator to create a list of fake images. I then call Gan.GetTrainingBatch to get a new batch of 8 images to train on. This batch contains a mix of real and fake images with each image correctly labelled.
I then call TrainBatch on the discriminator trainer to train the discriminator on this training batch. This will help the discriminator get better and better at spotting fakes.
I am now halfway done. Now it’s time to create the misleading batch and train the GAN:
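A sketch of the second half of the loop body, under the same assumptions about the wrapper API:

```csharp
// step 4: create a misleading batch of fake images all labelled as 'real'
var misleadingBatch = Gan.GetMisleadingBatch(gan, batchSize, latentDimensions);

// step 5: train the complete GAN on the misleading batch, pushing the
// generator towards fakes that fool the discriminator
ganTrainer.TrainBatch(
    new[] {
        (gan.Arguments[0], misleadingBatch.featureBatch),
        (labelVar, misleadingBatch.labelBatch)
    }, true);
```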
The Gan.GetMisleadingBatch method sets up a misleading batch. This is a training batch of real and fake images with every image labelled incorrectly.
I then call TrainBatch on the GAN trainer to train the entire GAN on the misleading batch. This will help the generator create better fakes to fool the discriminator.
And that’s the entire training process.
Now let’s log my progress every 100 epochs:
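The loss values can be read back from the trainers with CNTK's PreviousMinibatchLossAverage method:

```csharp
// report the discriminator and GAN loss every 100 epochs
if (epoch % 100 == 0)
{
    Console.WriteLine(
        $"Epoch {epoch}: " +
        $"discriminator loss {discriminatorTrainer.PreviousMinibatchLossAverage()}, " +
        $"GAN loss {ganTrainer.PreviousMinibatchLossAverage()}");
}
```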
And I’m going to pull the generated image from the middle of the GAN every 1,000 epochs:
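Something like the following, assuming Gan.SaveImage takes the pixel data, the image dimensions, and a file path (the file name is my choice):

```csharp
// every 1,000 epochs: save the first generated image to disk
if (epoch % 1000 == 0)
{
    var path = Path.Combine(outputFolder, $"frog_{epoch}.png");
    Gan.SaveImage(generatedImages[0], 32, 32, path);

    // optionally also save a real training image for comparison:
    // Gan.SaveImage(trainingData[0], 32, 32, Path.Combine(outputFolder, "real.png"));
}
```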
This code pulls the first generated image from the generatedImages variable and calls Gan.SaveImage to save it to disk.
Optionally you can uncomment the second block of code to also save one of the training images to disk. This will let you compare the generated images to the training images, to see how well the GAN is doing at faking frog pictures.
I’m now ready to build the app. I’ll start by building the CNTKUtil project:
$ dotnet build -o bin/Debug/netcoreapp3.0 -p:Platform=x64
This will build the CNTKUtil project. Note how I’m specifying the x64 platform because the CNTK library requires a 64-bit build.
Now I’ll do the same in the GanDemo folder:
$ dotnet build -o bin/Debug/netcoreapp3.0 -p:Platform=x64
This will build the app. Note how I’m again specifying the x64 platform.
Now I can run my app:
$ dotnet run
The app will create the GAN, load the CIFAR-10 images, train the GAN on the images, and extract a generated image every 1,000 epochs.
Here’s what that looks like:
The discriminator has 790,913 trainable parameters, and the generator has over 6.2 million! This truly is a monster of a neural network.
Training a single epoch is reasonably snappy on my Surface Book 2 with a GeForce GTX 1060 GPU, but it still takes over 90 minutes to reach epoch 50,000. Training this GAN really takes a very long time.
Here is my first run. I aborted the training at epoch 50,000:
You can see that I was off to a good start. The generator mixes green and purple pixels and creates vertical black stripes in an effort to create a frog pattern.
Unfortunately the GAN got stuck at epoch 21,000 and kept generating the same image over and over: a purple background with fuzzy vertical black stripes.
This is typical when training GANs. The loss surface is absolutely littered with false minima and the training loop can very easily get stuck in a solution that doesn’t look much like a frog at all.
The only solution is to restart training and hope that the next run will produce better results.
Here’s my second attempt:
I aborted this run at epoch 10,000. You can see that this attempt got stuck right at the start. Every image after epoch 1,000 is just a featureless pink square.
The loss function for this GAN really is a minefield with false minima lurking everywhere.
Here’s an older run I did on my MacBook Pro using only the CPU:
This one only got up to epoch 3,700 (CPU computation is much slower than the GPU) but the patterns are starting to look very nice.
Anastasios Stamoulis, the original author of the code that CNTKUtil is based on, ran this code on a much more powerful computer and this is what his frogs looked like once he got near epoch 66,000:
That’s actually starting to look a lot like real frogs!
Feel free to grab my code and run it on your own computer. How realistic can you make your machine-generated frog images?