Generating (Mediocre) Pictures of Cars Using AI

Constantin Koch
Published in The Startup · Jul 1, 2020 · 15 min read
The result after 750 epochs of training.

Today we’ll look at AI-generated pictures of cars. To do something like this, we need so-called ‘generative adversarial networks’, or GANs for short. GANs can create realistic pictures of things (a good example is https://thispersondoesnotexist.com), can be used to make chatbots more realistic, and much more. For this project I used Python with PyTorch in a Jupyter Notebook, which you can find here.

Background

This AI project is my course project for Deep Learning with PyTorch: Zero to GANs, which is still available on YouTube. This six-week course covered many machine learning topics, from tensors and linear regression to convolutional neural networks and GANs.

On that note, a few words about GANs. Unlike many other ML techniques, GANs need two different neural networks: a generator and a discriminator. As you might have guessed, the generator generates the images and the discriminator discriminates between real and generated images. The two neural networks compete: the discriminator always wants to label generated images as fake, while the generator wants to fool the discriminator into thinking that the generated images are real. At the start of training the discriminator learns what real images look like, while the generator starts out producing noisy pictures. Over time the generator learns which properties of images can trick the discriminator.

The first generated picture: just noise.

The Dataset

First I tried the BMW10 dataset, but with only 512 images there wasn’t enough data to generate pictures in which cars could even be guessed at, so I switched to the full Stanford Cars dataset with 8,144 images, which is enough to at least create mediocre pictures of cars. For training, the pictures were resized to 64x64 pixels.
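The loading code itself isn’t the focus of this post (you can find it in the notebook), but a typical version of the resizing step looks roughly like this (the center crop and the normalization values are the usual choices, not necessarily the exact ones I used):

```
import torchvision.transforms as T

# resize/crop every picture to 64x64 and normalize pixel values to [-1, 1]
transform = T.Compose([
    T.Resize(64),
    T.CenterCrop(64),
    T.ToTensor(),
    T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
```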

Some pictures from the dataset.

The Code

In this part I will mostly focus on the code relevant to the GAN, so I’ll start after loading and transforming the dataset and moving it to the GPU. If you would like to see the whole code, you can find it here. There you can also find the final weights. If you want to know more about how GANs work in general, you can find some information in my notebook, or you can watch this video.

First we’ll look at some of the hyperparameters I defined at the beginning of the code:
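Roughly sketched (latent_size and batch_size are the values used later in this post; the 0.5 normalization stats are an assumption that pairs with the tanh output of the generator):

```
image_size = 64
batch_size = 128
latent_size = 128
# per-channel mean and standard deviation, used to normalize images to [-1, 1]
stats = (0.5, 0.5, 0.5), (0.5, 0.5, 0.5)
```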

The stats are used to normalize the pictures, and latent_size is the number of channels of the generator’s input tensor; it more or less represents the features of the image that is generated from it.

Next we’ll look at the discriminator:
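A sketch of the discriminator as described below (the exact code is in the notebook; the kernel sizes, strides and paddings here are the standard DCGAN choices):

```
import torch.nn as nn

discriminator = nn.Sequential(
    # input: 3 x 64 x 64
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.2, inplace=True),
    # 64 x 32 x 32

    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2, inplace=True),
    # 128 x 16 x 16

    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2, inplace=True),
    # 256 x 8 x 8

    nn.Conv2d(256, 512, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(512),
    nn.LeakyReLU(0.2, inplace=True),
    # 512 x 4 x 4

    nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=0, bias=False),
    # 1 x 1 x 1

    nn.Flatten(),
    nn.Sigmoid())
```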

The discriminator takes a 3x64x64 tensor as input. The 3 stands for the color channels and the 64x64 for the image size.

After that we use a convolutional layer to get to a new size of 64x32x32. Then we apply batch normalization and use leaky ReLU as an activation function. While normal ReLU maps every negative value to zero, leaky ReLU maps negative values to values closer to zero (in this case 0.2 times the value). We repeat this three more times, always doubling the number of channels while halving the spatial dimensions. After that we have a 512x4x4 tensor, which we reduce to a 1x1x1 tensor. Then we flatten the tensor so only one dimension is left and use the sigmoid function to get a value between 0 and 1. In our case 1 stands for a real image and 0 for a generated one.

If you don’t get much (or even any) of this, don’t worry. Machine learning is a complex topic, and the architecture of neural networks is one of the hardest things to understand, especially for more complex networks like these. If you want to learn more about neural networks, their architecture and machine learning in general, take a look at this free course or just google a bit; there are many excellent sources on the internet.

Next we move the discriminator to the GPU (if available) and load existing weights, if there are any:
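In my notebook this uses small helper functions from the course; sketched here with plain PyTorch (the checkpoint filename is just a placeholder):

```
import os
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
discriminator = discriminator.to(device)

# resume from an earlier training run if a checkpoint exists
if os.path.exists('discriminator.pth'):
    discriminator.load_state_dict(torch.load('discriminator.pth', map_location=device))
```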

The next step is the generator:
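A sketch matching the description below (again with the standard DCGAN kernel sizes and strides):

```
generator = nn.Sequential(
    # input: latent_size x 1 x 1
    nn.ConvTranspose2d(latent_size, 512, kernel_size=4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(512),
    nn.ReLU(True),
    # 512 x 4 x 4

    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(True),
    # 256 x 8 x 8

    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(True),
    # 128 x 16 x 16

    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(True),
    # 64 x 32 x 32

    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1, bias=False),
    nn.Tanh()
    # output: 3 x 64 x 64
)
```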

The generator is more or less the discriminator in reverse. We start with a 128x1x1 tensor (128 is our latent size) and use a transposed convolutional layer to get to 512x4x4. Then we do batch normalization and use ReLU (this time the normal ReLU, so any value less than zero becomes zero). We repeat this three more times, halving the number of channels and doubling the spatial dimensions, which leaves us with a 64x32x32 tensor. Another transposed convolutional layer resizes our tensor to 3x64x64, the image size we also have in the dataset. After that we use the hyperbolic tangent function (tanh), which maps any value to a value between -1 and 1. This is the interval used to represent a color (similar to 0–255).

Then we load the generator in the same way we did with the discriminator:

If you generated images with the untrained generator, they would just be noise:
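For example (assuming the hyperparameters and helpers from above):

```
xb = torch.randn(batch_size, latent_size, 1, 1, device=device)  # a batch of random latent tensors
fake_images = generator(xb)  # with untrained weights these are just noise
```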

Let’s look at the training now, starting with the discriminator:
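Sketched roughly (the exact code is in the notebook; device, batch_size and latent_size are reused from above):

```
import torch.nn.functional as F

def train_discriminator(real_images, opt_d):
    # clear discriminator gradients
    opt_d.zero_grad()

    # pass real images through the discriminator
    real_preds = discriminator(real_images)
    real_targets = torch.ones(real_images.size(0), 1, device=device)
    real_loss = F.binary_cross_entropy(real_preds, real_targets)
    real_score = torch.mean(real_preds).item()

    # generate a batch of fake images
    latent = torch.randn(batch_size, latent_size, 1, 1, device=device)
    fake_images = generator(latent)

    # pass fake images through the discriminator
    fake_preds = discriminator(fake_images)
    fake_targets = torch.zeros(fake_images.size(0), 1, device=device)
    fake_loss = F.binary_cross_entropy(fake_preds, fake_targets)
    fake_score = torch.mean(fake_preds).item()

    # update discriminator weights based on the combined loss
    loss = real_loss + fake_loss
    loss.backward()
    opt_d.step()

    return loss.item(), real_score, fake_score
```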

Let’s look at it line by line:

The function gets passed a batch of real images and an optimizer. Optimizers are really complex, but in short they try to minimize the loss. If you don’t know what the loss is, don’t worry, we’ll talk about it soon; for now you can think of the loss as a metric for the success of the network: a low loss means the network is doing well, a high loss means it’s doing badly.

The first thing we do inside the function is clear the gradients of the optimizer. If we want to make our network better, we need to calculate the gradients (i.e. the derivatives) of the loss (the corresponding function is loss.backward(); we will see it soon). If we didn’t reset the gradients, they would just accumulate across iterations and become useless.

The next thing we do is pass the real images through the discriminator. The result is a value between 0 and 1 (or, to be more precise, 128 such values, because the discriminator gives a prediction for every image in the batch). As a reminder: if the discriminator predicts a value close to zero, it thinks the image is fake; close to one, it thinks the image is real.

Next we create the targets for the real images, which is just a tensor filled with ones. We need these targets for the loss function we’ll look at now:

There are many different loss functions in machine learning; I used binary cross-entropy for my project. But let’s take a step back first and look at what a loss function really is. I will keep this as short and simple as possible:

A loss function compares the targets to the predictions of a neural network. The targets are also called the ground truth, because they determine what the network learns; the goal of the network is to match the ground truth (so that, with high probability, it will also perform well on data whose ground truth it has never seen). Imagine a dataset containing pictures of 10 different animals. The targets (also called labels) would be which animal can be seen in the picture. Because a computer can’t calculate with animal names, every label gets a number: a bird could be 0, a dog 1, a cat 2 and so on. (The ground truth can also be wrong. For example, the label ‘dog’ could be attached to pictures of whales, and the neural network would then learn that a whale is a dog. I know this example is quite absurd, but the point is: without good data and labels you will get a bad neural network, a problem I encountered too; more about that in the problems section.) The loss function now looks at the target, let’s say it’s a bird (0), and the prediction, also 0. The target and the prediction are the same, so our network did everything right and the loss stays low. In the next example there is again a bird (0) in the picture, but the network predicts a 2 (cat). The target and the prediction differ because the network did something wrong, so the loss gets bigger; how much bigger depends on the loss function you use.

So let’s look at the loss function I used: binary cross-entropy. I will simplify this a lot: think of it as a special kind of binary addition, where the first number represents the label and the second the prediction:

1 + 1 = 0; 1 + 0 = 1; 0 + 1 = 1; 0 + 0 = 0

As you can see, the loss is zero if the label and the prediction are the same and one if they differ, so the loss gets bigger when the network predicts the wrong label. (If you wonder why 1 + 1 = 0, just google ‘bitwise XOR’.)
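If you want the non-simplified version: binary cross-entropy actually works on probabilities rather than hard 0/1 predictions, so a confidently wrong prediction is punished much harder than a slightly unsure one. A tiny illustrative example:

```
import torch
import torch.nn.functional as F

preds = torch.tensor([0.9, 0.8, 0.1])    # discriminator outputs (probabilities of 'real')
targets = torch.tensor([1.0, 1.0, 1.0])  # the images really are real

# BCE = mean of -(y * log(p) + (1 - y) * log(1 - p));
# the confidently wrong 0.1 prediction contributes by far the largest share
print(F.binary_cross_entropy(preds, targets))
```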

Another metric for measuring the performance of a neural network is the score. If the score is 1 (or 100%), every prediction was correct. In our case we can just take the average of all predictions as the score, because if the average is 1, every predicted value was 1 and therefore correct (a small reminder that we are still only looking at the real images). We’ll ignore the .item() here.

But enough about the real images, let’s look at some generated images now:

We create a latent tensor, so more or less a tensor representing features, and pass it to the generator, which gives us a batch of generated images.

Now we do the same steps we did before with the real images, only this time all targets are set to 0 because all the images are generated.

Then we add up the loss of the real and generated images to get the total loss.

Next we see the loss.backward() function I mentioned above. This function computes the derivatives of the loss with respect to the parameters of the neural network (and sometimes other things as well). It works together with the opt_d.step() call after it: the .step() function decides how to update the network based on the gradients.

At the end we return some metrics we want to keep track of.

We’ll now look at the generator training and you can find many similarities to the discriminator training:
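Sketched in the same style as the discriminator training:

```
def train_generator(opt_g):
    # clear generator gradients
    opt_g.zero_grad()

    # generate a batch of fake images
    latent = torch.randn(batch_size, latent_size, 1, 1, device=device)
    fake_images = generator(latent)

    # try to fool the discriminator: the targets are all ones
    preds = discriminator(fake_images)
    targets = torch.ones(batch_size, 1, device=device)
    loss = F.binary_cross_entropy(preds, targets)

    # update generator weights
    loss.backward()
    opt_g.step()

    return loss.item()
```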

Because we already looked at most of these lines and functions, I will go over this quickly. If you don’t get something, take another look at the explanations above.

The first differences are in the function signature: firstly, we get the optimizer for the generator this time; secondly, we don’t get any pictures from the dataset, which makes sense, because the generator should learn to generate pictures of cars, not look at them.

Then we generate a batch of fake images, pass it through the discriminator, define the targets, calculate the loss and optimize the network based on it. The only thing that is different this time is that all targets are set to one. But didn’t we say that generated images have the label zero? Yes, but no. As I mentioned earlier, the discriminator and the generator are competing: while the discriminator learns to tell real and generated images apart, the generator tries to fool the discriminator into thinking that its generated images are real; the byproduct of that is generated images that look closer and closer to the real images from the dataset. So why are the labels all ones now? The generator is successful when the discriminator classifies many generated images as real (i.e. predicts 1), and in that case the generator’s loss should stay small, so the target needs to be 1.

Now we’ll take a look at the full training loop:
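Roughly sketched (train_dl, save_samples and fixed_latent are helpers defined elsewhere in the notebook; the Adam betas of 0.5 and 0.999 are the usual DCGAN values):

```
def fit(epochs, lr_g, lr_d, start_idx=1):
    torch.cuda.empty_cache()

    # metrics we want to keep track of
    losses_g, losses_d, real_scores, fake_scores = [], [], [], []

    # one optimizer per network
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr_d, betas=(0.5, 0.999))
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr_g, betas=(0.5, 0.999))

    for epoch in range(epochs):
        for real_images, _ in train_dl:
            # train the discriminator on this batch, then the generator
            loss_d, real_score, fake_score = train_discriminator(real_images, opt_d)
            loss_g = train_generator(opt_g)

        # record and log losses and scores once per epoch
        losses_g.append(loss_g)
        losses_d.append(loss_d)
        real_scores.append(real_score)
        fake_scores.append(fake_score)
        print("Epoch [{}/{}], loss_g: {:.4f}, loss_d: {:.4f}, real_score: {:.4f}, fake_score: {:.4f}".format(
            epoch + 1, epochs, loss_g, loss_d, real_score, fake_score))

        # save a grid of generated images; start_idx only affects the filename
        save_samples(epoch + start_idx, fixed_latent, show=False)

    return losses_g, losses_d, real_scores, fake_scores
```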

This function gets the number of epochs (how many times it will go over the whole dataset), two learning rates (we’ll look at these shortly), one for the generator and one for the discriminator, and a start index, which only affects the filenames of the saved images.

Then we empty the CUDA (more or less just the GPU) cache and initialize the values we want to keep track of: the generator and discriminator losses and the scores of the real and generated images.

Next we create two optimizers, one for the discriminator and one for the generator, and pass them the parameters of the corresponding network, the corresponding learning rate and betas (which I will not cover in this blog post).

Next we get to two loops. The outer loop iterates over the epochs, the inner one over the data loader, taking one batch of images at a time. In the inner loop we call the two training functions we discussed above. Back in the outer loop we record and log the losses and scores and save an image with some generated samples from the epoch.

After both loops have finished, we return the collected losses and scores, which could be used for further analysis (which we won’t do; we’ll look at some of the generated images instead, because, although I’m a computer scientist, I find cars more interesting than graphs).

Now let’s look at some more hyperparameters:
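The cell defines the learning rates and the number of epochs per training run. The actual learning rate values are in the notebook; the 0.0002 below is only a stand-in (it’s the common DCGAN default):

```
lr_d = 0.0002   # stand-in value; see the notebook for the one actually used
lr_g = 0.0002
epochs = 250    # per training run; I ran three of them
```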

It’s time for a really short explanation of the learning rate: the learning rate indicates how fast the network learns. A high learning rate means big steps between updates, which could lead to the network jumping back and forth around its goal; a small learning rate means small steps, which could lead to the network improving really slowly if the starting point is far away from the goal.

You might be wondering why there are two different learning rates with the same value. I tried using different learning rates for the discriminator and the generator, but these were the learning rates I got the best results with.

I trained the networks three times for 250 epochs each, so in total 750 epochs.

I will spare you the code cell in which I called the fit function, because the output, with 500 lines of information about the epochs, isn’t really pretty to look at (if you want to see it nevertheless, you can find it in the notebook). The parameters I passed were the hyperparameters defined in the last code cell.
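For reference, the call was essentially just something like:

```
history = fit(epochs, lr_g, lr_d)  # note: no start_idx passed here, more on that below
```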

Let’s instead look at a generated picture:

If you look at the name of the picture you might think it was created in the first epoch, but that’s obviously not true. It’s the picture from the first epoch of the third training session, so epoch 501. (Yes, I successfully forgot about the start index parameter.)

Now that we more or less understand how the neural networks work, it’s time to look at some results!

Results

I trained the GAN for 750 epochs in total (3 times 250 epochs).

After 250 epochs some elements of cars were identifiable, but the result was really poor:

The result after 250 epochs.

After the second 250 epochs, basic elements of cars were clearly visible, e.g. tires, bodywork, windows, radiator grilles and headlights:

The result after 500 epochs.

As you can see, most images now clearly show cars, while some are hardly identifiable (e.g. (3, 1); (4, 2); (7, 6) (notation: row, column)).

A thing I had a lot of fun doing while the network was training was picking out a picture and interpreting the generated cars in it: which manufacturer could this car be from? Is there a similar-looking real car?

Some of the cars I interpreted in the picture after 500 epochs: (1, 1) kind of looks like an old-school Rolls-Royce, mostly because of the really long-looking engine hood; (4, 5) reminds me of the old Ford GT, as does the car to its left, although that one also looks like it could be a Lamborghini; (7, 7) looks like a Mercedes SLS to me.

Let’s look at the final result now:

The result after 750 epochs.

Now most of the images are quite obviously cars and in nearly every image many basic elements of cars are identifiable.

Some notable things about the images:

  1. You might have noticed the fire-like structures in many pictures (e.g. (1, 1); (7, 4); (5, 7)). Similar structures in different colors appeared many times throughout the training. This is basically the generator trying things out to fool the discriminator, changing some elements every time in order to get a better score.
  2. Something interesting, which you might have noticed if you’ve worked with GANs before, is that the cars stand in different positions: many of them diagonally to the right (8, 7) or the left (1, 7), some frontal (3, 1), some totally diagonal (7, 3) and some even backwards (6, 4). I will explain why this is interesting, but also a problem, below in the problems section.
  3. Also interesting with regard to point two is that each position has its own problems. The diagonally standing cars often have problems with the radiator grille (e.g. (8, 1) and (8, 2)), the frontal standing cars have problems with the tires (e.g. (3, 1)), and the totally diagonal standing cars are often shorter than they should be (e.g. (7, 2) and (5, 4)).
  4. You can also see that different types of cars are generated: SUVs, coupes, convertibles, compacts, sports cars, etc., American and European cars, and both modern and older cars. This is because the dataset contains all these different types of cars.

This time many of the cars look American, and as a European I’m not familiar enough with them to identify them from the images. But if you see some cars that look similar to a car you know, I would love to read your interpretations in the comments.

The first 250 epochs of training in a video.

A quite interesting thing I found while looking through the images was a car that I interpreted as three different cars within less than 20 epochs (it’s the car at (2, 4)):

Epoch 533: Reminds me of an old Ford Mustang
Epoch 538: Could be a Mercedes AMG GT or a Jaguar
Epoch 550: Looks like a BMW, maybe a 5 Series, to me

As you can see, the results aren’t perfect, and we’ll look at a few reasons for that in the next section, but for my first bigger AI project, I’m happy with them.

Problems

Probably the biggest problem is the dataset, or to be more specific, the way the cars are positioned in the dataset. GANs work best on uniform data; that’s why they work really well on, for example, the Anime Faces dataset, in which every image has the same key attributes: eyes, a mouth, hair and a nose.

Cars also have key attributes, but they change depending on the direction the car is facing. For cars seen from the front, the key attributes would be the radiator grille and the headlights; from the side, the tires are the key attributes; and the back is best identified by the taillights. Because our dataset contains cars photographed from different sides (which is way better for object detection or classification than uniform pictures, but for our project it’s a disadvantage), different key attributes get mixed together, which causes many of the errors. On the other hand, if you removed all pictures from the dataset except, for example, frontal ones, there might not be enough data left, but it’s definitely worth a try.

Another problem is the resolution. The neural networks work with 64x64 pixel images. This makes training much faster than with higher resolutions, but a lot of the detail from the original images is lost.

And last but not least, the neural networks themselves might be a problem. This is my first bigger project using machine learning and PyTorch, and although I learned a lot during the course, I don’t yet know how to optimize the architecture of a neural network or how to choose the right hyperparameters. I mostly used trial and error, and this was the best result I could achieve.

Some work on at least one of the problems could lead to an even better result.

Sources

You can find my sources at the end of my notebook.

Next steps

Some further steps you could take if you’re interested in the topic, and that I would like to take if I find some spare time: working on the problems mentioned above to get a better result, taking the existing weights and retraining them on the BMW10 dataset to generate only BMWs, or doing some (AI-based) post-processing to make the resulting images look better.

If you find other steps that could be taken from the current state, let me know in the comments.

Thank you for reading this article! I would love to hear some feedback in the comments. If you have any questions you can leave a comment or hit me up on Twitter.
