Doodle Drawing

Cole Kaplan · 8 min read · May 6, 2024

Created by Cole Kaplan, Andrew Dean, and Michael Stewart

We were inspired by the Google QuickDraw game and the incredible dataset of hand-drawn images that it created. We set out to create an AI that could not only categorize images like the game does but also generate new images that look hand-drawn. We played around with many model architectures, as you will see, but settled on a convolutional neural network (CNN) for the classifier and a generative adversarial network (GAN) for the generation.

Data Pre-Processing

We wanted to use the stroke data, but as 256x256 bitmaps, so we had to convert all of the files, render them at the correct size, and store them as .npz files.
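Here is a rough sketch of that conversion step, assuming the simplified QuickDraw .ndjson format (each drawing is a list of strokes, and each stroke is a pair of x and y coordinate lists already scaled to the 0-255 canvas). The file names and the per-category limit below are placeholders, not our exact script.

```python
import json
import numpy as np
from PIL import Image, ImageDraw

def strokes_to_bitmap(drawing, size=256):
    """Rasterize one drawing (a list of [xs, ys] strokes) to a size x size array."""
    img = Image.new("L", (size, size), 0)                # black canvas
    draw = ImageDraw.Draw(img)
    for xs, ys in drawing:
        draw.line(list(zip(xs, ys)), fill=255, width=2)  # white strokes
    return np.array(img, dtype=np.uint8)

def convert_category(ndjson_path, out_path, limit=4000):
    """Convert up to `limit` drawings from one category file into a .npz archive."""
    bitmaps = []
    with open(ndjson_path) as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            bitmaps.append(strokes_to_bitmap(json.loads(line)["drawing"]))
    np.savez_compressed(out_path, images=np.stack(bitmaps))

# e.g. convert_category("full_simplified_apple.ndjson", "apple_256.npz")
```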

Google Cloud

This project would not have been possible without Google Cloud servers. Using the free trial, we set up Cloud Storage buckets for our data and Vertex AI workbenches with 16 vCPUs and 104 GB of RAM, and even then some of our models struggled or took a full day to train. We used over $450 of the free trial credit and would have spent more, but the trial limited us to CPUs only, and only a certain number of them per account.

Classification with the CNN

Due to computation power restrictions, we had to settle on 15 categories: angel, apple, broom, candle, car, diamond, fish, flower, lightning, rainbow, smiley face, star, stop sign, sword, and tree. As you can see, we tried to pick some categories that look very different from each other and some that look similar, to see how our models fare on both the highly similar and the highly distinct categories.

Model 1:

Went horribly wrong due to logical errors. Thrown out immediately.

Model 2:

Next, we trained model 2. This was trained on 4,000 instances of each category that Google’s model was able to recognize (their model has an accuracy of 80%). During preprocessing, we vertically centered the images on the canvas, because we realized we could center all of the images drawn in our UI the same way before having our model predict on them.
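The centering itself is simple; a minimal sketch, assuming each bitmap has nonzero pixels wherever strokes were drawn, looks something like this:

```python
import numpy as np

def center_vertically(img):
    """Shift the drawing so that it is vertically centered on the canvas."""
    rows = np.where(img.any(axis=1))[0]        # row indices that contain stroke pixels
    if rows.size == 0:
        return img                             # blank canvas, nothing to do
    top, bottom = rows[0], rows[-1]
    shift = (img.shape[0] - (bottom - top + 1)) // 2 - top
    return np.roll(img, shift, axis=0)         # rows that wrap around are blank anyway
```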

The model architecture for model 2 was very straightforward. We drew inspiration from architectures used for classifying the MNIST digit data set, because we figured the data was formatted similarly (even though our images were much bigger, at 256x256 compared to 28x28 pixels). A code sketch follows the layer list below.

Input: 256x256

Layer 1: 64x64, kernel 3x3, with kernel regularization = .01

Dropout 25%

Layer 2: 32x32, kernel 5x5, with kernel regularization = .01

Dropout 25%

Output: Flatten to 15 neurons with softmax
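Here is a hedged Keras sketch of that architecture. We read “64x64” and “32x32” as the spatial size of each feature map, so pooling handles the downsampling from 256; the filter counts, activations, and loss below are our own assumptions and were not pinned down above.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(256, 256, 1)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu",
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.MaxPooling2D(pool_size=(4, 4)),   # 256x256 -> 64x64
    layers.Dropout(0.25),
    layers.Conv2D(64, (5, 5), padding="same", activation="relu",
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.MaxPooling2D(pool_size=(2, 2)),   # 64x64 -> 32x32
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(15, activation="softmax"),  # one neuron per category
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```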

We trained it through 5 epochs with a batch size of 128 and were surprised to see that the accuracy of the validation data was highest for the very first epoch. After the first epoch, the training accuracy continued to increase while the validation accuracy stayed the same or decreased. This is indicative of overfitting:

Model 3:

In order to combat overfitting, we then increased the dropout rate to 80%. We also found a more generalized version of our data set to train on: 5,000 images per category with higher variation than the previous model’s training data (meaning that the Google model was able to recognize 90% of our data, instead of 100% as was the case for model 2’s training data). During preprocessing, we centered the images both vertically and horizontally. We trained it through 8 epochs using a batch size of 128. Here are the results:

Note that even with the more generalized data, which potentially contained worse drawings that would be harder to classify, model 3 reached around 85% accuracy on validation data, while model 2 performed worse at 79% on its validation data. However, as you can see, the model started overfitting after epoch 3, which is where it performed best. Therefore, we used the model from epoch 3 in our UI.
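One convenient way to end up with the epoch-3 weights automatically is a checkpoint callback that only saves when validation accuracy improves. This is a sketch rather than our exact training code; the file name and data variables are placeholders.

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Keep only the epoch with the best validation accuracy (epoch 3 in our case).
checkpoint = ModelCheckpoint("best_doodle_cnn.keras",
                             monitor="val_accuracy",
                             save_best_only=True)
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=8, batch_size=128,
                    callbacks=[checkpoint])
```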

Let’s analyze the epoch 3 model’s performance a bit more. Here is the model accuracy per category and confusion matrix for the epoch 3 model. You can see the accuracies for each category along the diagonal of the confusion matrix as well.
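If you want to reproduce these plots, the per-category accuracies and the confusion matrix can be computed directly from the validation predictions; here is a short sketch, with x_val and y_val as placeholder validation arrays:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

pred = np.argmax(model.predict(x_val), axis=1)         # predicted category indices
cm = confusion_matrix(y_val, pred, normalize="true")   # each row sums to 1
per_category_accuracy = np.diag(cm)                    # the diagonal shown in the plot
```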

Observe from the confusion matrix that this model still performs below 80% on flower, lightning, stop sign, and sword. We believe it tends to confuse these because they all have long stems and squiggly masses at the top. Look at the four examples below:

Next Steps:

One of the next things we would like to try is downsampling our data to 128x128 or 64x64 before putting it into the model. Our drawing data has only black or white pixels, so we are also considering applying a Gaussian blur to the data before putting it through the model. The images would then also have grey pixels, and we believe a model trained on this kind of data would generalize better, because it would learn to be more forgiving when a new input has pixels colored in that were not colored in any of the training data.
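A sketch of that preprocessing idea with OpenCV (the kernel size and sigma here are illustrative choices, not tuned values):

```python
import cv2

def blur_and_downsample(img, size=64):
    """Downsample a 256x256 drawing and soften its edges with a Gaussian blur."""
    small = cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)
    return cv2.GaussianBlur(small, (5, 5), sigmaX=1.0)
```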

The GAN

Our GAN architecture consists of the Generator, the Discriminator, and the combined GAN, trained on 50,000 vertically centered data points per category. A code sketch follows each part below:

Generator:

Input Layer: A random noise vector of size 100

Dense Layer (128 units)

LeakyReLU Activation with an alpha (slope) of 0.01

Output Layer (Dense layer) with 256*256*1 neurons with a hyperbolic tangent (tanh) activation function to ensure [-1,1] output pixels

Reshape Layer: The output of the last dense layer is reshaped to match the desired image shape

GaussianNoise with a standard deviation of 0.2.

BatchNormalization layer to stabilize and accelerate the training process
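Here is a Keras sketch of the generator, following the layer order exactly as listed above (with the Gaussian noise and batch normalization applied after the reshaped tanh output, as described):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_generator(noise_dim=100):
    return keras.Sequential([
        layers.Input(shape=(noise_dim,)),                 # random noise vector of size 100
        layers.Dense(128),
        layers.LeakyReLU(alpha=0.01),
        layers.Dense(256 * 256 * 1, activation="tanh"),   # pixels in [-1, 1]
        layers.Reshape((256, 256, 1)),                    # back to the image shape
        layers.GaussianNoise(0.2),
        layers.BatchNormalization(),
    ], name="generator")
```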

Discriminator:

Input Layer: The input layer of the discriminator receives images, either real or generated by the generator.

Flatten Layer: The image is flattened to a 1D vector before being passed to the dense layers.

Dense Layer (128 units): The flattened image vector is passed through a dense layer with 128 neurons.

Leaky ReLU Activation: Similar to the generator, Leaky ReLU is applied after the dense layer.

Dense Layer (Output Layer): The output layer is a single neuron with a sigmoid activation function. This neuron outputs a value between 0 and 1, representing the probability that the input image is real (1) or fake (0).
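And a matching sketch of the discriminator:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_discriminator():
    return keras.Sequential([
        layers.Input(shape=(256, 256, 1)),       # real or generated image
        layers.Flatten(),                        # flatten to a 1D vector
        layers.Dense(128),
        layers.LeakyReLU(alpha=0.01),
        layers.Dense(1, activation="sigmoid"),   # probability that the image is real
    ], name="discriminator")
```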

GAN:

The GAN combines the generator and discriminator into a single model.

Generator: The generator model is added as the first component of the GAN. It takes random noise as input and generates fake images.

Discriminator: The discriminator model is added next. It takes the generated images from the generator (fake images) and real images from the dataset as input and outputs a probability for each indicating its authenticity.

Loss function: Binary cross-entropy, optimized using the Adam optimizer.
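Chaining the two together looks roughly like this, continuing the sketches above; the standard Keras trick of freezing the discriminator inside the combined model is our addition and was not spelled out above.

```python
from tensorflow import keras

generator = build_generator()
discriminator = build_discriminator()
discriminator.compile(optimizer=keras.optimizers.Adam(),
                      loss="binary_crossentropy", metrics=["accuracy"])

# Freeze the discriminator inside the combined model so that training the GAN
# only updates the generator; the discriminator is trained separately on
# batches of real and generated images.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator], name="gan")
gan.compile(optimizer=keras.optimizers.Adam(), loss="binary_crossentropy")
```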

GAN: Successes/Failures

While the model described above is our final model, we first trained a model with less data per category (5,000) and no Gaussian noise or batch normalization in its architecture. The photos on the left illustrate our final model at the same epoch as the less advanced model (shown on the right).

Next, we sought to determine the best results by epoch for generating an Angel; the results are shown below. For the first few thousand epochs, we saw continuous improvements but then our model began to develop harsh lines and lose quality. We determined that epoch 3000 yielded the best results.

[Generated angel samples at epochs 1,000, 3,000, 5,000, 7,000, and 10,000]

The Model Graveyard

Two models that we hoped would work but didn’t were the variational autoencoder (VAE) and the conditional GAN (cGAN).

With the variational autoencoder, we hoped to create a latent space that would let us both generate new images and classify images, and maybe even map an incomplete drawing into the latent space and use its proximity to a category to complete the drawing. However, the VAE took too long to train: it took hours on just 28x28 data, which came out blurry and hard to make out, when we really wanted to use 256x256 data. We might be able to come back to this at some point, since our VAE efforts came before we discovered Google Cloud.

The conditional GAN was supposed to encode the category into the noise vector of a GAN so that we could control what image it produced. This one failed a few times due to our lack of skill with data manipulation. After a lot of reshaping and reconstruction we finally got the models to run, but by then we suspect the data had lost its underlying structure, because the generated images for every category come out pure black even with a long training time.
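For reference, the usual way a cGAN conditions the generator is to embed the category label and concatenate it with the noise vector. Here is a minimal sketch of that idea (not our failed implementation, which differed in how it reshaped the data):

```python
from tensorflow import keras
from tensorflow.keras import layers

noise = layers.Input(shape=(100,))
label = layers.Input(shape=(1,), dtype="int32")
label_vec = layers.Flatten()(layers.Embedding(15, 50)(label))    # 15 categories, 50-dim embedding
conditioned = layers.Concatenate()([noise, label_vec])           # noise plus category information
x = layers.LeakyReLU(alpha=0.01)(layers.Dense(128)(conditioned))
out = layers.Reshape((256, 256, 1))(layers.Dense(256 * 256, activation="tanh")(x))
cond_generator = keras.Model([noise, label], out, name="conditional_generator")
```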

VAE output:

The VAE was computationally intensive, so it used blurry 28x28 images that were hard to make out.

cGAN output:

We were forced to reshape the data to fit a model architecture that we didn’t fully understand, and it must have lost integrity in the process.

The Game

We made a game out of our classifier that allows the player to draw a category while the AI guesses what they are drawing. When the model guesses your drawing, you can move on to the next category.

Check out our demo:

Access Our Data

Thank you so much for taking the time to read about our research!

If you are interested in continuing this project or just exploring what we did: we wanted to share a GitHub repository, but the trained models were too large to host in one.

Sorry :(

You can reach out to us at:

colegkaplan@gmail.com

dean@middlebury.edu

mstewar1@hamilton.edu
