Numeric Digits, Handwritten by a Computer

Christopher Messier
7 min read · Jan 2, 2018


Getting from this:

We start with random noise...

To this:

… and end up with “handwritten” digits.

There is a lot of discussion surrounding artificial intelligence these days. Not a day goes by without news of a firm disrupting some industry through the application of AI and machine learning. There is certainly good reason for this: by utilizing all the available data, a computer can recognize patterns that would be nearly impossible for a person to find on their own. There is little doubt that computers can learn to recognize what they see, but what about a computer’s ability to create something new?

The use of deep learning for image generation has fascinated me since reading this paper. Getting the chance to work on something larger for my capstone project with General Assembly’s Data Science Immersive program, I wanted to build an image generator. During the program I had developed an interest in the application of deep learning to machine vision, so it seemed like building an image generator would be a natural extension, but it wasn’t until I had a conversation with one of the instructors that things started coming together. He brought Ian Goodfellow’s work with Generative Adversarial Networks to my attention, and it was exactly what I needed to get things started.

Generative Adversarial Networks (GANs) are a type of deep learning model that pits two neural networks against one another to learn and generate images. Each network performs a specific task: one learns to identify the objects in the images, while the other starts out generating noise, similar to the white noise on a TV. These two networks are called the discriminator and the generator, respectively. The way they end up producing an image is that, as the discriminator learns to classify the images, the same process it uses to “train” (backpropagation) also updates the generator, so the generator begins to “shape” the noise it outputs to match the learned “shape” of the object. As the model trains, it produces images that are closer and closer to the images that were used to train it.
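To make that tug-of-war concrete, it boils down to two cross-entropy losses pulling in opposite directions. The sketch below is purely illustrative and is not the project’s code; generator_fn, discriminator_fn, and real_images are hypothetical stand-ins for the networks and data defined later on.

import tensorflow as tf

# Illustrative sketch only: generator_fn, discriminator_fn and real_images
# are stand-ins for the actual networks and data defined later on.
noise = tf.random_normal([32, 64])
fake_images = generator_fn(noise)              # the generator shapes noise into images
real_logits = discriminator_fn(real_images)    # the discriminator scores real images...
fake_logits = discriminator_fn(fake_images)    # ...and generated ones

# Discriminator loss: call real images real (label 1) and fakes fake (label 0).
d_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(real_logits), logits=real_logits) +
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.zeros_like(fake_logits), logits=fake_logits))

# Generator loss: make the discriminator call the fakes real (label 1).
# Backpropagating this loss is what gradually "shapes" the noise into digits.
g_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(fake_logits), logits=fake_logits))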

It takes some time to learn how to write…

My original goals for this project were quite lofty. I had wanted to build a complete web app for image generation, where a user could request a generated image of some object through a web interface and the software would return the generated image. It turned out that this was a bit too much to complete in the allotted time frame. While I remained committed to my original goal, I needed to focus on something that could more realistically be finished by the due date, with the hope that it would then serve as the foundation as I moved on to create the complete application. What I finally settled on was generating “handwritten” numerical digits based on the MNIST data set.

For those who are unfamiliar, the MNIST dataset is a collection of small black-and-white images of handwritten numerical digits. These handwritten digits serve as one of the seminal datasets for teaching computer vision. The small image size and intuitive labeling make it a perfect introduction to image classification tasks, as it does not require significant computational resources to analyze the images. Combined with the fact that all the images were produced by real people, and thus contain the underlying variation needed to create new images, this makes the MNIST dataset an ideal training set for a proof-of-concept of the final product. There’s also another reason that the MNIST data was the perfect choice.

Recently, Google released an open-source repository for GANs, TFGAN. With it came a model that was pre-trained on the MNIST data. As this was a project that had to be completed on a rather tight deadline, having a pre-trained discriminator made the task significantly easier and left more time to debug the model when something inevitably went wrong.

To construct the model, I relied heavily on the resources provided with TFGAN to serve as a guide as I completed the project. Still, I wanted to make sure this was something that could be used in a larger software environment, so I made sure the end result could easily be adapted to serve as the foundation for the final version.

The completed code can be found here, but I also want to provide a brief walk-through of the various components. For a demonstration, I recommend using the notebook that goes along with the project, here.

The project is spread across four Python files: cgan.py, tf_nets.py, evaluate.py, and visualizer.py. cgan.py is the main program, the file that runs the model. evaluate.py is where the performance of the model is evaluated, and visualizer.py is used to render the images the model generates, as well as to visualize the model's performance. We'll start with tf_nets.py, though, where the discriminator and generator networks are defined.

Within tf_nets.py there are two functions, aptly named generator() and discriminator(), which are used to define the TensorFlow graphs prior to running the model. Simply calling these functions in cgan.py is all that is needed to build the graph structure of the model. (Those unfamiliar with TensorFlow should start here.)
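As a rough illustration of how those two functions come together (a sketch in the spirit of the TFGAN tutorials, not a verbatim excerpt from cgan.py), the networks can be wired into a single GAN model with TFGAN's tfgan.gan_model helper; real_images, noise, and one_hot_labels are assumed placeholders for the MNIST batch and its labels:

# Sketch of the wiring in cgan.py; real_images, noise and one_hot_labels are
# assumed to come from the MNIST input pipeline.
gan_model = tfgan.gan_model(
    generator_fn=generator,
    discriminator_fn=discriminator,
    real_data=real_images,
    generator_inputs=(noise, one_hot_labels))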

The generator network, the network that will be producing the images, takes noise and the corresponding image labels as inputs. The model being used is a particular type of GAN called a conditional GAN, where the model outputs a particular image given some input label. The architecture of this network should be familiar to anyone who has worked with convolutional networks before: it has two fully connected layers, which are then fed into two transposed convolutional layers. All of the hidden nodes use ReLU as their activation function, but the output layer uses a hyperbolic tangent instead.

def generator(inputs, weight_decay=2.5e-5):
    """
    Conditional generator used to produce MNIST images.

    :param inputs: A 2-tuple of Tensors (noise, one_hot_labels).
    :type inputs: tuple
    :param weight_decay: The value of the L2 weight decay.
    :type weight_decay: float
    :return: Generated image in the range [-1, 1].
    :rtype: Tensor
    """
    noise, one_hot_labels = inputs
    with slim.arg_scope([layers.fully_connected, layers.conv2d_transpose],
                        activation_fn=tf.nn.relu,
                        normalizer_fn=layers.batch_norm,
                        weights_regularizer=layers.l2_regularizer(weight_decay)):
        net = layers.fully_connected(noise, 1024)
        net = tfgan.features.condition_tensor_from_onehot(net, one_hot_labels)
        net = layers.fully_connected(net, 7 * 7 * 128)
        net = tf.reshape(net, [-1, 7, 7, 128])
        net = layers.conv2d_transpose(net, 64, [4, 4], stride=2)
        net = layers.conv2d_transpose(net, 32, [4, 4], stride=2)
        # Make sure the generator output is in the same range as the real
        # images, i.e. [-1, 1].
        net = layers.conv2d(net, 1, 4, normalizer_fn=None, activation_fn=tf.tanh)
        return net

The discriminator has a similar architecture to the generator network. As a reminder, the discriminator is the network used to classify the images that the generator outputs. It is a deep neural network with two convolutional layers but only one fully connected layer.

def discriminator(img, conditioning, weight_decay=2.5e-5):
    """
    Conditional discriminator network on MNIST digits.

    Args:
        img: Real or generated MNIST digits. Should be in the range [-1, 1].
        conditioning: A 2-tuple of Tensors representing (noise, one_hot_labels).
        weight_decay: The L2 weight decay.

    Returns:
        Logits for the probability that the image is real.
    """
    _, one_hot_labels = conditioning
    with slim.arg_scope([layers.conv2d, layers.fully_connected],
                        activation_fn=leaky_relu,
                        normalizer_fn=None,
                        weights_regularizer=layers.l2_regularizer(weight_decay),
                        biases_regularizer=layers.l2_regularizer(weight_decay)):
        net = layers.conv2d(img, 64, [4, 4], stride=2)
        net = layers.conv2d(net, 128, [4, 4], stride=2)
        net = layers.flatten(net)
        net = tfgan.features.condition_tensor_from_onehot(net, one_hot_labels)
        net = layers.fully_connected(net, 1024, normalizer_fn=layers.batch_norm)
        return layers.linear(net, 1)

These two networks form the core components of the model that is run in cgan.py. Before we turn our attention there, though, let's look at evaluate.py, where the performance of the model is measured.

The model is evaluated with the common cross-entropy loss function. There are two instances of this loss, however: one for each of the networks.

def gan_loss(gan_loss, name=None):
    """
    Evaluate GAN losses. Used to check that the graph is correct.

    :param gan_loss: A GANLoss tuple (generator_loss, discriminator_loss).
    :type gan_loss: GANLoss
    :param name: Optional. If present, prepended to the debug output.
    :type name: str
    :return: The generator and discriminator losses.
    :rtype: tuple
    """
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        with slim.queues.QueueRunners(sess):
            gen_loss_np = sess.run(gan_loss.generator_loss)
            dis_loss_np = sess.run(gan_loss.discriminator_loss)
    if name:
        print('%s generator loss: %f' % (name, gen_loss_np))
        print('%s discriminator loss: %f' % (name, dis_loss_np))
    else:
        print('Generator loss: %f' % gen_loss_np)
        print('Discriminator loss: %f' % dis_loss_np)
    return gen_loss_np, dis_loss_np
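For context, the GANLoss tuple that gets passed into this helper is created when the losses are attached to the model. With TFGAN that is roughly a single call; the snippet below is a sketch using the library's cross-entropy-based "modified" losses rather than the exact arguments in cgan.py:

# Sketch: attach a cross-entropy-style loss to each network of the model
# built from generator() and discriminator().
gan_loss = tfgan.gan_loss(
    gan_model,
    generator_loss_fn=tfgan.losses.modified_generator_loss,
    discriminator_loss_fn=tfgan.losses.modified_discriminator_loss)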

There’s one more file in the project, visualizer.py.

The functions found in visualizer.py are simple plotting utilities used to render and save the different images. They were constructed with basic plotting functions from matplotlib (a minimal sketch of the idea is below), so after that we'll turn our attention to the execution of the model in cgan.py.
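This is a minimal stand-in for what visualizer.py does, not the project's actual function: it tiles a batch of generated digits into a grid and saves the figure to disk.

import matplotlib.pyplot as plt
import numpy as np

def plot_digits(images, path, n_cols=10):
    """Tile a batch of [-1, 1] images into a grid and save the figure."""
    images = (np.asarray(images).squeeze() + 1.0) / 2.0  # rescale to [0, 1]
    n_rows = int(np.ceil(len(images) / float(n_cols)))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for ax, img in zip(np.ravel(axes), images):
        ax.imshow(img, cmap='gray')
    for ax in np.ravel(axes):
        ax.axis('off')
    fig.savefig(path, bbox_inches='tight')
    plt.close(fig)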

The model file is fairly long, so I will not reproduce it here, but I recommend running through the notebook that is associated with the project to help get a sense of how the model is structured.
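To give a flavor of what the training itself looks like, the run roughly amounts to building train ops for both networks and stepping them in alternation. The sketch below uses TFGAN's gan_train_ops and gan_train helpers and approximates, rather than reproduces, the code in cgan.py; the optimizer settings and log directory are assumptions.

# Rough sketch of the training run, assuming the gan_model and gan_loss
# objects built earlier; hyperparameters here are illustrative.
generator_optimizer = tf.train.AdamOptimizer(1e-3, beta1=0.5)
discriminator_optimizer = tf.train.AdamOptimizer(1e-4, beta1=0.5)

train_ops = tfgan.gan_train_ops(
    gan_model,
    gan_loss,
    generator_optimizer=generator_optimizer,
    discriminator_optimizer=discriminator_optimizer)

# Alternate generator/discriminator updates for 5000 steps.
tfgan.gan_train(
    train_ops,
    logdir='/tmp/mnist_cgan',
    hooks=[tf.train.StopAtStepHook(num_steps=5000)])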
Regardless, using this simple model I was able to get from this:

To this:

The results here are from 5,000 iterations of the model. It takes a lot of time to learn how to write, but the results are recognizable as “handwritten” digits. While this is a rather simplistic example, it shows the process of going from random noise to recognizable digits. Most importantly, this same process can now be adapted to generate more complex images. I’ve already begun adapting this framework to generate images from the CIFAR-10 data set, a collection of color images of 10 different classes of objects, and I’ll post the results soon. My goal is to have this running shortly, so check back for updates.
