# Building a Deep Handwritten Digits Classifier using Microsoft Cognitive Toolkit

The MNIST handwritten digits classification problem has long been used as the “Hello World” of machine learning. The recently updated Microsoft Cognitive Toolkit (CNTK) also uses it in one of its introductory tutorials, but only as an example for a simple neural network. For the convolutional neural network (CNN) tutorial, the CIFAR-10 image classification problem is used instead.

I decided to learn how to use CNTK by replicating the TensorFlow tutorial on the Deep MNIST handwritten classifier. My implementation is based on the two CNTK tutorials, so you might find some similarities between my code and theirs. My goal was to demonstrate how to build a basic CNN to solve a simple classification problem. Therefore, I tried to keep only the necessary parts from those tutorials and use the default parameters whenever possible. In addition, I made the function and variable names much more verbose and descriptive to make the code more readable.

Before we begin, we need to install CNTK. I found that the easiest way is to download the CNTK binary and run the installation script. It will install CNTK along with Anaconda 3 and Python 3.4 automatically. For this tutorial, I was using “CNTK for Linux v2.0 Beta2 CPU only” and running it on Ubuntu 16.04.1 LTS.

### 1. Import the necessary Python modules

This tutorial needs the following Python modules. *gzip*, *os*, *struct*, and *urllib* are only used for loading the data set. For building the neural network, we really only need *numpy* and *cntk*.

### 2. Loading the MNIST data set

I really like how TensorFlow makes it very easy for new users to get started by including many popular data sets as part of the library. Importing the MNIST data in TensorFlow only takes two lines of code. For CNTK, on the other hand, we have to download and process the data sets ourselves. In fact, the data loading parts of the code are much longer than the actual machine learning parts!

The five functions below enable us to download the MNIST handwritten digits training and testing data sets from Yann LeCun’s website, save them in a local folder named “MNIST”, and convert them to the text format required by the CNTK Text Reader. Most of this code was copied from the CNTK data loading tutorial. However, instead of re-downloading and re-processing the data sets every time we run the program, these functions look in the local directory first and load the local copy if it exists.
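To make the conversion step more concrete, here is a minimal sketch of parsing a gzip-compressed IDX image file and writing one sample as a text line. It is not the code from the CNTK tutorial: the function names are hypothetical, and the `|labels ... |features ...` layout is my assumption about the stream names the text reader is configured with. It runs on a tiny synthetic file so that it needs no download.

```python
import gzip
import io
import struct

import numpy as np


def parse_idx_images(raw_bytes):
    """Parse a gzip-compressed IDX image file into an (n, rows*cols) uint8 array."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as stream:
        # The header holds four big-endian unsigned 32-bit integers.
        magic, count, rows, cols = struct.unpack(">IIII", stream.read(16))
        if magic != 2051:  # 0x00000803 marks an IDX image file
            raise ValueError("unexpected magic number: {}".format(magic))
        pixels = np.frombuffer(stream.read(count * rows * cols), dtype=np.uint8)
    return pixels.reshape(count, rows * cols)


def to_cntk_text_line(one_hot_label, pixel_row):
    """Format one sample as a text line (stream names are assumed)."""
    label_text = " ".join(str(int(value)) for value in one_hot_label)
    feature_text = " ".join(str(int(value)) for value in pixel_row)
    return "|labels {} |features {}".format(label_text, feature_text)


# A synthetic "data set": two 2 x 2 images instead of 60,000 28 x 28 ones.
header = struct.pack(">IIII", 2051, 2, 2, 2)
body = bytes([0, 255, 128, 7, 1, 2, 3, 4])
fake_file = gzip.compress(header + body)

images = parse_idx_images(fake_file)
line = to_cntk_text_line([0, 1, 0], images[0])
```

The real MNIST files use the same header layout, just with 28 rows and 28 columns per image.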

Just one more step before we get to the machine learning part.

Each sample in the MNIST data set is a 28 x 28 pixel greyscale image (thus, 1 channel). This means that the input to the CNN has 784 (= 1 x 28 x 28) dimensions (*input_dim*).

image_shape = (1, 28, 28)

input_dim = int(np.prod(image_shape, dtype=int))

Each image can be classified as a digit between 0 and 9. Using one-hot encoding, we have a total of 10 output neurons (*output_dim*), each representing the probability that a given image belongs to one of the 10 classes.

output_dim = 10
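For readers new to one-hot encoding, the following numpy sketch shows what the label vectors look like. The helper name `one_hot` is hypothetical and not part of CNTK.

```python
import numpy as np

output_dim = 10


def one_hot(digits, num_classes=output_dim):
    """Turn a list of digit labels into rows with a single 1 at the digit's index."""
    digits = np.asarray(digits, dtype=int)
    encoded = np.zeros((digits.size, num_classes), dtype=np.float32)
    encoded[np.arange(digits.size), digits] = 1.0
    return encoded


encoded_labels = one_hot([3, 0, 9])
```

Each row has exactly one non-zero entry, so the digit 3 becomes a vector whose fourth element is 1 and whose other nine elements are 0.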

According to the MNIST database, the training and testing sets contain 60,000 and 10,000 samples respectively, and we are going to use all of them!

num_train_samples = 60000

num_test_samples = 10000

After specifying these constants about the data sets, we read in the data (either loading from file or downloading from web) and convert them to a format that is usable by CNTK using the functions defined above.

### 3. Construct the convolutional neural network

Constructing a neural network is actually very simple in CNTK. Unlike in TensorFlow, many variables and parameters are defined implicitly. For instance, it is not necessary to define the weight and bias variables; those are created for us when we create a convolutional layer.

The first convolutional layer computes 32 features from 5 x 5 patches with stride size of 1 and padding. We then feed its outputs into the first pooling layer, which does max pooling over 2 x 2 blocks.

convolutional_layer_1 = Convolution((5, 5), 32, strides=1, activation=cntk.ops.relu, pad=True)(input_vars)

pooling_layer_1 = MaxPooling((2, 2), strides=(2, 2), pad=True)(convolutional_layer_1)
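To illustrate what max pooling over 2 x 2 blocks does, here is a small numpy sketch on a single 4 x 4 feature map. It only mimics the idea and is not how CNTK implements `MaxPooling`; the helper assumes even height and width.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Max-pool a 2-D array with a 2 x 2 window and stride 2 (even dims assumed)."""
    height, width = feature_map.shape
    # Split each axis into (block index, position inside block).
    blocks = feature_map.reshape(height // 2, 2, width // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[1, 2, 5, 6],
                        [3, 4, 7, 8],
                        [9, 1, 0, 2],
                        [1, 1, 3, 0]])
pooled = max_pool_2x2(feature_map)  # [[4, 8], [9, 3]]
```

Every 2 x 2 block is reduced to its largest value, so the output has half the height and half the width of the input.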

The second convolutional layer is essentially the same as the first one except that it computes 64 features. Both layers use the classic ReLU function as their activation functions.

convolutional_layer_2 = Convolution((5, 5), 64, strides=1, activation=cntk.ops.relu, pad=True)(pooling_layer_1)

pooling_layer_2 = MaxPooling((2, 2), strides=(2, 2), pad=True)(convolutional_layer_2)

Each convolutional layer computes features over small patches of a given image independently. Then at the subsequent pooling layer, those small patches are pooled into a smaller number of bigger patches to capture the relationships among the different patches. After the two layers of pooling, the total number of patches is significantly smaller than the number of raw pixels. We then feed the outputs of all of those patches (or neurons) into a fully-connected layer where all patches that make up the entire image can be processed at once. Our fully-connected layer has 1024 neurons.

fully_connected_layer = Dense(1024, activation=cntk.ops.relu)(pooling_layer_2)
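It can help to trace how the image size shrinks on its way to the fully-connected layer. The sketch below assumes that, with padding enabled, a layer's output size is the input size divided by the stride and rounded up; that is the usual "same" padding rule, and I have not checked it against CNTK's internals.

```python
import math


def padded_output_size(input_size, stride):
    """Spatial output size of a padded layer, assuming ceil(input / stride)."""
    return math.ceil(input_size / stride)


size = 28
size = padded_output_size(size, 1)  # first convolution, stride 1: 28
size = padded_output_size(size, 2)  # first pooling, stride 2: 14
size = padded_output_size(size, 1)  # second convolution, stride 1: 14
size = padded_output_size(size, 2)  # second pooling, stride 2: 7

num_feature_maps = 64
flattened_inputs = num_feature_maps * size * size  # 64 * 7 * 7 = 3136
```

Under that assumption, each of the 1024 neurons in the fully-connected layer sees 3136 inputs.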

A dropout layer is added to prevent over-fitting during the training stage. It works by turning off a random set of neurons in both the forward pass (i.e. setting their activation values to 0) and the backward pass (i.e. not updating their weights). In CNTK, the dropout layer is automatically disabled during testing.

dropout_layer = Dropout(dropout_prob)(fully_connected_layer)
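The numpy sketch below shows the forward pass of dropout in its common "inverted" form, where the surviving activations are scaled up so that their expected value stays the same. Whether CNTK scales in exactly this way is an assumption on my part, and the function name is hypothetical.

```python
import numpy as np


def dropout_forward(activations, dropout_prob, rng, training=True):
    """Zero a random subset of activations; return the result and the mask used."""
    if not training or dropout_prob == 0.0:
        # Dropout is a no-op at test time.
        return activations, np.ones_like(activations)
    keep_mask = rng.random(activations.shape) >= dropout_prob
    scaled_mask = keep_mask / (1.0 - dropout_prob)
    # The same mask would be applied to the gradients in the backward pass.
    return activations * scaled_mask, scaled_mask


rng = np.random.default_rng(0)
activations = np.ones((4, 1024), dtype=np.float32)
dropped, mask = dropout_forward(activations, 0.5, rng)
unchanged, _ = dropout_forward(activations, 0.5, rng, training=False)
```

With a dropout probability of 0.5, roughly half of the values in `dropped` are 0 and the rest are 2.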

Finally, the output layer, which is made up of 10 neurons, outputs 10 values. Because this layer has no activation function, these are raw scores; once the CNN is trained, applying the softmax function to them gives the probability of each class for a given image.

output_layer = Dense(out_dims, activation=None)(dropout_layer)
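As a quick illustration of how 10 raw output values turn into class probabilities, here is a numpy version of the softmax function. The example scores are made up.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow without changing the result.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / np.sum(exponentials, axis=-1, keepdims=True)


# Hypothetical raw outputs of the 10 output neurons for one image.
logits = np.array([1.0, 2.0, 0.5, -1.0, 0.0, 3.0, 0.2, -0.5, 1.5, 0.1])
probabilities = softmax(logits)
predicted_digit = int(np.argmax(probabilities))
```

The predicted digit is simply the index of the largest probability, which is also the index of the largest raw score.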

As some of you might have noticed, I configured this network to be identical to the one presented in the TensorFlow Deep MNIST tutorial. I have no explanation for why this particular configuration and CNN structure were chosen.

We also need to define a placeholder for the input variables that will be fed into the CNN during the training and testing processes.

input_vars = cntk.ops.input_variable(image_shape, np.float32)

The code below shows how everything is put together.

*Extras*

*We can speed up the training process by initializing the weights with random values drawn from a Gaussian distribution and the biases with a small constant value.*
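In plain numpy, such an initialization might look like the sketch below. The standard deviation of 0.1, the bias value of 0.1, and the weight layout are illustrative assumptions only, and this does not show the CNTK initializer API.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical layout: (output features, input channels, kernel height, kernel width).
conv1_weights = rng.normal(loc=0.0, scale=0.1, size=(32, 1, 5, 5)).astype(np.float32)

# One small positive constant per output feature.
conv1_biases = np.full(32, 0.1, dtype=np.float32)
```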

### 4. Set up the model trainer

The model trainer requires the training labels as input. It is basically an array of 10 (*output_dim*) numbers. Again, it must be defined as a placeholder (*cntk.ops.input_variable*) since we are going to feed it data later during the training process.

labels = cntk.ops.input_variable(output_dim, np.float32)

Rather than adjusting the weights and biases one training sample at a time, we train in minibatches of data. This allows us to optimize for the average loss at each backpropagation pass, which enables training to complete faster and be less affected by noise in the data.

The learning rate and momentum affect how much the values of the weights and biases change at each backpropagation pass. In general, a lower learning rate makes the training process less affected by noise at the cost of longer training time, while a higher momentum smooths the updates by carrying over more of the previous update direction.

train_minibatch_size = 100

learning_rate = 1e-4

momentum = 0.9

We then specify the loss function for the trainer. As in the TensorFlow tutorial, we use the cross-entropy loss function here. For those who are interested, this article explains why using cross-entropy as the loss function is better than using classification error.

loss = cntk.ops.cross_entropy_with_softmax(output, labels)
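To show what this loss measures, here is a numpy sketch of softmax followed by cross-entropy. It mirrors the name of the CNTK function for readability, but it is my own re-implementation of the textbook formula, not CNTK's code.

```python
import numpy as np


def cross_entropy_with_softmax(logits, one_hot_labels):
    """Per-sample cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
    return -np.sum(one_hot_labels * log_probs, axis=-1)


# Two samples with 3 classes each (made-up numbers).
logits = np.array([[2.0, 0.0, 0.0],   # leans towards the correct class 0
                   [0.0, 0.0, 0.0]])  # completely undecided
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
losses = cross_entropy_with_softmax(logits, labels)
```

The undecided sample gets a loss of ln(3), while the sample that leans towards the correct class gets a smaller loss.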

In addition, we need to specify the function that computes the classification error so that we know how well our classifier is doing.

label_error = cntk.ops.classification_error(output, labels)
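Conceptually, the classification error is the fraction of samples whose highest-scoring class differs from the true label. A numpy sketch of that idea (again my own re-implementation with made-up data) looks like this:

```python
import numpy as np


def classification_error(outputs, one_hot_labels):
    """Fraction of samples where the top-scoring class is not the labelled class."""
    predicted = np.argmax(outputs, axis=-1)
    actual = np.argmax(one_hot_labels, axis=-1)
    return float(np.mean(predicted != actual))


outputs = np.array([[0.1, 2.0, 0.3],
                    [1.5, 0.2, 0.1],
                    [0.0, 0.1, 0.9],   # predicted class 2, labelled class 1
                    [0.7, 0.2, 0.1]])
labels = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])
error_rate = classification_error(outputs, labels)  # 1 of 4 wrong: 0.25
```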

Finally, we instantiate the trainer. Here we use the Adam optimizer, a variant of stochastic gradient descent, to update the parameters.

learner = cntk.adam_sgd(output.parameters, learning_rate, momentum)

trainer = cntk.Trainer(output, loss, label_error, [learner])
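For the curious, the sketch below shows a single step of the textbook Adam update rule in numpy. I have not verified that CNTK's learner follows it exactly, and the `beta2` and `epsilon` values are commonly used defaults that I am assuming here, not values taken from CNTK.

```python
import numpy as np


def adam_step(params, grads, state, learning_rate=1e-4,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one textbook Adam update and return the new parameters."""
    state["step"] += 1
    # Running averages of the gradient and of the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grads
    state["v"] = beta2 * state["v"] + (1 - beta2) * grads ** 2
    # Bias correction compensates for the zero initialization.
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    return params - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)


# Toy problem: minimize the sum of squares, whose gradient is 2 * params.
params = np.array([1.0, -2.0])
state = {"step": 0, "m": np.zeros_like(params), "v": np.zeros_like(params)}
first_step = adam_step(params, 2 * params, state, learning_rate=0.01)

params = first_step
for _ in range(999):
    params = adam_step(params, 2 * params, state, learning_rate=0.01)
```

Here `beta1` plays the role of the momentum value of 0.9 used in this tutorial.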

### 5. Train the model

With both the convolutional neural network and the trainer set up, we are at last ready to train the model!

In each epoch, we feed the trainer a new minibatch of *train_minibatch_size* samples (100, as set in section 4) on every loop iteration until we have used all 60,000 training samples. The trainer adjusts the weights and biases through backpropagation to reduce the training error at each iteration.
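The structure of one epoch can be sketched without CNTK as a loop over index ranges. The helper below is hypothetical and uses the minibatch size of 100 from section 4; in the real loop, each range would be replaced by a call that reads the next minibatch and passes it to the trainer.

```python
def minibatch_ranges(num_samples, minibatch_size):
    """Yield (start, end) index pairs that cover all samples exactly once."""
    for start in range(0, num_samples, minibatch_size):
        yield start, min(start + minibatch_size, num_samples)


num_train_samples = 60000
train_minibatch_size = 100

ranges = list(minibatch_ranges(num_train_samples, train_minibatch_size))
num_iterations_per_epoch = len(ranges)  # 60000 / 100 = 600
```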

### 6. Evaluate the classifier

After roughly 5 to 10 minutes of training (depending on how fast your computer runs), we have our handwritten digits classifier at last!

We can feed the test data into our classifier and see how well it performs. In this particular implementation, I got an average classification error of around 1.46% after just 1 epoch and 1.03% after 2 epochs!

I hope this tutorial has been useful in getting you started with CNTK. You can find the entire implementation here on my GitHub. Feel free to download the code and make it even better!