A dummy’s guide to Deep Learning (part 3 of 3)


As I’m typing this sentence, a beautiful program is running on my Mac in the background. My model is learning to do something we never imagined a computer could do. It’s not doing well enough yet, but it’s getting better every day.

Welcome to the 3rd part of this article! In case you haven’t seen part I or part II yet:
Part I: deep learning: a gold mine
Part II: deep learning 101
In this final piece, I’m going to walk you through a real example. We are going to see how we can train a deep model to recognize digits from images — even hand-written ones.


Let the training begin!
We are going to walk through a demo project written with the Torch framework. Torch is one of the most powerful tools for deep learning, and it has nice mobile integration, so you can easily apply the models you trained on your workstation to a mobile app.
You don’t have to install anything right now, and it’s totally fine if you are reading this on your phone. I’m just going to show you a few core pieces of the code and explain how they reflect the concepts we learned in part II; after reading this, you can start programming your own deep model with a very similar code structure.
Torch applications are written in a language called Lua. Don’t worry at all if you don’t know it yet; it’s probably the simplest programming language there is, and I bet you can start writing it after looking at the example below. The only gotcha is, array indices start at 1, not 0. You are welcome. Now you are ready to go!
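Here’s a tiny, self-contained taste, just to show how little there is to it:
-- a Lua table doubling as an array; note the 1-based indexing
local digits = {'zero', 'one', 'two'}
print(digits[1])        -- prints "zero", because indices start at 1
for i = 1, #digits do   -- #digits is the length of the array
   print(i, digits[i])
end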
Before we jump into the code, let’s give a simplified definition to the problem we are trying to solve:
Given a picture of a digit from 0 to 9, return the numeric value of the digit.
This definition excludes cases where the picture is not a digit at all or has multiple digits in it, just to simplify the problem for demonstration purposes. Given this definition, the task becomes a multi-class classification problem.
If we are to tackle this problem ourselves, we need to think about where we can get accurate training data. In this case, we need a lot of images of digits, and we need to know which digit each of them shows. Luckily, there is a dataset called MNIST that exists exactly for this purpose.
Now open up https://github.com/torch/demos/tree/master/train-a-digit-classifier.


There are only 2 code files: dataset-mnist.lua and train-on-mnist.lua.
The code in dataset-mnist.lua can be used to download and parse the MNIST dataset. It also does some simple normalization so that all the images look equally bright. I encourage you to read through the file and get familiar with Lua, but for now we’ll skip this part and focus on the other code file, train-on-mnist.lua, which is where the real party is.
There are already a lot of comments in train-on-mnist.lua which are useful for understanding its logic, but here let me explain the core parts of the program.
First, this code supports training with different types of models. “convnet” is the convolutional neural network model that we want to use. The following code sets up the structure of the network.
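Condensed from the “convnet” branch of train-on-mnist.lua, it looks roughly like this:
model = nn.Sequential()
-- stage 1: convolutional feature detectors -> squashing -> max pooling
model:add(nn.SpatialConvolutionMM(1, 32, 5, 5))
model:add(nn.Tanh())
model:add(nn.SpatialMaxPooling(3, 3, 3, 3))
-- stage 2: the same routine, with more and more complex detectors
model:add(nn.SpatialConvolutionMM(32, 64, 5, 5))
model:add(nn.Tanh())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- stage 3: a fully connected stage mapping everything to the 10 classes
model:add(nn.Reshape(64*2*2))
model:add(nn.Linear(64*2*2, 200))
model:add(nn.Tanh())
model:add(nn.Linear(200, #classes))   -- #classes is 10, one output per digit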
Let’s look at the first function call:
nn.SpatialConvolutionMM(1, 32, 5, 5)
SpatialConvolutionMM implements the block-by-block feature detectors, or neurons, that we talked about in part II. This sets up the first layer of the model.
The first parameter “1” means the input has only 1 “channel”. Color images usually have three channels: R, G, and B. But for recognizing digits we only need to look at images in black and white, which have a single greyscale channel.
The second parameter “32” means we have 32 feature detectors in this layer that can look for 32 different patterns. It should be enough for detecting all kinds of edge angles and solid-color blocks.
The 3rd and 4th parameters “5” and “5” mean we scan images with 5 x 5 blocks.
This first layer produces feature maps, with values indicating how similar each block is to the feature detector’s target pattern. It’s like saying “this block is 75% similar to a horizontal line”.
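If you do have Torch installed, you can watch these feature maps take shape (the demo works with 32 x 32 versions of the MNIST images, so that’s what we feed in here):
require 'nn'
local conv = nn.SpatialConvolutionMM(1, 32, 5, 5)
local img = torch.rand(1, 32, 32)   -- a fake single-channel 32 x 32 image
print(conv:forward(img):size())     -- 32 x 28 x 28: 32 feature maps, each 28 x 28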
Now the second layer:
nn.Tanh()
Tanh is a function that behaves like a real neuron. It returns values close to -1 for strongly negative inputs, but once the input crosses a certain threshold it quickly transitions to values close to positive 1, which mimics a neuron “firing” a signal. It’s like saying “yes, this block is a horizontal line”: it hides the raw similarity value reported by the previous layer and reports something close to a boolean instead. This kind of function is called a transfer function. There are other transfer functions we can use as well; see https://github.com/torch/nn/blob/master/doc/transfer.md. ReLU is also a popular choice.
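You can see this squashing behavior directly:
require 'nn'
local squash = nn.Tanh()
print(squash:forward(torch.Tensor{-3, 0, 3}))
-- roughly -0.995, 0.000, 0.995: strongly negative inputs land near -1,
-- strongly positive ones near +1, and the transition happens around 0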


The third layer:
nn.SpatialMaxPooling(3, 3, 3, 3)
SpatialMaxPooling is a function that looks at an M x N block at a time, and if any of the neurons in this block is “firing” a signal, it considers the whole block as “firing”. The four parameters here mean a 3 x 3 window that moves 3 pixels at a time (the stride), so the windows don’t overlap. This makes the model more tolerant to small offsets. Here’s a Quora post that better explains the purpose of max pooling: https://www.quora.com/What-is-pooling-in-a-deep-architecture.
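To make this concrete, here’s a toy run (using a smaller 2 x 2 window so the numbers stay readable):
require 'nn'
local pool = nn.SpatialMaxPooling(2, 2, 2, 2)   -- 2 x 2 windows, moved 2 pixels at a time
local map = torch.Tensor{{{1, 2, 0, 0},
                          {3, 4, 0, 0},
                          {0, 0, 5, 0},
                          {0, 0, 0, 6}}}        -- one 4 x 4 feature map
print(pool:forward(map))                        -- a 2 x 2 result: 4 0 / 0 6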
This SpatialConvolutionMM + Tanh + SpatialMaxPooling combination typically comes as a whole stage in a deep neural network. As seen in the code snippet, there’s a second stage that does the same routine, but it will be able to find 64 larger and more complex patterns based on the 32 simple ones produced by stage 1. See the parameters:
nn.SpatialConvolutionMM(32, 64, 5, 5)
Can you stop for a moment and explain the meaning of each parameter in that function call?
Wow, look at you! You are getting good already!
Stage 3 is called a fully connected layer. It basically takes all the 64 x 2 x 2 neuron signals from stage 2, and maps them to 10 outputs. (Where does 2 x 2 come from? The 32 x 32 input shrinks to 28 x 28 after the first 5 x 5 convolution, to 9 x 9 after the 3 x 3 pooling, to 5 x 5 after the second convolution, and to 2 x 2 after the final pooling.) Each of the outputs stands for a digit. So if the first output node reports a high value, it means the model thinks the image has a good chance of being digit “1”.
Further down below in the code file, there’s a snippet like this:
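model:add(nn.LogSoftMax())
criterion = nn.ClassNLLCriterion()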
Let’s not worry about what a loss function means yet. Look at the first function call:
nn.LogSoftMax()
LogSoftMax is a function that takes the 10 values we generated in the last layer and converts them into 10 positive numbers that sum up to exactly 1, then returns the logarithm of each (the log form plays nicely with the loss function we’ll meet in a moment). These 10 numbers can be interpreted as probabilities. If the first probability turns out to be 0.7, we can consider it as the model thinking there’s a 70% chance the image is of the digit “1”.
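Here’s a toy 3-class version of that (the demo, of course, has 10 classes):
require 'nn'
local scores = torch.Tensor{2.0, 1.0, 0.1}         -- raw scores from the last layer
local logProbs = nn.LogSoftMax():forward(scores)
print(logProbs:exp())                              -- roughly 0.66, 0.24, 0.10: they sum to 1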
That’s the structure of the model. Difficult tasks require more layers, more neurons, and therefore more compute power to train and run. You will need to look for a good balance when you try to solve real problems on real hardware, so keep in mind that it’s our responsibility to design the structure of the network.
The second function call above is not really part of the model though. It’s how we are going to adjust the model.
criterion = nn.ClassNLLCriterion()
This part is also called “the loss function”. Given an input to the model (here, an image from the MNIST dataset), the loss function looks at the correct output (the pre-labeled answer we get from the dataset) and at the actual output (the 10 probability numbers from LogSoftMax, which stand for what digit our model thinks the image is), and returns a value that indicates “how wrong” the model is. The training process simply tries to minimize the output of this function, so that our model becomes “less and less wrong”.
For example, if the image is of digit “3”, and our model looks at the image and says it’s 100% the digit “8”, then it’s totally wrong, and the loss function returns a high value. However, if the model says the image is 30% an “8” and 70% a “3”, then it’s not that wrong, and the loss function should return a lower value.
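Here is that intuition in code, again with a toy 3-class setup:
require 'nn'
local criterion = nn.ClassNLLCriterion()
-- ClassNLLCriterion expects log-probabilities (the output of LogSoftMax),
-- hence the :log() calls below. Say the correct class is 3,
-- and the model gives it 70% probability:
local mildlyWrong = torch.Tensor{0.10, 0.20, 0.70}:log()
print(criterion:forward(mildlyWrong, 3))   -- -log(0.70), about 0.36
-- now the model gives the correct class only 1% probability:
local veryWrong = torch.Tensor{0.98, 0.01, 0.01}:log()
print(criterion:forward(veryWrong, 3))     -- -log(0.01), about 4.6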
ClassNLLCriterion is only one type of loss function that we can use here. It’s a good choice for a classification problem. For other problems like regression, people use other loss functions like “mean square error” to describe how wrong the model is. You can see more types of loss functions here: https://github.com/torch/nn/blob/master/doc/criterion.md.
Further down the code file, you might run into a line that is, quite literally, confusing:
confusion = optim.ConfusionMatrix(classes)
This is just a helper object that records, for each true digit, how often the model predicted each digit, so correct answers pile up on the diagonal of the matrix. It can be printed to the screen to give us a quick look at the performance of the model during training and evaluation.
Moving on, the next interesting chunk is inside the train() function:
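Condensed, it looks roughly like this (t is the index of the first example in the batch; opt.batchSize and geometry are configuration values defined earlier in the file):
-- create a mini batch of images and their correct labels
local inputs = torch.Tensor(opt.batchSize, 1, geometry[1], geometry[2])
local targets = torch.Tensor(opt.batchSize)
local k = 1
for i = t, math.min(t + opt.batchSize - 1, dataset:size()) do
   local sample = dataset[i]
   inputs[k] = sample[1]:clone()               -- the image
   local _, target = sample[2]:clone():max(1)  -- the label, stored as a one-hot vector
   targets[k] = target:squeeze()
   k = k + 1
end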
The idea here is that instead of showing the model one example at a time, we show it a batch of examples to increase efficiency. If you read this snippet carefully, you will realize it’s creating a batch of inputs stored in the local variable “inputs”, and the same number of pre-labeled correct outputs stored in “targets”.
Next we will send this batch of inputs to the model and see what the model says:
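In the demo this happens inside a small function that the optimizer (coming up shortly) will call over and over; roughly:
-- parameters and gradParameters are the model's weights and their
-- gradients, flattened into two big vectors earlier in the file
local feval = function(x)
   if x ~= parameters then
      parameters:copy(x)
   end
   gradParameters:zero()   -- reset the gradients from the previous batch
   local outputs = model:forward(inputs)
   local f = criterion:forward(outputs, targets)
   local df_do = criterion:backward(outputs, targets)
   model:backward(inputs, df_do)
   return f, gradParameters
end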
Don’t worry too much about gradients for now. The real action here is
local outputs = model:forward(inputs)
This sends the batch of inputs to the model and gets all the outputs in a single run. Then we run the loss function to evaluate how wrong the model is:
local f = criterion:forward(outputs, targets)
Then we “propagate” the error back to the model, which basically tells the model to calculate how much each neuron needs to change in order to eliminate the error. That’s what these 2 lines do:
local df_do = criterion:backward(outputs, targets)
model:backward(inputs, df_do)
You are encouraged to read more about this to get a deeper understanding, but most likely you will write this part exactly the same way for a lot of applications.
Once the model realizes how each neuron should be corrected, we need to do the actual correction and adjust the weights and parameters of the neurons. But by how much? Do we adjust them all the way, so that the model will output the expected value just for this one image? No. If we did that, our model would always return the correct answer for the last image we showed it, while doing terribly on all other images. A better way is to adjust in the right direction only one tiny bit at a time. This is what the optimization algorithms do:
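In the demo this is a short branch, roughly:
if opt.optimization == 'LBFGS' then
   optim.lbfgs(feval, parameters, lbfgsState)
elseif opt.optimization == 'SGD' then
   sgdState = sgdState or {
      learningRate = opt.learningRate,   -- how big each tiny adjustment is
      momentum = opt.momentum,
      learningRateDecay = 5e-7
   }
   optim.sgd(feval, parameters, sgdState)
end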
LBFGS and SGD are two different optimization algorithms. By default this program uses SGD, which is a good choice. You should read more about the different optimization algorithms if you want to make an optimal choice for your own application.
That’s it! We defined the model structure, we kept showing the model batches of images and kept correcting the model towards the right direction. That’s how we train a model.
As an exercise, I recommend you install Torch on your workstation, download the two code files in the demo project, modify the code and insert print() statements everywhere to print out information you are interested in, and run it yourself. Watch how the confusion matrix changes as the model gets better and better.
Finally, now that we know how deep learning works, what can we do with it?
That is for you to decide.
Thank you for reading! If you enjoyed the article, please click on the little green heart below to recommend it so that more people can see it. Follow The Bleeding Edge to stay up to date on the latest technologies and inventions!