The Brogrammer’s Guide To Deep Learning #1

One of the most interesting things to play around with in programming is deep learning, which has been making the rounds in the media and is quickly becoming the most prominent buzzword of the latter half of this decade. We are constantly hearing from researchers and the public relations departments of corporations about novel ways to apply this technology. VCs and the ever-annoying techno-optimists write blog posts opining about its potential use in medicine, automation, and scientific research. In this series we are going to be focusing on something far more practical … HOES.

My favorite porn site right now is this amazing soft-core porn site where the premise is that all the women on it are real girls. They are not plastic bimbos that set unrealistic expectations that damage the self-esteem of women and distort men’s standards. These are the types of women you see at school, work, and church. I’m a bro, but that doesn’t mean I can’t be a feminist, and I urge you to have your pornography needs filled by Facebook.

While a great tool, Facebook has some serious drawbacks. At first you think you’ve discovered a life hack when you programmatically download the pictures in a cutie’s spring break album, only to find that you are faced with a conundrum. For every picture like this.

The ideal body type for a woman.

There is a picture like this.

A dog. What the fuck Natalie.

You end up spending an inordinate amount of time sifting through photos only to find this sort of filth. I mean, some of these pictures even include children! Ideally we’d want to write code that told us whether or not a photo should be added to our iCloud. It would be impossible for you to write a traditional algorithm that could analyze a picture and tell you about its contents. Enter deep learning. By the end of this lesson you should be able to implement a feed-forward neural network in Python.

Like women, this guide should be easy. I am not going to assume any background in machine learning or mathematics. The only requirement to follow along with this series is some basic knowledge of Python. I do recommend that you use the Anaconda distribution from Continuum as it makes it far easier to install the necessary packages, especially on Windows.

Although deep learning can seem intimidating at first, it boils down to four basic building blocks. Every machine learning problem can be defined in terms of its training data, its score function, its objective function, and its optimization algorithm. The score function produces a value for each potential class label. The objective function is some differentiable function that tells us how well our algorithm is doing, using the training data as a benchmark. The optimization algorithm adjusts the parameters of the algorithm in a direction that pushes the objective function toward a better value.

1. The Data

When it comes to supervised learning problems, your training data is going to take the form of input and output pairs. For our specific example, each {x, y} pair is going to be an image represented in a way that makes sense to perform mathematical operations on, and a number representing the appropriate label.
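Concretely, a pair might look like this. The shapes and the class meanings here are purely illustrative (a 28×28 grayscale image and a made-up binary label), not anything from the original data:

```python
import numpy as np

# A hypothetical 28x28 grayscale image, flattened into a 784-dimensional
# row vector so we can multiply it by weight matrices later.
image = np.random.rand(28, 28)
x = image.reshape(1, -1)   # shape (1, 784)

# The label: 0 for "keep this photo", 1 for "discard it" (made-up classes).
y = 0

print(x.shape)   # (1, 784)
```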

2. The Score Function

The score function of a neural network is a composition of linear and non-linear matrix operations. You may have seen images of neural networks that look like this.

A Three Layer Network

As we can see from the above image, each of the “neurons” in the previous layer sends a signal to each of the neurons in the next layer. Each of these signals has a weight attached to it, and the weighted sum of all the inputs is computed in the neuron of the next layer. A bias term is then added to the neuron; its magnitude and sign represent how easy or hard it is for the neuron to fire. A non-linear activation function is then applied to the operation in the hidden layer, which allows the network to learn non-linear relationships.

The linear computation can be represented using linear algebra by the following function.

Linear Classifier

If we look at it in matrix form, we can see how this operation is equivalent to the flow of weighted signals we observed in the neural network diagram above.

Linear Classifier Matrix

Matrix notation may seem intimidating at first, but at the end of the day it really is just a compact way of expressing basic arithmetic.

As we can see in the above illustration of matrix multiplication, the operation mirrors the activity of an artificial neuron. Each row of the x matrix represents a signal from the lth layer. Each column of the weight matrix represents the weights a neuron in the (l + 1)th layer applies to the incoming signals. They are multiplied together and summed up to form the outputs. The bias vector is just added element-wise to each row in the product. A non-linear activation function is then applied to this result.
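In numpy the whole linear step is one line. The sizes here (three inputs, two neurons) are chosen purely for illustration:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])   # shape (1, 3): one example, three signals
W = np.array([[0.1, 0.4],
              [0.2, 0.5],
              [0.3, 0.6]])        # shape (3, 2): one column per neuron
b = np.array([0.5, -0.5])         # one bias per neuron

# Each column of W weights the incoming signals; the weighted sums
# plus the biases give the pre-activations of the next layer.
z = x @ W + b
print(z)   # [[1.9 2.7]]
```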

As of today the most popular activation function is the rectified linear unit (ReLU). It trains networks faster, as it tends to allow gradients to flow better than the alternatives.

Rectified Linear Unit Activation Function
Rectified Linear Unit Graph

The process is repeated until the last layer, where a different activation function is applied. At the last layer the softmax function is applied, which allows the scores to be interpreted as probabilities.

The Softmax Function
Softmax Outputs
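A minimal softmax in numpy looks like this (the max-subtraction is a standard numerical-stability trick; the scores are toy values):

```python
import numpy as np

def softmax(scores):
    # Shift by the max for numerical stability; this does not change the result.
    exps = np.exp(scores - np.max(scores, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1]])
probs = softmax(scores)
print(probs)         # roughly [[0.659 0.242 0.099]]
print(probs.sum())   # 1.0 -- the rows are valid probability distributions
```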

3. The Cost Function

The cost function is a differentiable function that gives us a numerical measure of how well our neural network is performing. For classification tasks we are going to want to use the cross entropy function. This function takes the negative log of the probability the scoring function computes for the correct class.

The Cross Entropy Loss Function

A glance at the graph of this function makes it obvious why we would want to use it.

Graph of Cross Entropy Function

When the classifier is 100% confident in predicting the correct class the loss of that particular example is zero. As the classifier becomes less confident in the correct class the loss approaches infinity. We average the loss of all the individual training examples in order to come up with the final loss.
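In code, the averaged cross entropy is only a couple of lines. The probabilities and labels below are toy numbers for illustration:

```python
import numpy as np

# Predicted class probabilities for three examples (rows) over three classes,
# and the index of the correct class for each example.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])
y = np.array([0, 1, 2])

# Negative log of the probability assigned to the correct class,
# averaged over the batch.
correct_probs = probs[np.arange(len(y)), y]
loss = -np.mean(np.log(correct_probs))
print(loss)   # about 0.595
```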

Most of the time, aside from the data loss, you are going to want to add a regularization component to your loss. Regularization is a technique to prevent your network from over-fitting.

The total loss is the regularization loss added to the data loss.

4. The Optimization Algorithm

Before a network is trained, its predictions are random guesses, produced by weights and biases we picked at random. We should expect the guesses to be uniformly distributed among the classes, so for a classification problem with three categories we should expect the accuracy to be around 33.33%. We are going to use our cost function to tell us how poor our guesses are, and will slightly change our guesses as to the appropriate weights and biases in directions that improve the network’s performance.

We do this by using the signal that each weight and bias transmits onto the cost function. These “signals” are called gradients, and they can be used to tell us the direction in which the weights and biases need to be moved in order to minimize the loss.

The gradients are computed via backpropagation, which basically takes advantage of the fact that the cost function is a composition of functions, allowing us to use the multivariate chain rule. Gradient computation is the most difficult part of deep learning, and it helps to think about it in terms of computational graphs.

We multiply the gradient by a small number called the learning rate, and subtract the product from the current value of the weights to improve the scoring.
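The update rule is easiest to see on a one-dimensional toy problem. Here we minimize f(w) = w², whose gradient is 2w (a stand-in example, not part of the network yet):

```python
import numpy as np

learning_rate = 0.1
w = np.array([2.0])

# Minimize f(w) = w**2; its gradient is 2*w.
for _ in range(50):
    grad = 2 * w
    w = w - learning_rate * grad   # step against the gradient

print(w)   # very close to 0, the minimum of f
```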

Backpropagation and Computational Graphs

The backpropagation algorithm can be very intimidating when looked at in terms of the complex formulas you may encounter searching Google. The idea is easier to understand when thought of in terms of computational graphs, and easier still when transported to code. Let’s look at the composition of some very simple multivariate functions to see the chain rule in effect.

We want to know how much of an impact a change in x would have on the final value of j(k, g). Looking at these operations on a computational graph makes the process of computing gradients of complex functions easier to understand. Let’s look at how we would compose these functions on a computational graph.

Computational Graph

The computational graph is built up of nodes containing inputs that flow forward through “gates” which perform a mathematical operation on one or two inputs. The inputs provide signals that flow through the nodes that produce the output.

The gradient is a backward flowing signal that gives us information about how the inputs affects the output of the graph.

Let us start by examining the local gradients of the functions as they flow through their nodes.

For the function f(x, y) we can see that when x increases by one f(x,y) increases by y and when y increases by one f(x, y) increases by x.

For the function z(k, g) as k or g increases by one the function z(k,g) increases by one.

Lets look at these local gradients as they flow backwards through their nodes on the computational graph.

Local gradients flowing backwards

The graph above just shows the signals that each input has on its own node. What we are really interested in is the backward flowing signal of each input on the output of the entire graph. To do this we let the gradients from the upper layers flow backwards into the lower ones. This is done through multiplication.

Computational graph “chained” gradients

As the gradients from the upper layer flow backwards, they are multiplied by the local gradients of the lower layer to get the lower layer’s impact on the output of the full function. We can see from this graph that if x is increased by one to 3, the output of the graph is 50, which is 15 greater than the current output of 35. Hence the gradient of x with respect to the full graph is 15. An increase of x by one creates a pull of 15 on the output of the full function. Here we see the chain rule in effect.
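We can replay that graph in a few lines of Python. The figures aren’t reproduced here, so the variable names and input values below are my reconstruction, chosen to match the numbers quoted above (an output of 35 at x = 2, rising to 50 at x = 3):

```python
# Reconstructed inputs: x = 2 gives an output of 35; x = 3 would give 50.
x, y, g, m = 2.0, 3.0, 1.0, 5.0

# Forward pass through the graph.
f = x * y   # multiply gate: 6
z = f + g   # add gate:      7
j = z * m   # multiply gate: 35

# Backward pass: chain local gradients together by multiplication.
dj_dz = m            # multiply gate routes the other input's value: 5
dj_df = dj_dz * 1.0  # add gate passes gradients through unchanged:  5
dj_dx = dj_df * y    # multiply gate again: 5 * 3 = 15

print(j, dj_dx)   # 35.0 15.0
```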

The chain rule used to compute partial derivative of g with respect to x
The chain rule used to compute partial derivative of g with respect to y

When we see a node that branches out to several nodes, the gradients flowing from each are summed up.

The gradients flowing back to this node should be summed up

We can think of the loss function as one giant computational graph that is composed of the scoring function. We can use the chain rule to compute the gradients of the weights and biases with respect to the loss function. The graphs we will be using will of course have matrices instead of scalars, but gradients work the same way. We compute the “error” of the last layer and use it to easily compute the errors of the previous layers, and luckily for us the error is as simple as f(x) − y, where f(x) is our function and y is the training data.

Now that we have gotten the theory out of the way, let’s dive into the programming. You’ll find deep learning is quite accessible when you approach it by simply diving into problems. We’ll be able to write a program that builds and trains a neural network with an arbitrary number of layers in less than 200 lines of nothing but Python and numpy. We’ll use more powerful frameworks like TensorFlow in later tutorials. First navigate to your bash terminal and create a new environment; the Python version we want to use is 3.5.

Create Conda Python 3.5 Environment
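The original screenshot isn’t reproduced here; a command along these lines should do it (the environment name matches the activate step below, and the package list is my assumption):

```shell
# Create an isolated conda environment named "deeplearning" with Python 3.5,
# plus the packages this tutorial uses.
conda create -n deeplearning python=3.5 numpy jupyter
```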

After that you are going to want to activate your newly created environment, with source activate deeplearning in your bash console.

Install Packages

After you activate your environment, fire up a Jupyter notebook server. We’re going to import numpy and get started by defining the two activation functions, and some functions for initializing our weights and biases.

Import libraries make parameters

Notice the way I have chosen to initialize my weights by setting them to small non-zero values. It is important to set your weights to non-zero values in order to break symmetry. If all the weights are zeros, all of the outputs will also be zero, resulting in no gradients being passed back.
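Since the code screenshot isn’t reproduced here, a sketch of initialization helpers consistent with that description might look like this (the function names match the ones mentioned later; the 0.01 scale is a common heuristic, not necessarily the author’s exact choice):

```python
import numpy as np

def get_weights(n_in, n_out):
    # Small random values break the symmetry between neurons.
    return np.random.randn(n_in, n_out) * 0.01

def get_biases(n_out):
    # Once the weights are random, biases can safely start at zero.
    return np.zeros((1, n_out))

W = get_weights(784, 64)
b = get_biases(64)
print(W.shape, b.shape)   # (784, 64) (1, 64)
```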

Then we need to define our activation function. Recall that the rectified linear unit is the identity function for non-negative values and zero for negative ones. So all we need to do is use numpy's Boolean selection capabilities to turn all the negative values into zeros. The function takes a keyword argument derivative, which instructs it to compute the derivative during a backward pass.

Relu and Softmax Activations
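The screenshot isn’t reproduced here; a minimal version of the two activations, including the derivative keyword argument described above, could be:

```python
import numpy as np

def relu(z, derivative=False):
    if derivative:
        # Gradient is 1 where the input was positive, 0 elsewhere.
        return (z > 0).astype(z.dtype)
    # Identity for non-negative values, zero for negative ones.
    return np.maximum(0, z)

def softmax(z):
    exps = np.exp(z - np.max(z, axis=1, keepdims=True))  # stability shift
    return exps / np.sum(exps, axis=1, keepdims=True)

print(relu(np.array([[-1.0, 2.0]])))                   # [[0. 2.]]
print(relu(np.array([[-1.0, 2.0]]), derivative=True))  # [[0. 1.]]
```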

Now onto building the network. We’ll create an MLP class which has an __init__ method that takes in a tuple of sizes as an argument. We will use the get_weights and get_biases methods to build our initial guesses at the appropriate parameters.

Network __init__ method
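Again, the screenshot isn’t shown here; a sketch of the constructor under the assumptions above (inline initialization rather than the separate helper methods) might look like:

```python
import numpy as np

class MLP:
    def __init__(self, sizes):
        # sizes, e.g. (784, 64, 10): input width, hidden widths, output width.
        self.sizes = sizes
        # One weight matrix and bias row per pair of adjacent layers.
        self.weights = [np.random.randn(n_in, n_out) * 0.01
                        for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros((1, n_out)) for n_out in sizes[1:]]

net = MLP((784, 64, 10))
print([W.shape for W in net.weights])   # [(784, 64), (64, 10)]
```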

Next we are going to design the forward pass.

Forward Propagation Method

As you can see, we are going to create two lists meant to store the activations and preactivations of our network. It is important to store these, as they are necessary to compute the gradients. We store the initial value for X in a variable called a, and then use a for loop to compute the preactivations and activations for each layer up until the final one. We are going to append the activations and preactivations to their respective lists. We then compute the preactivations and activations for our final layer.
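Written as a standalone function rather than a class method (a sketch under the assumptions above, not the author’s exact code), the forward pass looks like this:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    exps = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)

def forward(X, weights, biases):
    # Store preactivations (z) and activations (a) for every layer;
    # backpropagation will need them later.
    preactivations, activations = [], [X]
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        z = a @ W + b
        a = relu(z)
        preactivations.append(z)
        activations.append(a)
    # Final layer: softmax instead of relu.
    z = a @ weights[-1] + biases[-1]
    preactivations.append(z)
    activations.append(softmax(z))
    return preactivations, activations

X = np.random.rand(4, 3)
weights = [np.random.randn(3, 5) * 0.01, np.random.randn(5, 2) * 0.01]
biases = [np.zeros((1, 5)), np.zeros((1, 2))]
_, activations = forward(X, weights, biases)
print(activations[-1].shape)   # (4, 2) -- one probability row per example
```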

Compute Loss

We compute the loss that is added due to L2 regularization by multiplying the squared weights by 1/2 and the regularization parameter.

We then have a method that computes the cross entropy loss. It takes a y value in either one-hot encoded form or as regular labels. We use boolean indexing to pick out the probabilities assigned to the correct classes. We take the negative log of those probabilities and compute the mean, and then we add the regularization loss.
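Put together, the loss computation might look like this (a standalone sketch with integer labels only; the one-hot branch is omitted, and the values are toy numbers):

```python
import numpy as np

def loss(probs, y, weights, reg=0.01):
    # probs: output of the forward pass, shape (n_examples, n_classes).
    # y: integer class labels, shape (n_examples,).
    n = len(y)
    correct = probs[np.arange(n), y]        # probability of the true class
    data_loss = -np.mean(np.log(correct))
    # L2 penalty: half the regularization strength times the squared weights.
    reg_loss = sum(0.5 * reg * np.sum(W ** 2) for W in weights)
    return data_loss + reg_loss

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
weights = [np.zeros((3, 2))]                # zero weights: no reg penalty
total = loss(probs, np.array([0, 1]), weights)
print(total)   # about 0.164
```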

We use backpropagation to compute the gradients, and then add the gradients from the regularization process to the computed gradients. How does this code work?

Backpropagation

The backpropagation algorithm, although intimidating, is really just a series of simple matrix operations. First we calculate the errors, which are the partial derivatives of the loss with respect to the preactivated neurons. We do this by looping through our reversed preactivations. We use an if statement to check whether or not this is the final layer and compute its error accordingly. We insert the error at the last index of the list preactivation_gradients.

Note that the partial derivative of a layer’s output with respect to its incoming activations is the transposed weight matrix. We use the chain rule to compute the derivative of the activation layer with respect to the output. We then use numpy’s multiply method to chain the local gradients of the preactivation layer with the gradients of the activation layer with respect to the cost function.

We then use the enumerate built-in function to loop over the activations list. We know that the gradient of the weights is the activation layer each weight is multiplied by. We use the activation as each weight’s local gradient with respect to the preactivation of the upper layer and chain it to the gradient of the upper layer with respect to the cost function. We find the gradients of the biases by summing up the errors of each layer.
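The steps above can be sketched as a standalone function (my reconstruction under the assumptions in this section, not the author’s exact code; it uses list indexing rather than the insert-into-a-list approach described, but computes the same gradients):

```python
import numpy as np

def relu(z, derivative=False):
    if derivative:
        return (z > 0).astype(z.dtype)
    return np.maximum(0, z)

def backprop(y, weights, preactivations, activations):
    # y: integer labels; activations[-1] holds the softmax probabilities.
    n = len(y)
    errors = [None] * len(weights)
    # Last layer's error: probabilities minus the one-hot targets, averaged
    # over the batch -- the "f(x) - y" mentioned above.
    delta = activations[-1].copy()
    delta[np.arange(n), y] -= 1
    errors[-1] = delta / n
    # Walk backwards: chain through the transposed weights, then apply the
    # local relu gradient of each preactivation.
    for l in range(len(weights) - 2, -1, -1):
        errors[l] = (errors[l + 1] @ weights[l + 1].T) \
            * relu(preactivations[l], derivative=True)
    # Weight gradient: incoming activation chained with the layer's error;
    # bias gradient: the layer's error summed over the batch.
    weight_grads = [activations[l].T @ errors[l] for l in range(len(weights))]
    bias_grads = [e.sum(axis=0, keepdims=True) for e in errors]
    return weight_grads, bias_grads

# Tiny two-layer example wired by hand.
np.random.seed(0)
X = np.random.rand(4, 3)
y = np.array([0, 1, 0, 1])
W1, b1 = np.random.randn(3, 5) * 0.01, np.zeros((1, 5))
W2, b2 = np.random.randn(5, 2) * 0.01, np.zeros((1, 2))
z1 = X @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2
probs = np.exp(z2 - z2.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

weight_grads, bias_grads = backprop(y, [W1, W2], [z1, z2], [X, a1, probs])
print([g.shape for g in weight_grads])   # [(3, 5), (5, 2)]
```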

Parameter Updates

The update parameters method takes in the weight and bias gradients as arguments along with the learning rate. It is fairly straightforward: we loop over the weights and biases and subtract their gradients multiplied by our chosen learning rate.
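As a standalone sketch (the function name and toy values are my own), the update is just the gradient-descent step from earlier applied to every parameter:

```python
import numpy as np

def update_parameters(weights, biases, weight_grads, bias_grads, learning_rate):
    # Step each parameter against its gradient, in place.
    for i in range(len(weights)):
        weights[i] -= learning_rate * weight_grads[i]
        biases[i] -= learning_rate * bias_grads[i]

weights = [np.array([[1.0, 1.0]])]
biases = [np.array([[0.5, 0.5]])]
update_parameters(weights, biases,
                  [np.array([[0.2, -0.2]])], [np.array([[0.1, 0.1]])],
                  learning_rate=0.5)
print(weights[0], biases[0])   # [[0.9 1.1]] [[0.45 0.45]]
```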

Train network with mini-batch gradient descent

We train the network using mini-batch gradient descent, though you are welcome to look into other optimization techniques yourselves. We compute the number of times we will have to sample for the model to see all the data with the batch size you selected. We use numpy’s random.choice function to select the indices to use from our data, and we simply use the loss method to compute the loss and the gradients. The reason we do this in batches is that it is computationally expensive to do giant matrix multiplications.
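The sampling skeleton looks like this (toy data standing in for our images and labels; the forward/backward/update calls are elided as comments):

```python
import numpy as np

# Toy data standing in for our images and labels.
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

batch_size = 20
n_batches = len(X) // batch_size   # sample this many times per epoch

for _ in range(n_batches):
    # Pick a random batch of row indices, then slice out the batch.
    idx = np.random.choice(len(X), batch_size, replace=False)
    X_batch, y_batch = X[idx], y[idx]
    # ...forward pass, loss, backprop, parameter update on this batch...

print(n_batches, X_batch.shape)   # 5 (20, 3)
```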

Compute final predictions

The last method we need to define is the predict method. It simply applies numpy’s argmax function to the probability scores to tell us the class label the model assigned.
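As a one-liner sketch (standalone rather than a class method):

```python
import numpy as np

def predict(probs):
    # The assigned class is the column with the highest probability.
    return np.argmax(probs, axis=1)

probs = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
print(predict(probs))   # [0 1]
```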

That’s it. A pure Python implementation of a feed-forward neural network. In the following lessons we will go about gathering data. Check it out in this repository containing the file. It trains on some dummy data created by sklearn’s make_classification function. Use it as the basis for your own implementation, and play around with some of the hyperparameters.