DeepClassifyML Week 2 Part 3


This post is a part of the series ‘Hasura Internship’ and covers setting up Hasura for local development. In addition to that, we finally see ‘how a neural network learns’. Also check out my previous posts: Part 1, Part 2, Part 3, Part 4 for the app idea and some Computer Vision and Neural Network basics.

Setting up Hasura for local development was very straightforward since the instructions provided by Hasura in this README are well documented and super easy to implement. Here’s how I did it on my system:

Step 1: Install VirtualBox. VirtualBox is a free, open-source, cross-platform application for creating and running virtual machines (VMs): computers whose hardware components are emulated by the host computer (the computer that runs the program). It allows additional operating systems to be installed as a guest OS and run in a virtual environment. It is important to note that the host computer should have at least 4GB of RAM (because the VM might take up to 2GB of RAM). Also make sure you are running a 64-bit OS.

Step 2: Install hasuractl. The command for installing it on my system (Mac) was:

curl -Lo hasuractl https://storage.googleapis.com/hasuractl/v0.1.2/darwin-amd64/hasuractl && chmod +x hasuractl && sudo mv hasuractl /usr/local/bin/

Step 3: Install kubectl.

To get started with a project on Hasura, create an account on beta.hasura.io and then run the following command:

hasuractl login
Hasura Login

After logging in, run the following command (note: the first time you run it, it will download about 1–1.5GB of Docker images):

hasuractl local start
Starting a project

Additional commands to stop and delete Hasura projects:

hasuractl local stop      ## To stop the running hasura platform.
hasuractl local clean     ## To clean up the incomplete setup.
hasuractl local delete    ## To delete the underlying VM

Let’s quickly dive into ‘Backpropagation’ and ‘Gradient Descent’. In Part 4 we saw how a neural network makes a prediction during what we call ‘forward propagation’. We predicted whether a student gets into a university based on his/her previous scores. Now that we have a prediction, how do we know whether it is correct, and how close are we to the correct answer? This is what happens during ‘training’: updating the weights to improve the predictions.

What we’d like is an algorithm which lets us find those weights and biases so that the output from the network is close to the correct answer. (Remember, during training we have the students’ scores and also know whether they were admitted, i.e. we know the correct answer beforehand. What we want to find out is what happens when a new student comes in.) To measure this, we need a metric of how incorrect the predictions are. Let’s call it the ‘error’ (you will notice it is also known as the ‘cost function’ or ‘loss function’). The error can be written using the equation:

E = 1/2 ∑ᵢ (yᵢ − f(xᵢ))²

Sum of Squared Errors (SSE)

The error metric I used here is known as the Sum of Squared Errors (SSE). I chose it (there are other loss functions as well) because the square ensures the error is always positive, and larger errors are penalised more than smaller errors. Plus, it makes the math nice and less intimidating. Here f(x) is the prediction and y is the true value, and we sum over all data points i. This also makes sense, since in the end we want to measure how far our predictions are from the correct answers. If our neural network is not doing well, this ‘error’ will be large: f(x) will not be close to the output y for a large number of data points. Conversely, if the cost (error) becomes small, i.e. SSE(f) ≈ 0, then y is approximately equal to the prediction f(x) for all training inputs i, and we can conclude that the NN has done a good job. So the aim of our training algorithm will be to minimise this ‘error’ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make this ‘error’ as small as possible. We’ll do that using an algorithm known as gradient descent.
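As a quick sanity check, the SSE can be computed directly for a handful of predictions (the numbers below are made up for illustration; I include the conventional 1/2 factor, which makes the derivative cleaner and does not change where the minimum is):

```python
import numpy as np

## True answers y_i and the network's predictions f(x_i) (illustrative values)
y = np.array([1, 0, 1, 1])
f_x = np.array([0.9, 0.2, 0.6, 0.8])

## SSE = 1/2 * sum over all data points i of (y_i - f(x_i))^2
sse = 0.5 * np.sum((y - f_x) ** 2)
print(sse)   ## 0.125 (up to floating point rounding)
```

Notice that a prediction far from its true answer (0.6 vs 1) contributes much more to the total than a prediction that is almost right (0.9 vs 1).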

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. It requires us to calculate the gradient of the loss function (error) with respect to all the weights in the network to perform a weight update, in order to minimize the loss function. Backpropagation computes these gradients in a systematic way. Backpropagation along with Gradient descent is arguably the single most important algorithm for training Deep Neural Networks and could be said to be the driving force behind the recent emergence of Deep Learning.

Let’s understand this with a classic example. Suppose you are at the top of a mountain and want to reach the bottom (the lowest point of the mountain). So, how would you go about it? The best way is to look around in all possible directions, check the ground near you, and observe which direction descends the most. This gives you an idea of the direction in which to take your first step. You then repeat this process over and over again. If you keep following the descending path, it is very likely you will reach the bottom.

Plot for the Loss Function

Think of the large mountain as the error function. A random position on the surface of this plot is the error for the current values of the weights and biases. The bottom of the mountain (and of the plot) is the error for the best set of weights and biases, i.e. the minimum error. Our goal is to keep trying different values for the weights and biases, evaluate the error, and select new values that result in a slightly better (lower) error. Repeating this process enough times will lead us to the bottom of the mountain.

We take multiple small steps towards our goal. In this case, we want to change the weights in steps that reduce the error. Since the fastest way down a mountain is in the steepest direction, the steps taken should be in the direction that minimizes the error the most. We can find this direction by calculating the gradient of the squared error.

Gradient (or derivative) is another term for rate of change or slope. The derivative is a concept from calculus and refers to the slope of the function at a given point. We need to know the slope so that we know the direction (sign) in which to move the weight values in order to get a lower cost on the next iteration.

Let’s find the derivative of a function f(x). We take a simple function f(x) = x². The derivative gives us another function f′(x) that returns the slope of f(x) at point x. The derivative of x² is f′(x) = 2x. So, at x = 2, the slope is f′(2) = 4. Plotting this out, it looks like:

Graph of f(x) = x² and its derivative at x = 2
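If you want to convince yourself of that slope without calculus, a finite difference gives a numerical estimate (a small sketch of my own, not part of the original post):

```python
## Numerically estimate the slope of f(x) = x**2 at x = 2
def f(x):
    return x ** 2

def numerical_slope(f, x, h=1e-6):
    ## Central difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_slope(f, 2.0))   ## very close to the exact answer, 4
```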

The gradient is just the derivative generalized to functions with more than one variable. We can use calculus to find the gradient at any point in our error function, which depends on the input weights.
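Putting these ideas together, the whole ‘walk downhill’ procedure can be sketched in a few lines. Here I reuse the toy function f(x) = x² from above, whose gradient is 2x (this is my own minimal sketch, not the network update itself):

```python
## Gradient descent on f(x) = x**2: repeatedly step against the gradient 2x
x = 5.0              ## arbitrary starting point ("top of the mountain")
learning_rate = 0.1  ## size of each step

for step in range(100):
    gradient = 2 * x                  ## slope of f at the current position
    x = x - learning_rate * gradient  ## move in the direction that lowers f

print(x)   ## x is now very close to the minimum at 0
```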

The weights will be updated as:

W += n * delta * x

where n is called the ‘learning rate’, x is the input, and delta is the product of the error (y − f(x)) and the derivative of the activation function, f′(x). The gradient tells us the direction in which the function has the steepest rate of increase, but it does not tell us how far along this direction we should step. The step size is set by a constant: the learning rate, which is one of the most important hyperparameter settings in training a neural network.
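To see why the learning rate matters so much, here is a toy comparison (my own sketch) on the simple function f(x) = x², whose gradient is 2x: a moderate step size converges, while a step size that is too large overshoots the minimum and diverges:

```python
## Compare step sizes when minimizing f(x) = x**2 (gradient is 2x)
def descend(learning_rate, start=5.0, steps=20):
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x   ## one gradient descent step
    return x

print(descend(0.1))   ## shrinks towards the minimum at 0
print(descend(1.1))   ## each step overshoots further: x blows up
```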

We again use numpy for this purpose:

import numpy as np
## Sigmoid (Activation) function
def sigmoid(x):
    return 1/(1+np.exp(-x))
## Derivative of the Sigmoid, written in terms of the sigmoid's output s = sigmoid(x)
def sigmoid_derivative(s):
    return s * (1 - s)
## Grades for a single student in 4 subjects i.e. only 1 data point
inputs = np.array([50, 22, 10, 45])
## Correct answer (1 : admitted, 0: not admitted)
y = np.array([1])
## Initialise the weights and bias randomly
initial_weights = np.array([0.01, 0.8, 0.02, -0.7])
bias = -0.1
## Set a value for learning rate
learning_rate = 0.001
## Our Prediction (f(x))
output = sigmoid(np.dot(initial_weights, inputs) + bias)
## Calculate the error i.e. how incorrect we are
error = y - output
## delta = error * f'(x); since output = sigmoid(x), f'(x) = output * (1 - output)
delta = error * sigmoid_derivative(output)
## Gradient descent step
change_in_weights = learning_rate * delta * inputs
## Updating our weights
new_weights = initial_weights + change_in_weights
print('Initial Weights: {}'.format(initial_weights))
print('Our prediction: {}'.format(output))
print('Amount of Error: {}'.format(error))
print('Change in Weights: {}'.format(change_in_weights))
print('New weights: {}'.format(new_weights))

Output:

Initial Weights: [ 0.01  0.8   0.02 -0.7 ]
Our prediction: 1.6744904055114616e-06
Amount of Error: [ 0.99999833]
Change in Weights: [  8.37242399e-08   3.68386655e-08   1.67448480e-08   7.53518159e-08]
New weights: [ 0.01000008  0.80000004  0.02000002 -0.69999992]
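The single update above can be repeated for many iterations to actually train the weights. In the sketch below (my own extension of the post’s example, not part of the original) I also scale the inputs down, since raw scores in the tens push the sigmoid deep into saturation, where gradients are tiny:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

## Same toy problem, with inputs scaled so the sigmoid is not saturated
inputs = np.array([50, 22, 10, 45]) / 100.0
y = 1.0                                      ## correct answer: admitted
weights = np.array([0.01, 0.8, 0.02, -0.7])  ## same starting weights
bias = -0.1
learning_rate = 0.5

for epoch in range(1000):
    output = sigmoid(np.dot(weights, inputs) + bias)
    error = y - output
    ## For the sigmoid, f'(x) = output * (1 - output)
    delta = error * output * (1 - output)
    weights = weights + learning_rate * delta * inputs
    bias = bias + learning_rate * delta

print(sigmoid(np.dot(weights, inputs) + bias))   ## prediction has moved close to 1
```

After enough iterations the prediction approaches the correct answer of 1, which is exactly the ‘training’ described above: many small gradient descent steps down the error surface.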

Though it is always helpful to understand the concepts behind Backpropagation, if you found the maths hard to follow, that’s fine. The Machine Learning and Deep Learning libraries we use (scikit-learn, TensorFlow etc.) have built-in tools to calculate everything for you.

(Edit: Please report any errors or discrepancies in comments or you can reach out to me: akshaybhatia10@gmail.com)