Build Your First Neural Network

Tej Prakash Agarwal
VIT Linux User Group
May 23, 2020

“YOU DON’T HAVE TO BE GREAT TO START, BUT YOU HAVE TO START TO BE GREAT.”

–ZIG ZIGLAR

Today, the world is facing a very tragic situation in which much of it is locked down, bringing many activities to a halt. Well, we could rather use this time to learn something new and interesting. I am writing this article to help those who are just trying to get started with neural networks. So, without wasting time, let's get going!

So, basically, we are not going to build a highly optimized neural network; rather, we will discuss the basic structure of a neural network and its core components. This network will still perform reasonably well, but we can always make it better, and for that we would have to look at various metrics and check whether our model is overfitting or underfitting the training dataset. I would like to keep that for subsequent articles; this one is just the basics. So, we will build a model that can predict whether it will rain or not, given certain conditions like temperature and wind speed.

1. GETTING A DATASET

So, to start building our first neural network, we first need a dataset to work on. For this article I will be using this dataset:

https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

2. IMPORTING NECESSARY LIBRARIES

So in this section we will import libraries that we need to work on our data and train our model.

import numpy as np
import pandas as pd
import math

3. GETTING FAMILIAR WITH THE DATASET

We can use pandas.read_csv to load the dataset into a DataFrame and then analyze it.

df = pd.read_csv("weatherAUS.csv")

Now we can use df.head() to view the first few rows of the dataset:

print(df.head())

This prints the first five rows of the DataFrame.

Now we may use df.describe() and df.info() to know more about our columns or attributes.

print(df.describe())
print(df.info())

4. DATASET PREPROCESSING

Now, before we jump on to our work we first need to preprocess our data and make it fit for our use just like we peel a fruit before eating it. So, we need to follow the steps below:

i) Using np.NaN

In our dataset we have a lot of cells containing 'NA', which means the data is not available, so we replace these cells with np.NaN, which is easier to work with.

df[df == 'NA'] = np.NaN

If you print the DataFrame again, the cells that contained 'NA' now show NaN.

ii) Replacing 'Yes' and 'No' with 1 and 0

Now, in our “RainTomorrow” column we have values “Yes” and “No”. You can check this as follows:

print(df['RainTomorrow'])

First we need to convert them into 1 and 0 so as to use them to train our model.

df.RainTomorrow.replace(('Yes', 'No'), (1, 0), inplace=True)

Note: in the rows shown here we don't see any 'Yes' values, but they are converted to 1 in the same way.
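
If you want to confirm the conversion, a quick optional check is to print the unique values of the column; after the replacement you should see only 1, 0 and NaN:

print(df['RainTomorrow'].unique())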

iii) Defining X and y

Now, defining X and y, i.e., our input and output arrays, is an important step.

We can define “y” as follows:

y = df['RainTomorrow'].values
y = y.reshape(y.shape[0], 1)

The above lines mean we take 'RainTomorrow' as our output. .values returns a NumPy representation of the DataFrame: only the values, with the axis labels removed. We also reshape y so that its shape is (m, 1) rather than (m,), which will be useful later on.
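
As a quick optional sanity check, you can confirm the reshape worked:

print(y.shape)    # should be (m, 1), where m is the number of rows in the dataset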

Next up is defining X which is as follows:

X = df.drop('RainTomorrow', axis=1)

This means we allocate to X all the columns except 'RainTomorrow' (which we remove using .drop()). As you can notice, I didn't use .values here, because before that I need to do one more thing: one-hot encoding. But what is it? Some columns in our DataFrame contain categorical values which we can't use directly in our mathematical model, so we need to split those columns. But how? Each categorical column is split into as many new columns as it has categories, and each new column contains a 1 or a 0 telling whether a particular row belongs to that category or not.

Image from kaggle
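
To make this concrete, here is a minimal toy example of pd.get_dummies (the column name and values below are made up just for illustration):

toy = pd.DataFrame({'WindGustDir': ['N', 'S', 'E', 'N']})
print(pd.get_dummies(toy, columns=['WindGustDir']))
# Each category becomes its own 0/1 column, e.g.:
#    WindGustDir_E  WindGustDir_N  WindGustDir_S
# 0              0              1              0
# 1              0              0              1
# 2              1              0              0
# 3              0              1              0
# (newer pandas versions may display True/False instead of 1/0)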

So let's implement it.

X = pd.get_dummies(X.drop('Date', axis=1), columns=['RainToday', 'WindGustDir', 'WindDir3pm', 'WindDir9am', 'Location'], drop_first=True)

Now I will just apply .values.

X = X.values

Now, just one last step and we can get going. Remember the np.NaN values we created from the 'NA' cells? We still can't use them directly, so we will replace each of them with the mean of its column. For this we will use NumPy masked arrays (np.ma):

X = np.where(np.isnan(X), np.ma.array(X, mask=np.isnan(X)).mean(axis=0), X)

So that's pretty much it for preprocessing, and now we have our data just as we need it; in other words, our fruit is peeled. Let's enjoy it.

5. CREATING THE SIGMOID FUNCTION (and its derivative)

We are going to use the sigmoid function as our activation function, so we need to write a function to implement it in Python. We are also going to need its derivative during back propagation, so we had better write that down as well.

def sigmoid(z):                 # sigmoid function
    s = 1/(1+np.exp(-z))
    return s, z                 # z is returned so it can be cached for back propagation

def diff_sigmoid(z):            # derivative of the sigmoid function
    s = 1/(1+np.exp(-z))
    ds = s*(1-s)
    return ds

6. INITIALIZING WEIGHTS

Now, first of all, we need to initialize our weights and biases. To initialize the weights we first need to decide how many nodes or units each layer will have, because the sizes of our weight matrices depend on it. For this article I will be using a simple neural network with just 2 hidden layers.

Now, for this we will create a list with the number of nodes in each layer (including the input layer):

nodes = [X.shape[1],10,10,1]

I have used X.shape[1] because the number of columns in X is the number of nodes in the input layer.

Now, we define our initialization function:

def initialize(nodes):
    params = {}
    L = len(nodes)
    for l in range(1, L):
        params['W' + str(l)] = np.random.randn(nodes[l], nodes[l-1])*0.01
        params['b' + str(l)] = np.zeros((nodes[l], 1))
    return params

Now, let's try to understand what's happening here. First we pass the nodes list to the initialize function. Then we declare a dictionary for our parameters, i.e., W1, W2, W3, b1, b2, b3 (with more layers we would have even more matrices), and initialize them inside the loop. As we know, the weight matrix of a layer has shape (m, n), where m is the number of nodes in that layer and n is the number of nodes in the preceding layer. Similarly, the bias matrix has shape (m, 1), where m has the same meaning as before. Also, you can see I multiplied by 0.01 while initializing the weights. This is because I am using the sigmoid function for activation, which looks like this:

Image from Wikipedia

As you can see, for large values of |z| the curve becomes almost flat, so the gradients there are tiny; this makes gradient descent slow and leads to longer training time. To avoid this, we keep the initial weights small by multiplying by 0.01.
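
As a rough illustration (using the diff_sigmoid function from section 5), the gradient of the sigmoid is largest near z = 0 and almost vanishes for large |z|, which is why small initial weights help:

print(diff_sigmoid(np.array([0.0, 2.0, 10.0])))
# roughly [0.25, 0.105, 0.0000454]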

7. FORWARD PROPAGATION

So, finally we have reached the forward propagation.

“Some people want it to happen, some people wish it would happen, others make it happen.”
-Michael Jordan

Let’s move into the interesting part and finish the journey.

Now, as we know, forward propagation involves calculating the linear function

Z = WA + b

(where Z is the output of the linear function for a particular layer, A is the activation of the previous layer, and W and b are the weights and biases of that same layer.)

And, then we calculate activations on these linear functions by the formula

σ(Z)

where σ is the sigmoid function given by

σ(z) = 1 / (1 + e^(-z))

So, first we implement linear function.

def linear(A, W, b):
    Z = np.dot(W, A) + b
    cache = (A, W, b)
    return Z, cache

Now, don't worry about what cache is; we will see it in the back propagation section. We just use it to store the activations of the previous layer and the weights and biases of each layer, which we will need later during back propagation. As for Z, it is the product of W and A (the activation outputs of the previous layer) plus the bias b.

Now, comes the activation function.

def activation(A_prev, W, b):
    Z, linear_cache = linear(A_prev, W, b)
    A, activation_cache = sigmoid(Z)
    cache = (linear_cache, activation_cache)
    return A, cache

Here we just use our previously defined functions sigmoid and linear: we first compute the linear output with linear and then apply sigmoid to it. The activation cache is nothing but Z, as returned by our sigmoid function.

Now, let’s combine all we did above to create a final function forward_prop().

def forward_prop(X, params):
    caches = []
    A = X
    L = len(params) // 2                # number of layers
    for l in range(1, L):
        A_prev = A
        A, cache = activation(A_prev, params["W" + str(l)], params["b" + str(l)])
        caches.append(cache)
    AL, cache = activation(A, params["W" + str(L)], params["b" + str(L)])
    caches.append(cache)
    return AL, caches

Here we just use a for loop to iterate over the layers, apply the activation function we created above to each of them, and return the activation of the final layer (in other words, the predicted output) along with the caches.

Let's see our forward_prop() function in action. Remember, we already wrote the function to initialize our parameters and biases; now we just pass them to forward_prop():

params = initialize(nodes)
AL, caches = forward_prop(X.T, params)
print(AL)
print(AL.shape)

The output is quite self-explanatory: we get a vector with one entry per training example, where each entry is the predicted output for that example. Also, notice that I pass the transpose of X to forward_prop(), not X itself. That's because our NumPy array X contains the training examples row-wise, while our model assumes the training examples are stacked column-wise.
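
As an optional double check, the shape of AL should line up with the number of rows in X:

assert AL.shape == (1, X.shape[0])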

8. COST FUNCTION

Before moving forward we must implement the cost function. The cost function quantifies the error of the predicted output with respect to the actual output and expresses it as a single number.

The cost function we are going to use here is the cross-entropy cost:

J = -(1/m) * Σ [ y*log(a) + (1 - y)*log(1 - a) ]

where m is the number of training examples, y is the true label and a is the predicted output. Let's implement it in Python.

def compute_cost(AL, Y):
    m = Y.shape[1]
    cost = (-1/m)*np.sum(Y*np.log(AL) + (1-Y)*np.log(1-AL))
    cost = np.squeeze(cost)
    return cost

9. BACK PROPAGATION

So, gradually we are approaching our destination: we have finally reached back propagation. Back propagation also involves 3 steps.

i) Calculate dAL

Here we calculate the derivative of our cost function w.r.t. the activation of the output layer.

ii) Calculate dZ

Using dA and the chain rule of differentiation, we calculate the derivative of our cost function w.r.t. the linear function (WA + b).

iii) Calculate dW, db and dA

Similarly, we use the chain rule and dZ to calculate dW and db, the derivatives of our cost function w.r.t. the weights and biases. We also calculate dA, i.e., the derivative of our cost function w.r.t. the activation of the previous layer.

First, we implement a function to calculate dW, db and also dA from dZ.

We will use these formulae:

dW = (1/m) · dZ · A_prevᵀ
db = (1/m) · Σ dZ   (summed over the training examples)
dA_prev = Wᵀ · dZ

Here, m is the number of training examples, A_prev is the activation of the previous layer, and W, b, Z have their usual meaning; dW, db and dA_prev are the derivatives of the cost J with respect to W, b and A_prev.

Note: if keepdims is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array. (from the official NumPy docs)
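
Here is a small illustration of what keepdims does (the values are arbitrary):

M = np.array([[1., 2.], [3., 4.]])
print(np.sum(M, axis=1).shape)                 # (2,)
print(np.sum(M, axis=1, keepdims=True).shape)  # (2, 1) -- the reduced axis is kept, so it broadcasts like a column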

def dW_dB(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = (1/m)*np.dot(dZ, A_prev.T)
    db = (1/m)*np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db

Now, to use the above function we first need dZ. So, let's calculate dZ from dA using the following formula:

dZ = dA · g'(Z)

Here, g(z) is the activation function (sigmoid in our case), and Z and A have their usual meaning.

def dZ(dA, cache):
    linear_cache, activation_cache = cache
    dZ = diff_sigmoid(activation_cache)*dA
    dA_prev, dW, db = dW_dB(dZ, linear_cache)
    return dA_prev, dW, db

Here, cache is the same one we stored during forward propagation, and the activation cache stores the values of Z.

Now we need to define a function back_prop() to sum up everything: it will calculate dAL and then call the above 2 functions to compute dW and db for each layer, which we will later use to update the parameters.

def back_prop(AL, Y, caches):
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = dZ(dAL, current_cache)
    for l in reversed(range(L-1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = dZ(grads["dA" + str(l+1)], current_cache)
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
    return grads

10. GRADIENT DESCENT

Now, to use gradient descent we need to update our parameters on each iteration for which we need to use the following function.

def update(params, grads, learning_rate):
    L = len(params) // 2
    for l in range(L):
        params["W" + str(l+1)] = params["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        params["b" + str(l+1)] = params["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
    return params

This function returns the updated parameters after one iteration.

11. FINAL MODEL

Now that we have all the necessary functions implemented, we just need to define one function to integrate them all; we will call it fit().

def fit(X, Y, nodes, learning_rate=0.0075, num_iterations=1000, print_cost=False):
    costs = []
    params = initialize(nodes)
    for i in range(0, num_iterations):
        AL, caches = forward_prop(X, params)
        cost = compute_cost(AL, Y)
        grads = back_prop(AL, Y, caches)
        params = update(params, grads, learning_rate)
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)
    return params

Now that we know how to fit our model, let's define a function to make predictions using the parameters returned by fit().

def predict(X, params):
    AL, cache = forward_prop(X, params)
    for i in range(AL.shape[1]):
        if AL[0, i] <= 0.5:
            AL[0, i] = 0
        else:
            AL[0, i] = 1
    return AL

In the end, let’s calculate the accuracy of our model.

def accuracy(AL, Y):
    ACC = AL == Y
    return (np.sum(np.count_nonzero(ACC, axis=0)))/(ACC.shape[0]*ACC.shape[1])

Here, np.count_nonzero() with the axis parameter counts the non-zero entries of a multi-dimensional array along that axis (each dimension).

So, I count the number of cells for which AL == Y and divide by the total number of cells to get the accuracy.
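
Here is a tiny worked example of accuracy() with made-up predictions and labels:

AL_toy = np.array([[1, 0, 1, 1]])
Y_toy = np.array([[1, 0, 0, 1]])
print(accuracy(AL_toy, Y_toy))   # 0.75 -- 3 out of 4 predictions match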

12. ENDGAME

So, we are in the endgame now. Having written our fit(), predict() and accuracy() functions, we are all set. Now we are going to split our data into training and test sets, train our model on the training data, test it on the test data and calculate the accuracy.

i) Divide the dataset

I will use 80% of the original data as training data and the rest as test data.

training_data_size = math.floor(X.shape[0]*0.8)   # number of examples used for training
train_X = np.asarray(X[:training_data_size])
train_y = np.asarray(y[:training_data_size])
test_X = np.asarray(X[training_data_size:])
test_y = np.asarray(y[training_data_size:])

ii) Train and Test

params = fit(train_X.T, train_y.T, nodes)
AL = predict(test_X.T, params)
acc = accuracy(AL, test_y.T)    # transpose test_y so its shape matches AL
print("accuracy >> ", end='')
print(acc)

So, as you can see we have achieved an accuracy of about 80% which is pretty good. We can increase this accuracy using various techniques which I leave for subsequent articles. I hope you gain something from this small effort of mine.

“Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.”

–Winston Churchill
