Introduction to Neural Networks, from scratch for practical learning (Part 1)

Pranjall Kumar
19 min read · Nov 6, 2021


Artificial Neural Networks (ANNs) are usually the first topic you learn when you decide to take a dive into the world of Deep Learning. Here we are literally trying to mimic our brain! This topic can, however, quickly become very daunting, as it involves concepts from Multivariate Calculus, Convex Optimization, and Linear Algebra. I would highly recommend first getting a good understanding of these topics so that the most elementary aspects of an ANN are clear to you.

That being said, I would like to introduce you to ANNs in a very gentle way. I personally believe in practical learning and learned a lot when I implemented ANNs from scratch by myself. It was a eureka moment to see all my understanding fall into place. I wish to give the same experience to you.

So without further ado, let us begin. In this part, i.e. Part 1, I will first show you a simple neuron in action. Yes, just one neuron, and we will also use it to solve a toy problem. Then, in Part 2 of this article, I will move on to a full-fledged neural network and solve a much more complicated problem. Now, let's start coding.

I will be assuming that you are using Google Colab. However, this is not mandatory; you can also do it locally in a Jupyter Notebook using Plotly offline. I would also recommend using dark mode for the best experience. First, we will mount Google Drive in Google Colab. It can be easily done using the following code. Just follow the steps and grant the required permission.

#code to mount drive
from google.colab import drive
drive.mount('/content/drive')

Now we will import the required libraries and set the default renderer to Google Colab for Plotly.

#importing libraries
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.graph_objs as go

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#setting the renderer as colab
pio.renderers.default = "colab"

Now let's load the dataset. The dataset is available on Kaggle and can be found here: Dataset. Kindly note that I have made a folder named ANN in MyDrive and uploaded the dataset there. The dataset did not have any column names, so I named them explicitly.

#loading dataset
dataset = pd.read_csv("/content/drive/MyDrive/ANN/Social_Network_Ads.csv")
dataset.columns = ["Unique_ID", "Gender", "Age", "Salary", "Purchased"]
dataset.head()
Social_Network_Ads.csv, Image by author.

You will see an output something like this. The data consists of 5 columns: 'Unique_ID', 'Gender', 'Age', 'Salary', and 'Purchased'.

Looking at this dataset it is clear that we are supposed to find some relation between the features ‘Gender’, ‘Age’ and ‘Salary’ and the target ‘Purchased’. Meaning, can we predict whether a person will purchase from the advertisement shown, given that the person’s gender, age, and salary are known?

It is a small dataset, which makes it a nice toy classification problem for us. Let's solve this toy problem using a single neuron! Just to make the data easy to plot, I am simply dropping the 'Gender' column from our analysis. In reality, this column could have some impact on the purchasing pattern; ideally, whether a column should be dropped or not depends on your domain knowledge and other statistical information. But for now, let's just take the data in the following manner.

#extracting features and target variable
X = dataset[["Age", "Salary"]]
Y = dataset[["Purchased"]]

You can see I have just taken columns 2 and 3 i.e. ‘Age’ and ‘Salary’ as features for easy plotting. Now that we have our features and target variables ready, we can move to the next step.

Before moving ahead, we must first check whether some pattern exists in the data or not. If we as humans can't see any pattern, how will we be able to develop a good classification model for it? Since this data is small, we can simply plot it and try to spot a pattern ourselves. I first separate the purchased and not-purchased data as follows.

#separating data
X_purchased = X.where(Y["Purchased"] == 1)
X_not_purchased = X.where(Y["Purchased"] == 0)

Now let’s plot and see the data!

#visualizing data
trace0 = go.Scatter(
    x = X_purchased["Age"],
    y = X_purchased["Salary"],
    mode = "markers",
    name = "Purchased"
)
trace1 = go.Scatter(
    x = X_not_purchased["Age"],
    y = X_not_purchased["Salary"],
    mode = "markers",
    name = "Not Purchased"
)
data = [trace0, trace1]
fig = go.Figure(data)
fig.update_layout(title = "Salary vs Age",
                  xaxis_title = "Age", yaxis_title = "Salary",
                  template = "plotly_dark")
fig.show()

This code plots the graph in Plotly. If you are wondering how you can learn to plot beautiful graphs in Plotly, I highly recommend having a look at this article: A complete introduction to Plotly, from beginner to advanced. It will take you from beginner to advanced plotting in no time. Moving back to our current data, once you run the snippet, you will see a scatter plot that looks as follows.

Salary vs Age, Image by author.

You can clearly see there seems to be a nice pattern. Young people with lower incomes tend not to buy from advertisements, but older people do prefer buying from them. So now that we can see a pattern, we wish to explain this pattern to the machine. I could write if-else conditions, but that wouldn't be a generic approach and would be strictly limited to solving this toy problem with this exact kind of data.

We can do much better. We can train the machine to find a separating plane that classifies the data all by itself! There are many ways to do that, but since I want to introduce you to ANNs, I will solve this problem using a simple neuron. Yes, just one neuron is powerful enough to give us some satisfactory results! More on this later.

But first, I will standardize the data. This is done so that each feature has an equal contribution when we train the model. From the plot above, it is clearly visible that there is little variation in the data on the x-axis (Age), only in the order of tens. On the contrary, there is a huge variation in the data on the y-axis (Salary), in the order of thousands. This can cause the feature 'Salary' to dominate if we train the model on such data, as the separating plane will not be able to account for the small variations in 'Age'. So, getting the data to the same scale across features is very important. I am going to standardize the data, meaning I will make the mean 0 and the standard deviation 1 for each feature. It can be done using an inbuilt library, but I would urge you to try doing it yourself without the library. It's not difficult at all! The code to standardize the data is given below.

#standardizing data
X_sc = StandardScaler()
X = X.astype('float')
X = pd.DataFrame(X_sc.fit_transform(X))
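
If you do want to try it by hand, a minimal sketch could look like the one below. It works on a fresh copy of the raw columns (since 'X' has already been overwritten above) and uses the population standard deviation (ddof = 0), which is what StandardScaler() uses internally.

#a minimal sketch of standardizing by hand, for comparison with StandardScaler()
X_manual = dataset[["Age", "Salary"]].astype('float')
X_manual = (X_manual - X_manual.mean())/X_manual.std(ddof = 0)
print(X_manual.mean())         #means are (numerically) 0
print(X_manual.std(ddof = 0))  #standard deviations are 1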

Now, when you plot the data, it will look slightly different. Note that the data type of 'X' has changed (it is now a plain DataFrame with numeric column names), and hence the way we access the columns has changed as well. Casting the values from 'int' to 'float' was done to let StandardScaler() do its job properly.

#visualizing data
trace = go.Scatter(
    x = X[0],
    y = X[1],
    mode = "markers"
)
data = [trace]
fig = go.Figure(data)
fig.update_layout(title = "Salary vs Age",
                  xaxis_title = "Age", yaxis_title = "Salary",
                  template = "plotly_dark")
fig.show()

The output will be as follows. Note how the data is now on a similar scale.

Salary vs Age (Standardized), Image by author.

Now, I will split the data into train and test sets. The train data, as the name suggests, will be used to train our neuron to learn the inherent pattern in the data, and the test data will be used to figure out how well our neuron learned that pattern. The code below will do the split for you.

#splitting data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    random_state = 69,
                                                    test_size = 0.25)

Here you can see that 25% of the data is being used for testing and the rest is being used for training.

I will now move on to converting the data into NumPy arrays. This is done to utilize the concept of 'vectorization' in Python.

#converting to numpy arrays
X = np.array(X_train)
Y = np.array(Y_train)
m = len(Y_train)
Y = Y.reshape(m, 1) #making size (300,) to (300, 1)
X_test = np.array(X_test)
Y_test = (np.array(Y_test)).reshape(len(Y_test), 1)

Notice how I made sure to change the shape of 'Y' from 300 flat items to 300 rows and 1 column.
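
To see what 'vectorization' buys us, here is a small optional illustration with dummy weights of 1 (not learned values): the single vectorized np.dot() call produces the same result as an explicit Python loop over every row, just much faster.

#optional illustration of vectorization, using dummy weights of 1
dummy_W = np.ones((X.shape[1], 1))
P_loop = np.array([[X[i][0]*dummy_W[0][0] + X[i][1]*dummy_W[1][0]] for i in range(m)])
P_vec = np.dot(X, dummy_W)
print(np.allclose(P_loop, P_vec))  #True, but np.dot() is far faster on large data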

Finally, we have our data ready! We can now move on to implementing the neuron. But first, let me give you a small introduction to neurons and neural networks. The diagram below will help you a lot.

An Artificial Neuron:

Artificial Neural Networks, Image by author.

You can see a neuron on the left; a neural network is simply a connection of such neurons, which can be seen on the right (the bubbles for the propagation function and the activation function are clubbed into one). The inputs are the features that we provide. In our case we have just two features, 'Age' and 'Salary', which can correspond to 'x1' and 'x2' if you like. A dummy input of 1 is also considered. This is to facilitate a special weight called the bias, 'w0' in our case, though it is mostly written as 'b'. Each input has a corresponding weight assigned to it, which, in layman's terms, signifies the weightage of that input. Mathematically, once you pass the features along with their weights through the propagation function, as shown above, you get w2*x2 + w1*x1 + w0. Now, if you have some knowledge of 3D surfaces, you will quickly recognize this as the equation of a plane. Do you see now why w0 is important? It makes the plane independent of the origin, giving us the flexibility to place it wherever we want. For example, consider the equation y = mx + c. Remove c (called the y-intercept) and it simply becomes y = mx, and such a line (no matter what value of m you choose) will always pass through the origin.

Thus we have the equation of the plane that is supposed to split this data into two parts. Keep in mind that 'x1' and 'x2' are known! We have to figure out the values of 'w0', 'w1', and 'w2', i.e. where the plane should lie so that it optimally splits the given data into two parts. One side of the plane will be for 'purchased' and the other for 'not purchased'.

Image by author

For a single neuron, the activation function I will use is a sigmoid function. Such an activation function ensures the value stays between 0 and 1 as you can see from the graph alongside. It is given by 1/(1+e^-x).

No matter what value you provide to it (the distance of the point from our separating plane, in this case), it will convert it to a value between 0 and 1. This is helpful since we have 2 classes, 'purchased' and 'not purchased': we can simply convert values less than 0.5 to 0 and the rest to 1 to classify the points into the two classes.
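
To make this concrete, here is a tiny sketch of what a single neuron computes for one point. The weights below are made-up placeholders, not learned values.

#a tiny sketch of one neuron: propagation, sigmoid activation, thresholding
#the weights here are made-up placeholders, not learned values
def neuron(x1, x2, w0 = 0.1, w1 = 0.5, w2 = 0.5):
    p = w2*x2 + w1*x1 + w0         #propagation function (equation of a plane)
    a = 1/(1 + np.exp(-p))         #sigmoid squashes p into (0, 1)
    return 1 if a >= 0.5 else 0    #threshold at 0.5 to pick a class

print(neuron(-1.0, -0.5), neuron(1.5, 2.0))  #0 and 1 for these toy inputs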

But in a neural network, the activation function performs a very important role. It provides non-linearity (more on that later in part 2, where we discuss neural networks).

Ok, back to the topic of finding the right values of 'w0', 'w1', and 'w2'. How do we do that? For that, we have the concepts of the error function and gradient descent. As you might have guessed, the role of the error function is to tell us how wrong the separating plane is, and gradient descent helps gradually reduce that error to as low as practically possible. All right then, how do we define the error function and apply gradient descent to it?

Error function:

Let me explain the concept of an error function using the example shown in the diagram below. You can see there is a line, called a regression line, sort of following the trend of the given data points. Our job is similar to this, except that we are trying to find a separating plane rather than a regression line. The error function for fitting such a line can be defined as follows.

Image by author

Let yj be the actual value of any one of these points in green. Let the equation of the line be given by w1*x + w0 (just like mx + c). Let us say that for some values of 'w1' and 'w0' we get the line shown alongside. Then the predicted value for some point xj corresponding to yj will be given by w1*xj + w0 (this is the point on the line corresponding to yj). Now, as you can see, none of these points actually lies on the line. So there will be some error between the actual value and the predicted value, given by (actual - predicted), i.e. simply yj - (w1*xj + w0). The farther apart the actual and predicted values are, the larger the error! We take all such errors and take their mean to get the total error. But there is one catch. The total error is given by,

Sum((yj - (w1*xj + w0))²)/n for all j points in the dataset, where n is the total number of data points (a tiny numerical example of this formula follows the list below). You might wonder, why the power of 2? A few immediate reasons are:

  1. It removes negative distances.
  2. It makes small errors smaller and large errors larger, e.g. 0.4 becomes 0.16 while 4 becomes 16.
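
Here is a tiny numerical example of the formula above, with made-up points and a made-up line, just to see the pieces together.

#a tiny numerical example of the squared error formula, with made-up numbers
y_actual = np.array([2.0, 3.0, 5.0])        #the yj values
x_points = np.array([1.0, 2.0, 3.0])        #the xj values
w1_guess, w0_guess = 1.0, 1.0               #some guessed line, y = x + 1
y_predicted = w1_guess*x_points + w0_guess  #predicted values on the line
error = np.sum((y_actual - y_predicted)**2)/len(y_actual)
print(error)  #((0)**2 + (0)**2 + (1)**2)/3, roughly 0.33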

But actually, the main reason is to make the error function a convex function. Now, what is a convex function? Simply put, it is a function that has a guaranteed single global minimum across its entire domain (a place where the gradient is zero). What does that mean, and how does that help? Here is an example. Consider these two figures.

Image by author

The function on the left is a parabola, a classic example of a convex function. You can see how it has a single global minimum over its entire domain. We want all our error functions to be convex, so that no matter where we start our gradient descent, we will reach the minimum at some point in time (provided a good learning rate).

Image by author

This function on your left is not convex. You can see that if we start our gradient descent from, say, x = 1, we will get stuck at the point x = 0 because the gradient is zero there. But the minimum is actually somewhere beyond x = -1. So we try to avoid such non-convex functions as error functions.

Conceptually, checking whether a function is convex is very easy! Just draw a line joining any two points on the function; if the function always lies below this hypothetical line, or just touches it, it is convex. Note that this must hold for every pair of points you pick on the function. So, if I connect, say, the points at x = -1 and x = 0 with a straight line, you can see that the function never actually goes below that line at all. So it is definitely not a convex function.

But that is easier said than done. Proving mathematically that a function always lies below the hypothetical straight line for all points in its domain can be challenging, more so for multivariate functions. For now, just take it on faith that the error function for fitting the regression line, sum((yj - (w1*xj + w0))²)/n, is a convex function, and that the square term plays an important part in that, essentially making it a paraboloid.

So now we understand the importance of the error function and the requirement that it be convex. Let's build the error function for our example. We are not fitting a regression line; we are fitting a separating hyperplane. So you can imagine that the error function should look a little different. In fact, it looks a lot different. Note that yj here takes only the discrete values 0 and 1, unlike the previous example where it was continuous. So we must pass the output of our separating plane through a sigmoid function to bring the values between 0 and 1. Hence, the function of the plane replaces the x in 1/(1+e^-x), and that, in turn, becomes our predicted value. We also need to make sure that our error function stays convex. Considering all the facts mentioned above, we get the following error function.

E = -sum(yj*log(sigmoid(w1*xj + w0)) + (1-yj)*log(1 - sigmoid(w1*xj + w0)))/n

Goodness me! Quite a mouthful, right? Anyway, it is what it is. For now, please take it on faith that this too is convex; I can't cover everything in one article. Also, I got away with writing sigmoid() instead of 1/(1+e^-x). This is the error function of Logistic Regression. Note how the name says regression, since it is derived from the same concepts as Linear Regression, but it actually does classification!

However, you can make some sense out of it. When yj is 1, the term with (1-yj) becomes zero. If the predicted value, i.e. sigmoid(w1*xj + w0), is close to 0 (an incorrect prediction), then the error will be high, since the limit of log(x) as x tends to 0 (from the right) is -infinity. And if the prediction is close to 1, then the limit of log(x) as x tends to 1 (from the left) is 0, giving a small error. A similar thing happens when yj is 0.
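
To see this penalty in action, here is a small sketch that evaluates the per-point error for a few made-up predicted values.

#per-point error when yj = 1: only the first term survives
for a in [0.99, 0.5, 0.01]:
    print(a, -np.log(a))       #tiny error near 1, huge error near 0
#per-point error when yj = 0: only the (1 - yj) term survives
for a in [0.01, 0.5, 0.99]:
    print(a, -np.log(1 - a))   #tiny error near 0, huge error near 1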

Gradient Descent:

Now we will move on to gradient descent and how it helps reduce this error and automatically lets the machine find the values of 'w2', 'w1', and 'w0'. It is a very simple concept. In 2D, the gradient is simply the derivative of the function (the slope of the tangent to the curve at a given point). In multiple dimensions, it points in the direction of the fastest ascent (the opposite of descent) of the function, and its magnitude is the rate of increase in that direction. Since we took care to make our error function convex, we can literally start from anywhere on it.

Image by author

Consider this diagram again (x²), and let it be our error function this time. As discussed before, the error equation of Logistic Regression is likewise a function of its parameters 'w1' and 'w0'; here I simply call the parameter x.

Now let us say we start randomly at x = 1. We know that the function has a minimum value of 0 at x = 0. Notice how that minimum is the single global minimum of this function and the only place where the derivative is 0. Our aim is thus to reach the point x = 0 from x = 1, so we need to reduce the value of x. We can easily achieve that as follows: simply set x = x - alpha*derivative(x²), evaluated at x = 1. That means reducing the value of x by some amount related to alpha and to the slope of the tangent to the curve x² at x = 1. Notice how derivative(x²) at x = 1 is positive, and alpha is always a positive number, so the net effect is that x is reduced from 1 to something smaller than 1. Also notice how x would have increased if we had started from the point x = -1 using the same equation. This process is repeated until the difference between the new and previous values of x falls under some threshold, because reaching the exact point where the derivative is zero is practically impossible; the derivative just keeps getting smaller and smaller as we approach the minimum.
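
Here is that update rule in action on the toy error function x², starting from x = 1 (alpha is chosen arbitrarily for this sketch).

#gradient descent on the toy error function f(x) = x**2, starting at x = 1
x = 1.0
alpha = 0.1                  #an arbitrarily chosen learning rate for this sketch
for step in range(50):
    gradient = 2*x           #derivative of x**2
    x = x - alpha*gradient   #the update rule described above
print(x)  #very close to 0, the global minimum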

Here alpha is called the learning rate and plays a very important role.

  • If alpha is ridiculously large, the error will never converge and might, in fact, increase. (We will see this in Part 2.)
  • If alpha is a bit larger than required, the path taken to reach the minimum will not be ideal and will oscillate around the minimum of the error function.
  • If alpha is too small, we are just wasting time and resources; convergence to the minimum will be very slow.

Thus, choosing a good value for alpha is crucial, and it is a matter of trial and error, guided by the values of the error function.

We have finally covered all the theory needed. We can now jump to coding! If you understood all the concepts properly, the coding will be a breeze.

Summary:

The neuron has to find the right values of 'w2', 'w1', and 'w0', using the Logistic Regression error function and gradient descent, so that the separating plane it places has minimum error! We needed to cover all of that just to write this one-liner. Phew!

Now, I will initialize the parameters ‘w2’, ‘w1’, and ‘w0’ to random values. ‘w0’ is taken as ‘b’.

#initializing parameters
W = np.random.rand(2, 1)
b = np.random.rand(1, 1)

I have found the number of iterations and learning rate by trial and error. I am also making lists to keep track of intermediate error values, updated parameter values, and the current iteration.

#initializing values
iterations = 100
learning_rate = 0.9
parameter_values = np.zeros((3, iterations))
error_values = np.zeros((1, iterations))
iteration_values = np.zeros((1, iterations))

Now I am writing the code to perform gradient descent.

#gradient descent
for iters in range(iterations):
    #propagation function of the neuron.
    P = np.dot(X, W) + b

    #activation function of the neuron.
    A = 1/(1 + np.exp(-P))

    #calculating error.
    E = -(1/m)*(np.dot(Y.T, np.log(A)) + np.dot((1 - Y).T, np.log(1 - A)))

    #storing intermediate error values to make the graph.
    error_values[0][iters] = E

    #calculation of the gradient.
    delta_Jw = (1/m)*np.dot((A - Y).T, X)
    delta_Jb = (1/m)*(np.sum(A - Y))

    #gradient descent update.
    W = W - learning_rate*(delta_Jw.T)
    b = b - learning_rate*(delta_Jb)

    #storing intermediate parameter values to make the graph.
    parameter_values[0][iters] = W[0]
    parameter_values[1][iters] = W[1]
    parameter_values[2][iters] = b

    #storing the current iteration number to make the graph.
    iteration_values[0][iters] = iters

I had to calculate the partial derivatives, given by 'delta_Jw' and 'delta_Jb', by hand. Now see how everything falls into place! Let the code run and watch the magic happen! All the theory we learnt so far went into writing this part of the code.
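
If you want to convince yourself that the hand-derived gradients are correct, an optional sanity check (a sketch I am adding here, not part of the training loop itself) is to compare the analytic gradient of one parameter against a finite-difference estimate of the error function.

#optional sanity check: analytic gradient vs a numerical (finite-difference) estimate
def error_of(W_, b_):
    A_ = 1/(1 + np.exp(-(np.dot(X, W_) + b_)))
    return (-(1/m)*(np.dot(Y.T, np.log(A_)) + np.dot((1 - Y).T, np.log(1 - A_)))).item()

eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0] += eps
W_minus[0] -= eps
numerical = (error_of(W_plus, b) - error_of(W_minus, b))/(2*eps)

A = 1/(1 + np.exp(-(np.dot(X, W) + b)))
analytic = ((1/m)*np.dot((A - Y).T, X))[0][0]  #same formula as delta_Jw above
print(numerical, analytic)  #the two values should agree closely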

Now let’s see the fruits of our labor. Let’s plot and see whether the error converged properly or not.

#plotting the error values
trace = go.Scatter(
    x = iteration_values[0],
    y = error_values[0]
)
data = [trace]
fig = go.Figure(data)
fig.update_layout(title = "Convergence of error",
                  xaxis_title = "Number of iterations",
                  yaxis_title = "Error value",
                  template = "plotly_dark")
fig.show()

The output should look something as follows:

Convergence of error, Image by author.

Yes, things seem to have worked well. You can see that 100 iterations was overkill, but the dataset is so small that I didn't bother, since everything runs fast enough even on Colab.

Let’s check what happened to the parameter values.

#plotting the convergence of parameters
trace0 = go.Scatter(
    x = iteration_values[0],
    y = parameter_values[0],
    name = "W2"
)
trace1 = go.Scatter(
    x = iteration_values[0],
    y = parameter_values[1],
    name = "W1"
)
trace2 = go.Scatter(
    x = iteration_values[0],
    y = parameter_values[2],
    name = "W0"
)
data = [trace0, trace1, trace2]
fig = go.Figure(data)
fig.update_layout(title = "Convergence of Parameters",
                  xaxis_title = "Number of iterations",
                  yaxis_title = "Parameter value",
                  template = "plotly_dark")
fig.show()

The output should look similar to this. Remember that we started from random values, so the starting points in your graph may be different, but the final values should be the same.

Convergence of Parameters, Image by author.

So far so good; our model is ready. We can now use it to predict values on the test set.

#predicting test values
P = np.dot(X_test, W) + b
Predictions = 1/(1 + np.exp(-P))

Since the values are between 0 and 1 due to the use of the sigmoid, we need to convert them to exactly 0 or 1.

#classifying into proper classes
for i in range(len(Predictions)):
    if Predictions[i] >= 0.5:
        Predictions[i] = 1
    else:
        Predictions[i] = 0

Excited to see how your neuron performed? Let us make a confusion matrix telling us what went wrong and what didn’t.

#making confusion matrix
cm = np.zeros((2, 2))
for i in range(len(Y_test)):
    if Y_test[i] == 1 and Predictions[i] == 1:
        cm[0][0] = cm[0][0] + 1
    elif Y_test[i] == 1 and Predictions[i] == 0:
        cm[0][1] = cm[0][1] + 1
    elif Y_test[i] == 0 and Predictions[i] == 1:
        cm[1][0] = cm[1][0] + 1
    else:
        cm[1][1] = cm[1][1] + 1
print(cm)

Output:
[[23. 12.]
 [ 7. 58.]]

So out of the 100 test points, 81 were properly classified (23 + 58), giving an accuracy of 81%. That is easy to see by eye, but I will put the code for computing accuracy from the confusion matrix as well.

#calculating accuracy
diagonal_elements = np.diagonal(cm)
num = np.sum(diagonal_elements)
dem = np.sum(cm)
accuracy = num/dem
print(accuracy*100)

Output: 81.0

Ok cool! Now let's see what the separating plane fitted to the data looks like. I will plot the intersection of the separating plane with the x-y plane using the following code: the boundary lies where w1*x1 + w2*x2 + w0 = 0, i.e. x2 = -(w1/w2)*x1 - w0/w2, which is exactly what the line for 'Y' in the code computes. Since the parameters were learned on the scaled data, I will have to show you the intersection on the scaled test data only.

#visualizing results
trace0 = go.Scatter(
    x = X_test[:, 0].reshape(len(X_test),),
    y = X_test[:, 1].reshape(len(X_test),),
    mode = "markers",
    name = "Data points",
    marker = dict(color = Y_test[:, 0].reshape(len(Y_test),))
)
X = np.arange(-1, 1.5, 0.1)
X = X.reshape(len(X), 1)
Y = -W[0]/W[1]*X - b/W[1]
trace1 = go.Scatter(
    x = X.reshape(len(X),),
    y = Y.reshape(len(Y),),
    name = "Separating line"
)
data = [trace0, trace1]
fig = go.Figure(data)
fig.update_layout(title = "Fitted Hyperplane",
                  xaxis_title = "Age", yaxis_title = "Salary",
                  template = "plotly_dark")
fig.show()

The output will look as follows:

The fitted plane, Image by author.

You can literally count the misclassified points: 12 on one side and 7 on the other, as shown in the confusion matrix.

All right then! This was a big article, but I hope I did a good job of explaining the basics of a neuron and showing you how a single neuron can be used to perform Logistic Regression. I will implement a complete ANN from scratch in the next part. I hope you enjoyed reading this article as much as I enjoyed writing it. You can increase the accuracy by describing the data in higher dimensions (more than 2 features); I would urge you to try that yourself (a small starting point is sketched below).
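
If you want a starting point for that experiment, one hypothetical way to bring 'Gender' back in is sketched below; it assumes the column holds the strings 'Male' and 'Female'. Everything downstream stays the same, except that W becomes a (3, 1) vector.

#a hypothetical starting point for the 3-feature experiment
X3 = dataset[["Gender", "Age", "Salary"]].copy()
X3["Gender"] = (X3["Gender"] == "Male").astype(float)  #simple 0/1 encoding, assuming 'Male'/'Female' values
X3[["Age", "Salary"]] = StandardScaler().fit_transform(X3[["Age", "Salary"]].astype('float'))
#from here, repeat the split, the training loop and the evaluation with W = np.random.rand(3, 1)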

Please excuse me now. I will take some well-deserved rest and get back to you in Part 2 of this article. Do comment if you were able to improve the accuracy or not!

Bye!
