Introduction to Neural Networks

Shivam Batra · Published in The Startup · 9 min read · Aug 22, 2020


There has been hype about artificial intelligence, machine learning, and neural networks for quite a while now. I have been working on these things for over a year, so I would like to share some of my knowledge and give my point of view on neural networks. This will not be a math-heavy introduction; I just want to build the idea here.

I will start with the neural network as a whole and then explain each of its components. If you feel something is not right or need help with any of this, feel free to contact me; I will be happy to help.

When to use a Neural Network?

Let’s assume we want to solve a problem where we are given a set of images and have to build an automated system that can categorize each image into its correct label.

The problem looks simple, but how do we come up with logic using raw pixel values and target labels? We can try comparing pixels and edges, but we won’t be able to come up with an approach that does this task effectively, say with an accuracy of 90% or more.

When we have this kind of problem, where we have high-dimensional data like images and we don’t know the relationship between the input (images) and the output (labels), we should use neural networks.

What is a Neural Network?

Artificial neural networks, usually simply called neural networks, are computing systems vaguely inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain.

A neural network is a set of neurons stacked one after the other such that the network learns the relationship between the input and the output variable. It can solve all kinds of problems: classification, regression, or generative problems like next-word prediction and image captioning.

We already have a lot of algorithms in machine learning, like SVMs, logistic regression, linear regression, and so many more, which do the same thing, i.e. they also try to learn the relationship between input and output variables. So why neural networks?

Why are Neural Networks used over Traditional Machine learning?

Traditional ML algorithms are good, and they are not compute-intensive, but they do not work well on high-dimensional, unstructured data such as images, audio, or text.

Traditional algorithms are still the building blocks of neural networks, but they do not capture relationships as well as neural networks do. A neural network can learn any complex relationship given enough data and proper compute power.

Now that we know when to use a neural network, we can start exploring its components and how they work.

What are the components of a Neural Network?

A Neural network has some basic components which are:

  1. Neurons or layers
  2. Loss function
  3. Optimizer

What is a Neuron?

A neuron is the building block of neural networks. A neuron has a weight for each input fed to it, plus a bias.

Let’s assume we have a classification problem where we have a set of features like weight, height, BMI, medical history, and age, and based on those we have to classify whether a person is likely to have a heart problem or not.

Now, we want to give our neural network this data and want it to learn the mapping between these features and the output (heart disease or not).

Let me introduce one of the functions used in neural networks: the sigmoid, or logistic, function.

Sigmoid Function

Traditional algorithms like logistic regression use this same function: first they take all the inputs or features and assign each feature a weight W; then they pass the weighted sum through a sigmoid function, which spits out a probability.

Artificial Neuron

Let’s assume we have 4 features X1, X2, X3, and X4, and based on those we want to classify whether the person is likely to have heart disease or not. We can label the two classes 1 and 0.

The weighted sum will look something like this:

[W1*X1 + W2*X2 + W3*X3 + W4*X4] + bias → one value

We will pass this value to the sigmoid function:

sigmoid(x) = 1 / (1 + e^(-x))

Here x represents the value of [W1*X1 + W2*X2 + W3*X3 + W4*X4] + bias.

Suppose we have a value of x = 0; then e^(-0) is 1, and the whole expression evaluates to 1/2. This means that if the weighted sum plus the bias comes out to zero, we get a probability of 0.5.

Here we only have two classes, so a probability of 0.5 means the model is not sure about the predicted class, i.e. both classes have an equal chance when the whole computation is zero.

  1. If the value is high, the sigmoid will generate a value closer to 1, which means the chance of class 1 is higher.
  2. If the value is low, the sigmoid will generate a value closer to 0, which means the chance of class 0 is higher.
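These three cases can be checked directly. Here is a minimal sketch of the sigmoid function in Python; the input values are arbitrary, chosen only to illustrate the zero, high, and low cases:

```python
import math

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))   # 0.5    -> the model is unsure; both classes equally likely
print(sigmoid(5))   # ~0.993 -> strongly favours class 1
print(sigmoid(-5))  # ~0.007 -> strongly favours class 0
```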

These scores depend on the features and weights of the model, i.e. the W’s and X’s. Since we can’t change the features, we have to update the weights in such a way that the output matches the expected output.

We also have one more term, known as the bias, which just shifts the sigmoid towards the right or the left. It is used to fit the data better.

Updating the weights is the job of the optimizer, which will be discussed later.

The idea is simple: we have some features/inputs, each mapped to a weight, and the dot product of the weights with the features is fed to the activation function, known as the sigmoid, which generates a score.
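As a sketch, a single artificial neuron for the heart-disease example could look like this; the feature values, weights, and bias below are made up purely for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(features, weights, bias):
    # Dot product of weights and features, plus the bias,
    # fed to the sigmoid activation.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical values for X1..X4 and W1..W4 (not from real data).
features = [0.6, 0.2, 0.9, 0.4]
weights = [0.5, -1.2, 0.8, 0.1]
bias = -0.3
score = neuron(features, weights, bias)
print(score)  # a probability between 0 and 1
```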

This function is used in logistic regression and is heavily used in neural networks, but in neural networks it is used in the form of layers. One neuron in a layer is just a sigmoid function with some weights and a bias.

Layers

The whole idea of neural networks is based on the Universal Approximation Theorem.

The intuition behind this theorem is that if we have a very complex function between the input and output, we can learn an approximation of it by dividing that function into smaller chunks, with each chunk learned by one neuron or one part of the neural network.

Thus, by stacking up layers of neurons, we can learn complex functions. If you want to learn more about the theorem, click here.

Now that we know about the sigmoid and the Universal Approximation Theorem, we can stack up neurons and form a layer.

This is what a neural network looks like.

This is a very basic neural network with 3 neurons in the input layer, which means it can take 3 features as input.

It has 4 neurons in the hidden layer, which represent 4 sigmoid functions. Each of them learns a part of the complex function between input and output.

Finally, we have 2 neurons in the output layer, which represent the two categories.
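To make the shapes concrete, here is a rough NumPy sketch of a forward pass through this 3-4-2 architecture; the weights are random (untrained), so the outputs are meaningless until a loss function and optimizer tune them:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 3 inputs -> 4 hidden neurons -> 2 output neurons.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0])      # one example with 3 features
hidden = sigmoid(x @ W1 + b1)       # 4 hidden activations
output = sigmoid(hidden @ W2 + b2)  # 2 output scores
print(output.shape)  # (2,)
```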

Now that we have created an architecture, we want the network to learn. For that, we need a few more components: a loss function and an optimizer.

Loss Function

The loss function is just a mathematical expression that tells the network how well it is performing. We have different loss functions for different problem statements. The loss function defines what kind of relationship the network is trying to learn.

Regression Loss functions

If we want the network to predict something like the Air Quality Index or the rating of a restaurant, we don’t want it to predict classes or probabilities; we want it to predict numbers in some range.

In that case, we would use mean squared error or root mean squared error, which compares the value generated by the network with the ground truth, or actual value, and gives a loss value based on the difference between the two.

If the difference between the two values, i.e. the predicted and true values, is high, the loss will be high; otherwise it will be low.
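For instance, mean squared error can be written in a few lines; the two example predictions below are made-up numbers:

```python
def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between prediction and ground truth.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical Air Quality Index readings vs. model predictions.
print(mean_squared_error([50, 80], [55, 70]))  # (25 + 100) / 2 = 62.5
print(mean_squared_error([50, 80], [50, 80]))  # 0.0 -- perfect predictions
```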

Classification Loss functions

Suppose instead we want the network to predict whether a person is likely to have heart disease or not, detect a music genre, or classify between images of dogs and cats.

In that case, we would go for binary cross-entropy or categorical cross-entropy, which takes the predicted probabilities from the network, compares them with the actual probability distribution, and gives a loss value based on the difference.
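A minimal binary cross-entropy sketch shows this behaviour; the probabilities below are arbitrary examples:

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Confidently wrong probabilities are penalized far more than
    # unsure ones; eps avoids taking log(0).
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_cross_entropy([1], [0.9]))  # ~0.105 -- confident and correct
print(binary_cross_entropy([1], [0.1]))  # ~2.3   -- confident and wrong
```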

We can also define a custom loss function if we want to solve some new problem.

Optimizer

The role of the neurons and layers is to generate scores, and the role of the loss function is to tell how far the predicted score is from the ground truth, or target.

The optimizer comes into the picture after the loss is calculated. The optimizer tries to find the relationship between the loss and the weights and biases of the network. Its goal is to bring the loss as low as possible, so that the predictions the model makes are closer to the target.

The optimizer tries to capture the relationship between each weight and bias in the network and the loss function.

Loss function = some function of (weights and biases)

The derivative of a function with respect to some variable x tells us the relationship between that function and x. It tells us how much the function will change if the value of x changes, and in which direction.

Change for one weight = ∂(Loss function) / ∂(Weight W)

Since the loss function is a function of multiple weights and biases, this change is the partial derivative of the loss function with respect to one weight. This change is also known as the gradient.

Now that we know the relationship between the loss function and the weights, we can update the weights in such a way that the loss function is minimized. This process runs in parallel for every weight in the network, and the weights are updated every time we calculate the loss.

The gradient is the direction of increase of the loss function with respect to a weight, which means that if we added the gradient when updating the weights, we would increase the loss instead of decreasing it. To avoid that, we subtract the gradient every time we update a weight, which decreases the loss function.

Updated Weight = Previous Weight - Gradient

Various optimizers can be used, such as Stochastic Gradient Descent, Mini-Batch Gradient Descent, or Adam. Additional ideas such as the learning rate and momentum are used, but the basic idea is the gradient. This algorithm is known as Gradient Descent.
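The whole loop can be seen in a one-weight toy example. This is a sketch of plain gradient descent with a learning rate, on a made-up training pair whose true weight is 4:

```python
# Fit w so that w * x ≈ y for a single made-up training pair.
# Loss = (w*x - y)^2, so dLoss/dw = 2 * x * (w*x - y).
x, y = 2.0, 8.0    # the true weight is 4
w = 0.0            # initial guess
learning_rate = 0.05

for _ in range(100):
    gradient = 2 * x * (w * x - y)    # derivative of the loss w.r.t. w
    w = w - learning_rate * gradient  # subtract the gradient to reduce the loss

print(round(w, 3))  # 4.0
```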

I will be writing more on neural networks, where I will try to cover the math as well as the ideas behind it.


Shivam Batra
A data science enthusiast who writes about AI, machine learning, and data.