Getting started with Neural Networks: For dummies

Bhanu Kiran
7 min read · Jan 7, 2023


Here’s the thing about Neural Networks, a.k.a. NNs: they are a stepping stone for a lot of up-and-coming development within ML. They might seem extremely complex to learn, understand, and follow, and your brain cells might feel like they are malfunctioning while trying to wrap around the theory.

Conventional supervised machine learning models follow the tradition of having a target variable, from which the model can learn the relationship between the feature variables and the target variable. This can be visualized as:

Fig. 1

This can also be re-written with an equation as such:

Fig. 2

Where:

X is the set of features

Y is the output

With this simple understanding of the structure of an ML model, we can put it into perspective with life expectancy prediction as an example. Here, I shall use the linear regression formula to generate a line equation for life expectancy:

y = wx + b

Where:

y is life expectancy,

x is the life quality (can be thought of as a score, or an int),

w is the slope (the weight),

and b is the intercept (the bias)
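
To make this concrete, here is a minimal sketch of fitting such a line with NumPy. The life-quality scores and life expectancies below are made up purely for illustration, and np.polyfit is just one convenient way to get w and b by least squares.

```python
import numpy as np

# Hypothetical "life quality" scores and life expectancies, made up
# purely to illustrate the idea; not real data.
x = np.array([2.0, 4.0, 5.0, 6.5, 8.0, 9.0])         # life quality score
y = np.array([55.0, 62.0, 66.0, 71.0, 78.0, 81.0])   # life expectancy in years

# Fit y = w*x + b by ordinary least squares (degree-1 polynomial fit).
w, b = np.polyfit(x, y, deg=1)

print(f"slope w = {w:.2f}, intercept b = {b:.2f}")
print("predicted life expectancy for a score of 7:", w * 7 + b)
```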

We can now frame this into the format of Fig. 2 as such:

Fig. 3

Now, if you have followed along until now, great! Take a break and try to identify what is wrong with the figure above (Fig. 3)…

The problem is that many things in this universe are complex; we cannot just give a single score to life quality, because so many variables go into a person’s life expectancy. These include diet quality, nutrition intake, smoking, exercise, genetics, diseases, and unknown factors, and the list can go on and on!

All of these factors relate to life expectancy through what can be described as non-linear relationships. But what is a non-linear relationship?!

Non-Linear Relationships

Take a look at Figs. 1 and 2: there is one X, which is the input, and one Y, which is the output of the model. We devised a linear regression model, where “linear” indicates that the relationship between the parameters you are estimating (for example, w and b in the linear regression formula above) and the outcome is a straight line.

But this is straightforward, and in reality nothing in life is as straightforward as it may seem. As discussed above, things are complex and depend on many factors; combined, these factors affect the outcome and form a non-linear relationship.

Let’s visualize this by putting these various factors into the equation and seeing how the structure of Fig. 2 changes.

Fig 4. Non-linear relationships

As you can see in Fig. 4, there are many factors. These factors can be grouped together under one umbrella, and all of these umbrellas together determine y, which is the life expectancy.

Starting from the left side of the figure, every factor can be considered a variable, denoted by x1, x2, x3, and so on; together they form our input layer. The circles, or nodes, in between are the “neurons”, which in this case form our first hidden layer, and together they determine your y in the end.

Keep in mind that Fig. 4 is just an example. In a real neural network, the neurons are arranged in far more layers than you might think; they are known as hidden layers because their connections are hard to interpret, and you can also have more than one neuron in the output layer instead of a single y.

The idea here is to try and imitate what happens in a human brain when something is perceived.

If you have understood everything so far, great! If not, you can refer to the diagram below to better understand the structure of a neural network.

Fig 5. Neural Network Structure

The Neural Network

Fig 6. Neural Network

One problem in the hidden layer, as discussed above, is that we do not know how the variables combine. Therefore, we must assume that any variable can interact with any other variable, i.e., we take all possible combinations of the variables.

Each variable x1 through xn going into the hidden layer has a weight assigned to it, each neuron in the hidden layer has a bias denoted by b, and each neuron applies an activation function denoted by “sigma” (σ).

Now, we can take the formula of life expectancy from above and rephrase it as follows.

Fig 7. Weights and Biases

The activation is often the sigmoid function, and in simple terms it decides whether the neuron should be activated or not. If our input is large and exceeds a threshold, the activation output is large; if our input is small, the activation output is small. Three activation functions are commonly used: Rectified Linear Unit (ReLU), Logistic (Sigmoid), and Hyperbolic Tangent (Tanh).
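
As a rough sketch, the three common activations and the computation of a single neuron, σ(w·x + b), can be written in a few lines of NumPy. The inputs, weights, and bias below are arbitrary values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified Linear Unit: passes positive inputs, zeros out the rest."""
    return np.maximum(0.0, z)

def tanh(z):
    """Hyperbolic tangent: squashes any input into (-1, 1)."""
    return np.tanh(z)

# One neuron: weighted sum of the inputs plus a bias, passed through sigma.
x = np.array([0.7, 0.2, 0.9])     # e.g. diet, exercise, genetics scores (made up)
w = np.array([0.4, -0.1, 0.8])    # one weight per input
b = 0.5                           # bias of this neuron

z = np.dot(w, x) + b              # pre-activation
a = sigmoid(z)                    # activation sigma(z)
print(z, a)
```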

Training the neural network

The reason Fig. 7 is so big is that it is that important. When you train the neural network, there is a set of parameters you have to set. For any given network topology, such as Fig. 6, and a set of activation functions, the parameters to set are the weights and biases. Training the neural network involves finding the values of the weights and biases that minimize the error in predicting the output y given a set of inputs x.

In-depth, this means that:

  1. You have a network.
  2. You can have different activation functions for different layers.
  3. For each of these layers, you set weights and biases.
  4. You set the weights and biases in such a way that you minimize the error.
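
A minimal sketch of what “setting weights and biases for each layer” looks like in code, assuming an arbitrary 6-4-1 topology and arbitrary activation choices; the random weights are just the starting point that training will later adjust.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary topology for illustration: 6 inputs -> 4 hidden neurons -> 1 output.
layer_sizes = [6, 4, 1]
activations = ["relu", "sigmoid"]   # one activation per layer after the input

# One weight matrix and one bias vector per layer; weights start random,
# biases start at zero. Training will adjust these values.
params = []
for n_in, n_out, act in zip(layer_sizes[:-1], layer_sizes[1:], activations):
    W = rng.normal(0.0, 0.1, size=(n_out, n_in))
    b = np.zeros(n_out)
    params.append({"W": W, "b": b, "activation": act})

for i, layer in enumerate(params, start=1):
    print(f"layer {i}: W shape {layer['W'].shape}, activation {layer['activation']}")
```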

Gradient Descent

Gradient descent is used as an optimization technique. And what are we optimizing here? The error value.

To minimize the error we can define the loss function as:

Fig 8. Loss function

where J(W,b) is the loss function defined with weights and biases,

m is the number of data points

The loss function tells you how much the error changes when you change either W or b, or both.
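
Since Fig. 8 is not reproduced here, assume the common mean-squared-error form of the loss, averaged over the m data points; a small sketch:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error over m data points: J = (1/m) * sum((y_pred - y_true)^2)."""
    m = len(y_true)
    return np.sum((y_pred - y_true) ** 2) / m

y_true = np.array([55.0, 62.0, 66.0, 71.0])
y_pred = np.array([57.0, 60.0, 68.0, 70.0])
print(mse_loss(y_true, y_pred))   # average squared error of the predictions
```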

Let’s imagine a scenario before we “descend” (get it?).

You are on top of a mountain, and you decide to go down. Your aim is to get to the station in the valley. If you descend too fast, you overshoot and go up the other side of the valley. If you do not descend at all, you will never reach the valley. Gradient descent is about reaching the valley and staying there, and this is controlled by how you update the weights and biases, W and b.

The gradient descent algorithm is:

  1. Compute the predicted output and the loss function J(W, b).
  2. Calculate the derivatives with respect to W and b: dW^k = ∂J/∂W^k and db^k = ∂J/∂b^k.
  3. Update the parameters: W^k = W^k − α·dW^k and b^k = b^k − α·db^k, where α is the learning rate.
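
Here is a sketch of those three steps applied to the simple y = wx + b model with a mean-squared-error loss; the learning rate α = 0.01 and the number of steps are arbitrary choices for illustration.

```python
import numpy as np

# Toy data (the same made-up scores as before).
x = np.array([2.0, 4.0, 5.0, 6.5, 8.0, 9.0])
y = np.array([55.0, 62.0, 66.0, 71.0, 78.0, 81.0])

w, b = 0.0, 0.0   # starting values
alpha = 0.01      # learning rate

for step in range(10000):
    # 1. predicted output and loss J(w, b)
    y_pred = w * x + b
    loss = np.mean((y_pred - y) ** 2)

    # 2. derivatives dJ/dw and dJ/db of the MSE loss
    dw = 2 * np.mean((y_pred - y) * x)
    db = 2 * np.mean(y_pred - y)

    # 3. update the parameters: move against the gradient, scaled by alpha
    w -= alpha * dw
    b -= alpha * db

print(f"w = {w:.2f}, b = {b:.2f}, final loss = {loss:.2f}")
```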

Forward and backward propagation

If you have followed until now, have a look at Fig. 6 for a second. Going from left to right, we feed in inputs and the neural network gives an output. This is known as forward propagation: in general, input data is fed through the network, in a forward direction, to generate an output. This output comes with an error; you then take this error and feed the loss backward through the neural network layers to fine-tune the weights and biases.

In general, the neural network runs in epochs, or cycles; one epoch consists of one forward propagation and one backward propagation. The forward propagation is based on the input data and the backward propagation is based on the error. To put this in steps:

  1. Start by assigning random values to W and b.
  2. Using gradient descent, run epochs.
  3. Repeat the epochs until you reach convergence.

Convergence is the point in training after which further updates change the model very little and the error produced by the model on the training data settles at its minimum.
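
Putting forward propagation, backward propagation, and the convergence check together, here is a small sketch of a training loop for a one-hidden-layer network on made-up data; the layer sizes, learning rate, and convergence threshold are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: rows are people, columns are factors (e.g. diet, exercise, genetics).
X = rng.uniform(0, 1, size=(50, 3))
y = 50 + 20 * X[:, 0] + 10 * X[:, 1] - 5 * X[:, 2] + rng.normal(0, 1, size=50)
y = ((y - y.mean()) / y.std()).reshape(-1, 1)   # standardize the target for stable training

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Start by assigning random values to W and b for each layer.
W1, b1 = rng.normal(0, 0.5, size=(3, 4)), np.zeros((1, 4))   # input (3) -> hidden (4)
W2, b2 = rng.normal(0, 0.5, size=(4, 1)), np.zeros((1, 1))   # hidden (4) -> output (1)

alpha, prev_loss = 0.05, np.inf

# 2. Run epochs: each epoch is one forward propagation and one backward propagation.
for epoch in range(10000):
    # Forward propagation: inputs -> hidden activations -> prediction
    A1 = sigmoid(X @ W1 + b1)
    y_pred = A1 @ W2 + b2
    loss = np.mean((y_pred - y) ** 2)

    # Backward propagation: push the error back through the layers to get gradients
    m = len(X)
    dy = 2 * (y_pred - y) / m
    dW2, db2 = A1.T @ dy, dy.sum(axis=0, keepdims=True)
    dZ1 = (dy @ W2.T) * A1 * (1 - A1)             # derivative through the sigmoid
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0, keepdims=True)

    # Gradient descent update of all weights and biases
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

    # 3. Repeat until convergence: stop when the loss barely changes any more.
    if abs(prev_loss - loss) < 1e-9:
        break
    prev_loss = loss

print(f"stopped after {epoch + 1} epochs, final training loss = {loss:.4f}")
```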
