Let’s Explore Neural Networks: Brain & Backbone of AI (Artificial Intelligence)—A Fresh Perspective 🕵🏽️‍

JAIGANESAN
The Hack Weekly — Data & AI Community
13 min read · May 13, 2024

Change your perspective on Neural Networks (MLP)!

Image by Gordon Johnson from Pixabay

Neural networks (multilayer perceptrons) are often misunderstood, and it’s time to change that. In this article, we’ll take a step back and explore the basics of neural networks in a simple and easy-to-understand way. No complex math or computations here!

The objective of this article is to change your perspective on neural networks, and I will do my best to share mine.

To grasp neural networks, you need to understand some fundamental concepts of AI:

✨ 1. Input Layer ✨
✨ 2. Hidden Layer ✨
✨ 3. Output layer/node ✨
✨ 4. Loss Function ✨
✨ 5. Gradient optimization ✨
✨ 6. Activation function ✨
✨ 7. Optimizer ✨

1. Input Layer: The Gateway to the Network

The input layer is where the data enters the network for processing. It acts as a bridge between the external data and the network itself. A common misconception about the input layer is that it performs computations. Not true! The neurons in this layer simply pass the input data forward to the next layer without any processing. For example, if our input data has 10 independent features or columns of a table, we’ll have 10 input neurons (more on why I don’t like using the term “neuron” later 🤐).

2. Hidden Layer: The Computation Hub

A hidden layer is any layer between the input and output layers. It’s called “hidden” because its activity isn’t directly observable from the system’s input or output. This is where all the magic of AI happens — computations, activation functions and gradient updates (weight updates) take place. We’ll touch on this briefly later. Each neuron in a hidden layer receives input from the previous layer, performs some computation, and then passes the result to the next layer.

3. Output Layer/Neuron: The Final Output

The output layer is where the neural network’s final output is calculated. Depending on the desired result, you might have one neuron for regression or binary classification, or multiple neurons for multi-class and other use cases. The operation is similar to the hidden layer, with each neuron receiving input, performing computations, and passing the result forward.

4. Loss Function: The Measure of Success

A loss function, also known as a cost function or objective function, is a crucial component in training AI models, including neural networks. It measures how well the model’s predictions match the actual target values in the training data. The goal is to minimize this loss function, indicating that the model makes accurate predictions.

5. Gradient Optimization:

Think of gradient optimization as trying to find the lowest point in a hilly landscape by taking small steps in the steepest downhill direction. In machine learning, it’s a technique used to tweak and adjust the parameters of a model to minimize errors or maximize performance.

6. Activation Function:

Activation functions are like the switches in a neural network. They determine whether a neuron should “fire” (activate) based on its input. They add non-linearity to the network, allowing it to learn complex patterns and relationships in data.

7. Optimizers:

Optimizers are the algorithms that help the model learn from data by adjusting its parameters. They work hand in hand with gradient optimization, guiding the model in the right direction during training. Think of them as the coaches that help a team (Our model) improve its performance over time. There are different types of optimizers, each with its strengths and weaknesses, but they all aim to make the model better at its task.

Diving Deeper into Neural Networks: Architecture and Function!

Now that I’ve covered some basic terms, let’s explore the architecture and function of neural networks in more detail.

The Correct Architecture of a Neural Network! 😶

First, let’s take a look at the correct architecture of a neural network.

The architecture you most often see for a neural network (MLP)

Image 1: Created by the author

And here’s a modified version:

Image 2: Created by the author

In going from Image 1 to Image 2, the main change lies in the input layer. This is because there aren’t any operations occurring at the input layer, so that’s where the distinction should be highlighted. You’ve probably come across images depicting the weights in connections between neurons, where you’ll see labels like w11, w12, w21, and so on. Typically, these images end with a representation of the loss function.

What’s Happening Inside a Neural Network’s So-Called Neuron

Image 3: Source: https://machinelearningmastery.com/calculus-in-action-neural-networks/

z is the weighted sum computed by the neuron in the hidden layer: z = w₁x₁ + w₂x₂ + … + wₙxₙ + b.
wᵢ is the weight multiplying the i-th input.
xᵢ is the output of the i-th neuron in the previous layer/input layer.
b is the bias term associated with neurons in the hidden layer/output layer.

Let’s Change Your Perspective 🪄

Let’s consider a regression problem with 4 independent variables (4 columns of a table data) and 1 dependent variable (target variable that we are going to predict using AI).

We’ll use the Image 2 neural network structure:

Adapting the above neural network architecture to the problem:

  1. Input Layer — 4 input nodes
  2. 1st Hidden Layer — 5 neurons
  3. 2nd Hidden Layer — 5 neurons
  4. Output Layer — 1 neuron

Note: In this article, we’ll be using the term “matrices” to refer to both weights and biases, even though biases are typically represented as vectors and weights as matrices.

Let’s use a batch size of 10, so the input data will be 10 × 4 (10 records/rows, 4 independent features/columns). Let’s go through the steps:

Important Note to Understand the Linear Layer:

Weight_Matrix(current layer’s number of neurons, previous layer’s or input layer’s number of neurons).

We also initialize the linear layer with an input size and an output size: nn.Linear(input vector size (4), output vector size (5)).

For example, if we have 4 input units (input vector size) in the MLP/FFN and 5 neurons in the hidden layer, the weight matrix size will be W(5, 4).

This matrix (transposed) is multiplied with the input vectors, giving a new vector.

Then the bias vector is added to the linearly transformed vector. This is a linear vector transformation: vectors are transformed from one dimension (4) to another (5), or to the same dimension. This is the same operation you have studied for neurons.
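Here is a minimal PyTorch sketch of that idea (the variable names are illustrative): with nn.Linear(4, 5), the stored weight matrix indeed has shape (5, 4), and a 4-dimensional input comes out as a 5-dimensional vector.

```python
import torch
import torch.nn as nn

# Linear layer mapping 4 input features to 5 outputs (one per hidden "neuron").
layer = nn.Linear(in_features=4, out_features=5)

print(layer.weight.shape)  # torch.Size([5, 4]) -> W(5, 4), as described above
print(layer.bias.shape)    # torch.Size([5])    -> one bias per neuron

x = torch.randn(1, 4)      # a single 4-dimensional input vector
y = layer(x)               # computes x @ W.T + b
print(y.shape)             # torch.Size([1, 5]) -> transformed from 4 to 5 dimensions
```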

Step 1: Input Layer to 1st Hidden Layer

Note: The numbers shown in the images are for sample reference.

Image 4: Image created by the author
Image Linear: Linear function (Neural Network ). Source: pytorch.org

Image Linear: shows the linear transformation in a neural network. Here x is the input matrix (A in our case), A is the weight matrix (W1, W2, W3 in our case), and b is the bias matrix.

Let me break down Image 4 for you: it shows the multiplication between the scaled input matrix, which is 10 × 4 in size, and the weight matrix, which is 5 × 4.

In simpler terms, we’re dealing with 4 inputs and 5 Neurons in the current hidden layer.

You’ll notice R1, R2, R3, R4, and R5 in the image, which indirectly correspond to what we might typically call “neurons” in the hidden layer. The resulting matrix will be 10 × 5.
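As a quick sketch of this step (random numbers standing in for the actual values), the shapes work out like this:

```python
import torch

A  = torch.randn(10, 4)   # scaled input matrix: 10 records, 4 features
W1 = torch.randn(5, 4)    # 1st hidden layer weights: 5 neurons x 4 inputs

Z1 = A @ W1.T             # (10, 4) @ (4, 5) -> (10, 5)
print(Z1.shape)           # torch.Size([10, 5]); the 5 columns play the role of R1..R5
```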

Step 2: Weighted Sum and Bias

Image 5: Image created by the author

Let’s talk about Image 5 :

Once we’ve multiplied the input matrix A and the weight matrix W1 together, we get a new matrix of size (10, 5). Then comes the step where we add a 1 × 5 bias matrix to every row of this resulting matrix. This addition is called broadcasting.

The bias adds a bit of variation to the outcome, which is actually beneficial for neural networks. It helps them learn more intricate patterns and become more versatile, leading to a more well-rounded neural network.
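A small sketch of the broadcasting step, with made-up values:

```python
import torch

Z1 = torch.randn(10, 5)   # result of A @ W1.T from the previous step
b1 = torch.randn(1, 5)    # bias: one value per neuron

Z1_b = Z1 + b1            # b1 is broadcast (added) across all 10 rows
print(Z1_b.shape)         # torch.Size([10, 5])
```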

Step 3: Activation Function

Image 6: Image created by the author

Once we’ve computed the weighted sum for each neuron, it’s time to add some flavor to our neural network by introducing non-linearity. This is where activation functions step in, like sigmoid, tanh, or ReLU, among others.

They work their magic by transforming the weighted sum into something more nuanced — the output of the neuron. Think of it as adding curves and twists to our model’s understanding of the data, steering it away from being too straightforward and linear. We’re aiming for complexity here, wanting our neural network to grasp the intricate patterns in the data.

In our case, we’ve opted for the ReLU activation function, f(x) = max(0, x), which does the neat trick of converting all negative values to 0. You can see this transformation in action in Image 6, representing the output of the first hidden layer.
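A tiny example of ReLU in PyTorch (the input values are arbitrary):

```python
import torch

Z = torch.tensor([[-1.2, 0.5, 3.0, -0.7, 2.1]])
H = torch.relu(Z)          # f(x) = max(0, x): every negative value becomes 0
print(H)                   # tensor([[0.0000, 0.5000, 3.0000, 0.0000, 2.1000]])
```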

Step 4: 1st Hidden Layer to 2nd Hidden Layer

Now the 1st hidden layer output will become the input of the 2nd Hidden layer, and steps 1–3 repeat again in the 2nd Hidden Layer.

Image 7: Image created by the author
Image 8: Image created by the author
Image 9: Image created by the author

Step 5: 2nd Hidden Layer to Output Layer

Now the 2nd hidden layer’s output becomes the input of the last/output layer, and Steps 1–2 repeat.

Image 10: Image created by the author
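Putting Steps 1–5 together, here is a minimal sketch of the whole 4 → 5 → 5 → 1 forward pass in PyTorch (nn.Sequential is used purely for illustration; the article’s images show the same computation matrix by matrix):

```python
import torch
import torch.nn as nn

# The 4 -> 5 -> 5 -> 1 architecture used in this article.
model = nn.Sequential(
    nn.Linear(4, 5), nn.ReLU(),   # input layer -> 1st hidden layer (Steps 1-3)
    nn.Linear(5, 5), nn.ReLU(),   # 1st hidden layer -> 2nd hidden layer (Step 4)
    nn.Linear(5, 1),              # 2nd hidden layer -> output layer (Step 5)
)

A = torch.randn(10, 4)            # batch of 10 records with 4 features each
predictions = model(A)
print(predictions.shape)          # torch.Size([10, 1]) -> one prediction per record
```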

Step 6: Calculating Loss

Image 11: Image created by the author

Now we are at the final step: the calculated/predicted values are compared with the true values (this is a regression use case), and the total loss is calculated. We have used MSE (mean squared error) in this use case to compute the loss.

MSE = (1/n) Σ (predicted value − true value)²

Why have we used MSE? Its result is easily differentiable, and it tends to converge faster than MAE.
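A minimal sketch of the MSE computation in PyTorch (the predicted and true values are made up):

```python
import torch
import torch.nn as nn

predicted = torch.tensor([2.5, 0.0, 2.1])
true_vals = torch.tensor([3.0, -0.5, 2.0])

loss = nn.MSELoss()(predicted, true_vals)   # mean of the squared differences
print(loss)                                 # tensor(0.1700)
```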

If we skip applying an activation function, what will be the outcome?

Imagine we have this set of equations:

1. Z1 = A · (W1)^T + B1
2. Z2 = Z1 · (W2)^T + B2
3. Z2 = (A · (W1)^T + B1) · (W2)^T + B2
4. Z2 = A · (W1)^T · (W2)^T + B1 · (W2)^T + B2
5. Z3 = Z2 · (W3)^T + B3
6. Z3 = ((A · (W1)^T + B1) · (W2)^T + B2) · (W3)^T + B3
7. Z3 = A · (W1)^T · (W2)^T · (W3)^T + B1 · (W2)^T · (W3)^T + B2 · (W3)^T + B3

In simple terms, if we don’t apply any activation function, our output will merely be a linear combination of the input data.

This means our Neural Network (MLP) won’t be capable of learning any non-linear relationships between input and output. Essentially, it won’t be able to capture the complexities often present in real-world data.

Without activation functions, our model won’t be creating any curves or intricate patterns to understand the data better; we’d be left with nothing but a line or hyperplane.
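Here is a quick numerical check of equation 7 above, using random matrices: three stacked linear layers without activations give exactly the same result as one collapsed linear map.

```python
import torch

torch.manual_seed(0)
A = torch.randn(10, 4)
W1, B1 = torch.randn(5, 4), torch.randn(5)
W2, B2 = torch.randn(5, 5), torch.randn(5)
W3, B3 = torch.randn(1, 5), torch.randn(1)

# Three stacked linear layers with no activation in between (equation 6).
Z3 = ((A @ W1.T + B1) @ W2.T + B2) @ W3.T + B3

# The same result written as a single linear map in A (equation 7).
linear_part = A @ W1.T @ W2.T @ W3.T
bias_part   = B1 @ W2.T @ W3.T + B2 @ W3.T + B3
print(torch.allclose(Z3, linear_part + bias_part, atol=1e-5))  # True
```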

Now comes the important step in training:

Backward Propagation:

Backpropagation, short for “backward propagation of errors,” is a fundamental algorithm used to train neural networks. It’s a key component of the training process in which the network learns to adjust its parameters (weights and biases) to minimize the difference between its predictions and the actual target values. This process involves calculating the gradients of the loss function with respect to the parameters (weights and biases) of the network.

Once the gradients ( Slope — where or which direction the parameter should move in order to reach global minima ) have been computed, the optimization algorithm is used to update the parameters of the network in the direction that reduces the loss. This involves adjusting the weights and biases by small amounts proportional to the negative of the gradients, aiming to minimize the loss function.

In this process, a gradient optimization algorithm is used, and you often come across the update rule:

new weight = old weight − learning rate × (derivative of the loss with respect to the old weight)
new bias = old bias − learning rate × (derivative of the loss with respect to the old bias)

Image 12: Image created by the author

Here’s a breakdown of the optimization process: We’re figuring out how the loss changes with respect to a specific weight, essentially calculating its derivative. Then, using this information, we employ gradient descent, a method where we adjust the model’s parameters — like weights and biases — based on the error estimation.

The learning rate plays a crucial role here. It’s like a dial that controls how big the steps are during training optimization. Set it too high, and we might leap over the optimal solution or bounce around it, causing instability and slow learning. But set it too low, and we risk crawling toward the solution at a snail’s pace or getting stuck in less-than-ideal spots. So, finding the right balance is key for efficient and effective training.
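A minimal sketch of one such update step in PyTorch (the model, data, and learning rate are illustrative): loss.backward() computes the gradients, and each parameter is nudged by the negative gradient times the learning rate.

```python
import torch
import torch.nn as nn

# Illustrative model and data; only the update rule matters here.
model = nn.Sequential(nn.Linear(4, 5), nn.ReLU(), nn.Linear(5, 1))
A, y_true = torch.randn(10, 4), torch.randn(10, 1)

loss = nn.MSELoss()(model(A), y_true)
loss.backward()                        # backpropagation fills p.grad for every parameter

learning_rate = 0.01
with torch.no_grad():
    for p in model.parameters():
        p -= learning_rate * p.grad    # new weight = old weight - lr * dLoss/dWeight
        p.grad.zero_()                 # clear gradients before the next step
```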

Alright, let’s delve into some calculus now. You’ve probably heard of the chain rule before — it’s like the backbone of backpropagation in neural networks. Essentially, it’s a rule in calculus that helps us figure out how to find the derivative of a function that’s composed of other functions. In the context of backpropagation, we use this rule to calculate how the loss changes with respect to each weight in our neural network. It’s like breaking down a big problem into smaller, more manageable pieces.

Image 13: Image created by the author

For each layer, we compute two gradients:

  1. Local Gradient: The gradient of the loss function with respect to the output of the layer. This represents how much the loss would change if the output of the layer changed, holding the parameters of the layer constant. This is computed using the gradient of the loss function with respect to the output of the layer and the gradient of the layer’s output with respect to its input.
  2. Parameter Gradient: The gradient of the loss function with respect to the parameters of the layer. This represents how much the loss would change if the parameters of the layer changed, holding the output of the layer constant. This is computed using the local gradient and the gradient of the layer’s output with respect to its parameters.

By chaining together these local and parameter gradients layer by layer, we can efficiently compute the gradients of the loss function with respect to all the parameters of the network, which allows us to update the parameters using gradient-based optimization algorithms like SGD, SGD with momentum, RMSProp, and Adam.
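In practice, the manual update shown earlier is usually delegated to an optimizer object; here is a minimal sketch (Adam is chosen arbitrarily for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 5), nn.ReLU(), nn.Linear(5, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)   # or SGD, RMSprop, ...
loss_fn = nn.MSELoss()

A, y_true = torch.randn(10, 4), torch.randn(10, 1)

optimizer.zero_grad()                  # clear old gradients
loss = loss_fn(model(A), y_true)
loss.backward()                        # chain rule computes gradients layer by layer
optimizer.step()                       # the optimizer updates all weights and biases
```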

Convergence to Global Minima

This training process continues until the parameters converge toward the global minimum of the loss or until the number of epochs we initialized earlier is reached. To illustrate this, imagine a scenario where we have two parameters, similar to longitude and latitude, initially located near the UK. The loss is high at this initial coordinate (UK), so we need to adjust it using the cost function and optimizer. As we converge towards the global minimum, the coordinates slowly move from the UK to India, where the loss is low. (Coordinates → coefficients or weights)

Number of Parameters in the Model

Let’s calculate the number of parameters in our model:

  1. Input layer to 1st hidden layer: 4 × 5 = 20 weights + 5 biases
  2. 1st hidden layer to 2nd hidden layer: 5 × 5 = 25 weights + 5 biases
  3. 2nd hidden layer to output layer: 5 × 1 = 5 weights + 1 bias

Total: 50 weights + 11 biases = 61 parameters

These 61 parameters are adjusted (imagine 61 parameters moving in multi-dimensional space) until they reach the global minimum, using optimizers, cost functions, and gradient optimization algorithms.
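You can verify this count in PyTorch; a minimal sketch with the same 4 → 5 → 5 → 1 layout:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 5), nn.ReLU(),   # 4 x 5 = 20 weights + 5 biases
    nn.Linear(5, 5), nn.ReLU(),   # 5 x 5 = 25 weights + 5 biases
    nn.Linear(5, 1),              # 5 x 1 =  5 weights + 1 bias
)

total_params = sum(p.numel() for p in model.parameters())
print(total_params)   # 61
```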

Neural Network Analogy! 👼

To better understand neural networks, imagine drawing a picture by connecting the dots. At the end of connecting the dots, you get a reasonable picture. Similarly, all 61 parameters need to be in the right position to capture complex patterns in the training data. When the network sees the test data, it will give good results, similar to a reasonable picture.

Improving the Neural Network ( MLP — Multilayer Perceptron )

This basic neural network can be improved using various techniques, such as:

  1. Normalization techniques (Batch and Layer Normalization)
  2. Hyperparameter tuning
  3. Different activation functions and optimizers
  4. Regularization techniques (L2, L1, Dropout, Drop Connect)

Remember, everything in understanding AI (neural networks) boils down to one thing — That is MATH — Be it Statistics, Algebra, or Calculus, you name it!

I hope I’ve managed to clarify the most important aspects of neural networks, particularly the MLP (multilayer perceptron).

Don’t think of it in terms of neurons; think of it in terms of matrices and vectors, and you will understand it better!

Are you interested in learning more about neural networks and their foundational concepts?

Read about the Universal Approximation Theorem (UAT) to gain a deeper understanding.

References :

  1. https://www.ibm.com/topics/neural-networks
  2. https://www.geeksforgeeks.org/neural-networks-a-beginners-guide/
  3. https://pytorch.org/docs/stable/generated/torch.nn.Linear.html

Thanks for reading this article 🤩

If you found this article useful, give it your 50 claps 👏 to show your encouragement and help keep me motivated to write more!

Feel free to follow for more insights.

Let’s stay in touch on LinkedIn ❤️ to keep the conversation going!

See you again next time, have a great day ahead
