Deep Dive into Neural Networks

Nermeen Abd El-Hafeez
11 min read · Aug 30, 2023


What Is a Neural Network?

A neural network is a computational framework inspired by the intricate organization and functioning of the human brain. This technology stands as a fundamental cornerstone within the realms of machine learning and artificial intelligence. These networks are composed of interlinked nodes, commonly referred to as “neurons,” systematically arranged in layers to process information.

Neural Network Layers

The input layer is the initial part of the network where raw data is introduced. It serves as the entry point for the information that the network will process and learn from. The number of neurons in the input layer is determined by the dimensionality of the input data.

A hidden layer is an intermediate layer located between the input layer and the output layer. Hidden layers are where most of the computation and feature extraction take place, and they play a crucial role in enabling the network to learn complex patterns and relationships in the data.

The output layer is the final layer that produces the network’s predictions or outcomes. It’s the part of the network where the processed data from the previous layers is transformed into the desired form for the specific task the network is designed to solve.

What Are Neural Network Weights and Biases?

- Weights

At the heart of a neural network’s operation is the ability to move data forward, aptly called forward propagation. This process is where the significance of weights becomes apparent. Think of weights as traffic directors for the connections between units: they dictate how much impact the signal from one unit has on the next. As the network learns, these weights adapt, increasing or decreasing, allowing the network to sharpen its grasp of the data. Then comes the pivot to backward propagation, where the network retraces its path, meticulously fine-tuning these connections, layer by layer, to achieve heightened accuracy.

- Biases

Think of biases as the friendly guides within the neural network ensemble. Unlike weights, which scale the signals travelling between units, a bias is an extra value added to a neuron’s weighted sum, nudging its output up or down like a little road sign steering data in the right direction. Biases are not part of the input data itself; they belong to the neurons. Like weights, they are adjusted during backward propagation, ensuring the final outcome is not just accurate but precisely tailored. And the intriguing part? Even when all of a neuron’s inputs are zero, its bias can still ignite a spark, propelling data forward.
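To ground both ideas, here is a minimal Python sketch of what a single unit computes during forward propagation; the input, weight, and bias values below are made-up illustrations, not taken from any particular network.

import numpy as np

# One neuron: a weighted sum of its inputs plus a bias (illustrative values).
x = np.array([3.0, 5.0])   # incoming signals
w = np.array([2.0, 4.0])   # weights scale each signal's influence
b = 1.0                    # the bias nudges the result up or down

z = np.dot(w, x) + b       # (3 * 2) + (5 * 4) + 1 = 27
print(z)                   # 27.0; z is then passed through an activation function

# Even if every input were zero, z would still equal b,
# which is how a bias can keep a neuron responsive.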

What Is an Activation Function?

Activation functions, the crucial building blocks of neural networks, hold the remarkable ability to sculpt how networks respond. Beyond generating simple outputs, these mathematical formulas influence broader aspects of training: steering convergence, affecting its speed, and even determining whether convergence is attainable. Moreover, activation functions act as conductors of output normalization, confining results within specific ranges like -1 to 1 or 0 to 1.

Efficiency becomes a cornerstone for these functions. Their effectiveness doesn’t solely rest in shaping outcomes; it extends to optimizing computation time. This becomes especially vital when neural networks tackle massive datasets brimming with millions of data points. The worth of an activation function is measured by its capacity to streamline computations, allowing neural networks to adeptly learn and evolve amidst monumental data volumes.

Activation Function Types

1- Sigmoid

The sigmoid activation function is a mathematical function that transforms input values into a range between 0 and 1. It’s commonly used in neural networks, especially in the output layer of binary classification models, where the goal is to predict probabilities for two classes.

f(x) = 1 / (1 + exp(-x))

2- Tanh (hyperbolic tangent)

The tanh (hyperbolic tangent) activation function is a mathematical function that transforms input values into a range between -1 and 1. It’s a commonly used activation function in neural networks, often used in hidden layers due to its balanced properties.

f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

3- ReLU (Rectified Linear Unit)

Among the array of activation functions, ReLU stands as a workhorse, widely embraced for its simplicity and efficacy. It functions by allowing positive input values to flow unchanged, while intercepting negative inputs and transforming them to zero.

f(x) = max(0, x)

4- Leaky ReLU

The Leaky ReLU activation function presents a refined approach to the traditional ReLU function. With the original ReLU, the challenge arises when inputs fall below zero, causing neurons to deactivate and possibly resulting in the “dying ReLU” issue.

Leaky ReLU comes to the rescue by countering this problem. Unlike ReLU, where negative inputs are flatlined to zero, Leaky ReLU introduces a clever twist. Instead of abruptly cutting off, it allows a minuscule linear component of the input (typically 0.01 times x) to shine through.

f(x) = max(0.01 * x, x)

5- Maxout

Maxout is a versatile activation function that can approximate a wide range of other activation functions. It’s designed to address some of the limitations of traditional activation functions like ReLU and sigmoid. Maxout works by taking the maximum value among a set of linear combinations of the input.

f(x) = max(w1*x + b1, w2*x + b2)
# w1 and w2 are weights
# b1 and b2 are biases

6- ELU (Exponential Linear Unit)

ELU offers an enticing blend of performance and versatility, making it a favorite in the neural network realm. For positive inputs, ELU behaves just like the identity function, passing the input unchanged. However, for negative inputs, ELU introduces an exponential curve that gently tapers off, providing a non-zero output.

f(x) = x if x > 0, else a * (exp(x) - 1)
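To make the formulas above concrete, here is a minimal NumPy sketch of each one; the 0.01 slope for Leaky ReLU and a = 1.0 for ELU are common illustrative defaults rather than fixed requirements.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))                      # output in (0, 1)

def tanh(x):
    return np.tanh(x)                                # output in (-1, 1)

def relu(x):
    return np.maximum(0, x)                          # negatives become 0

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)                  # small leak for negatives

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1))   # smooth negative tail

def maxout(x, w1, b1, w2, b2):
    return np.maximum(w1 * x + b1, w2 * x + b2)      # max of two linear pieces

print(relu(np.array([-2.0, 3.0])))                   # [0. 3.]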

Forward Propagation

Forward propagation constitutes a foundational step within neural networks, orchestrating the transformation of input data into predictions or outputs. It’s a journey that unfurls as data embarks on a passage from the input layer, coursing through hidden layers, and culminating in the output layer. With each layer, data undergoes metamorphosis, shaped by weighted connections and the influence of activation functions.

The forward propagation process commences by populating the values of the neurons within the first hidden layer. This is achieved by leveraging the input data and applying an appropriate activation function, which contributes to the distinctiveness of each neuron’s value. Subsequently, these calculated values are used to influence the neurons in the subsequent hidden layer, bestowing them with meaningful inputs. The culmination of this intricate progression leads to the generation of predictions through the neurons in the output layer.

This process of forward propagation is a dynamic iterative sequence that can be repeated across multiple cycles. With each iteration, the process remains consistent, extending its reach deeper into the network’s architecture. This iterative nature harnesses the power of the neural network to gradually refine predictions, offering a more informed and accurate outcome.

Let’s embark on a journey through a neural network employing the ReLU activation function at every layer, illuminating how computations traverse and transform.

Imagine our journey starts with a calculation: (3 * 2) + (5 * 4) = 26, a spark of positivity. This number forms the bedrock for the first node in the initial hidden layer. ReLU steps in and, max(26, 0) equals 26 — an activation anthem.

Now, let’s switch lanes. Another calculation takes the spotlight: (5 * -5) + (3 * 4) = -13, a shadow of negativity. This brews the essence for the second node in the same hidden layer. Here, ReLU works its magic: max(-13, 0) clamps the output to a resolute 0.

Another calculation unfolds: (26 * -1) + (0 * 1) = -26. The stage shifts to the first node in the second hidden layer. ReLU sweeps in — max(-26, 0) finds its calm at 0.

Now : (26 * 2) + (0 * 2) = 52, a harmony of positivity. This ensemble finds its way to the second node of the second hidden layer. ReLU’s touch persists — max(52, 0) flourishes into 52.

Finally, the finale: (0 * 3) + (52 * 7) = 364. The stage is set for the output node. ReLU’s transformation melody swells — max(364, 0) stands tall at 364.

Throughout these calculations, activations, and transformations, ReLU takes on the role of the conductor, guiding the narrative of data’s passage through the layers of the neural network, revealing patterns, and heralding predictions.
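The walkthrough above fits in a few lines of NumPy; the weights are exactly the ones used in the narrative.

import numpy as np

def relu(z):
    return np.maximum(0, z)

# Inputs 3 and 5, with the weights from the walkthrough above.
h1_a = relu(3 * 2 + 5 * 4)          # 26
h1_b = relu(5 * -5 + 3 * 4)         # max(-13, 0) = 0

h2_a = relu(h1_a * -1 + h1_b * 1)   # max(-26, 0) = 0
h2_b = relu(h1_a * 2 + h1_b * 2)    # 52

output = relu(h2_a * 3 + h2_b * 7)
print(output)                       # 364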

Loss Function

In the realm of training a neural network, a loss function (sometimes referred to as a cost function or objective function) holds a pivotal position. Its role is to measure the gap between the predicted values produced by the network and the actual target values in the training dataset. This evaluation of the disparity shows how adeptly the network’s forecasts align with the ground truth, functioning as a guiding compass for the meticulous tweaking of the network’s parameters throughout training.

  • The heart of a loss function lies in its ability to harmonize a multitude of prediction errors emanating from an array of data points into a single, cohesive numerical representation.
  • A lower value of the loss function signifies a more finely tuned and proficient model.
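As a small, self-contained example, mean squared error is one widely used loss; the predictions and targets below are made-up values.

import numpy as np

def mse(y_pred, y_true):
    # Average of squared errors: many per-example gaps collapse
    # into a single number summarizing model quality.
    return np.mean((y_pred - y_true) ** 2)

y_pred = np.array([2.5, 0.0, 2.1])    # network outputs (illustrative)
y_true = np.array([3.0, -0.5, 2.0])   # targets (illustrative)
print(mse(y_pred, y_true))            # 0.17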

Gradient Descent

We are presented with a curve that represents the loss function on the vertical axis, plotted against different values of a weight on the horizontal axis. Our objective is to identify the nadir of this curve, as it marks the point at which the loss is smallest and our model is most accurate.

To analyze this curve, we introduce a tangent line that intersects it at our present position. The inclination of this tangent line mirrors the rate of change of the loss function concerning our current weight. This inclination is fundamentally associated with a mathematical concept called the derivative.

Utilizing the derivative, we determine the direction in which to advance. When the slope of the tangent line is positive, the loss increases as the weight increases, so we proceed in the contrary direction of the slope, moving towards lower values of the loss. This procedure is iterated until it is no longer possible to proceed downhill any further.
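Here is a minimal sketch of this procedure on a toy one-weight loss, loss(w) = (w - 3)^2, whose derivative is 2 * (w - 3); both the loss and the starting weight are illustrative.

# Toy gradient descent: loss(w) = (w - 3)**2 has its minimum at w = 3.
w = 0.0                         # illustrative starting weight
learning_rate = 0.1

for step in range(50):
    slope = 2 * (w - 3)         # derivative of the loss at the current w
    w -= learning_rate * slope  # step in the contrary direction of the slope

print(w)                        # very close to 3.0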

Learning Rate

The learning rate is a critical hyperparameter in the training process of neural networks and other optimization algorithms. It determines the size of the steps taken towards minimizing the loss function during each iteration of the training process. Think of it as the step size you take while descending a hill to reach the lowest point.

Choosing an appropriate learning rate is essential, as it can significantly impact the training process and the final performance of the neural network.

Here’s how different learning rate values affect the training process (a small sketch follows the list below):

1- High Learning Rate:

  • A high learning rate can cause the optimization process to overshoot the optimal solution.
  • This results in the loss function fluctuating and failing to converge to a minimum.
  • It leads to unstable training and prevents the model from achieving its best performance.

2- Low Learning Rate:

  • A low learning rate makes the optimization process very slow.
  • The model takes small steps in parameter space, requiring numerous iterations to converge.
  • While it can eventually find a good solution, this approach is computationally demanding.

3- Appropriate Learning Rate:

  • An appropriately chosen learning rate ensures efficient convergence without overshooting.
  • It leads to stable and rapid training convergence, resulting in improved overall training results.
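A quick sketch of the three regimes, reusing the toy loss loss(w) = (w - 3)^2 from above; the specific rates are illustrative.

def descend(learning_rate, steps=20):
    # Gradient descent on loss(w) = (w - 3)**2, starting from w = 0.
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)
    return w

print(descend(1.1))    # too high: oscillates ever farther from the minimum
print(descend(0.001))  # too low: after 20 steps, barely moved from 0
print(descend(0.1))    # appropriate: lands very near the minimum at 3.0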

Back Propagation

You’ve used gradient descent to optimize weights in a simple model. Now we’ll add a technique called “back propagation” to calculate the slope you need to optimize more complex deep learning models.

Much like how forward propagation guides input data through the concealed layers to the output layer, backpropagation undertakes the task of transmitting the error from the output layer and retracing its steps through the hidden layers in the direction of the input layer. This intricate process sequentially computes the essential gradients, originating from the weights adjoining the prediction, traversing the hidden strata, and ultimately reaching the weights connected to the input data. These gradients are subsequently employed to adjust the weights, facilitating the refinement of the entire network.

Back Propagation process:

Backpropagation serves as a foundational algorithm for training neural networks by refining their internal parameters via gradient descent. This iterative process involves determining how each weight contributes to the slope of the loss function, thus guiding the network towards minimizing overall errors. The steps below walk through one full training cycle; a minimal end-to-end sketch follows the list.

1- Forward Pass:

  • Start with an input vector x.
  • Calculate the weighted sum and apply the activation function for each neuron in each layer to get the output of each neuron.
  • The output of one layer becomes the input of the next layer. This process continues until the final layer, producing the predicted output y_pred.

2- Loss Calculation:

  • Calculate the loss between the predicted output y_pred and the actual target y_true using a suitable loss function, such as mean squared error (MSE) for regression or cross-entropy for classification.

3- Work Backwards Through the Network Using the Chain Rule:

  • Imagine that this error at the output layer is influenced by many factors from the layers before it.
  • We want to understand how much each of these factors contributed to the overall error.
  • We work backward through the layers, considering each layer’s role in the error using the chain rule from calculus.
  • It’s like understanding which parts of the earlier layers are responsible for the mistakes we see in the output.

4- Gradient Calculation:

  • As we move backward, we calculate two important things at each layer.
  • First, we calculate how much the ‘total input’ to each neuron in that layer (the weighted sum of inputs) contributed to the error.
  • Second, we calculate how much the activations (output values) of that layer contributed to the error.
  • These two calculations give us insights into how to adjust the weights and biases of the network to reduce the error.

5- Parameter Update:

  • We use the insights gained from the gradient calculations to understand how much each weight and bias should be tweaked to make the network’s predictions better.
  • Bigger tweaks are made to the weights and biases that had a stronger influence on the error.
  • Smaller tweaks are made to those that had a weaker influence.
  • This step is like fine-tuning the neural network knobs to make it more accurate.

6- Repeat:

  • Iterate through the entire process multiple times, referred to as epochs.
  • During each epoch, execute the forward pass, compute the loss, carry out the backward pass, and adjust parameters based on gradients.
  • Repeat this sequence until a predetermined number of epochs or until the loss converges to a satisfactory level.
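Putting the six steps together, here is a minimal NumPy sketch that trains a tiny one-hidden-layer network with sigmoid activations and a mean squared error loss; the XOR dataset, the 2-3-1 architecture, and the hyperparameters are all illustrative choices, not the only way to set this up.

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (illustrative): learn XOR from two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_true = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Parameters of a 2-3-1 network (illustrative sizes).
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
learning_rate = 1.0

for epoch in range(5000):
    # 1- Forward pass: weighted sums plus activations, layer by layer.
    h = sigmoid(X @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)

    # 2- Loss calculation: mean squared error.
    loss = np.mean((y_pred - y_true) ** 2)
    if epoch % 1000 == 0:
        print(f"epoch {epoch}: loss {loss:.4f}")

    # 3/4- Backward pass: the chain rule yields each layer's gradients.
    d_y = 2 * (y_pred - y_true) / len(X)   # dLoss / dy_pred
    d_z2 = d_y * y_pred * (1 - y_pred)     # through the output sigmoid
    d_W2 = h.T @ d_z2
    d_b2 = d_z2.sum(axis=0)
    d_h = d_z2 @ W2.T                      # error passed back to hidden layer
    d_z1 = d_h * h * (1 - h)               # through the hidden sigmoid
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0)

    # 5- Parameter update: step each parameter against its gradient.
    W1 -= learning_rate * d_W1
    b1 -= learning_rate * d_b1
    W2 -= learning_rate * d_W2
    b2 -= learning_rate * d_b2

# 6- After many epochs, predictions should approach the targets.
print(y_pred.round(2))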

In summary, neural networks mirror the brain’s functioning, propelling advancements in AI and machine learning. They revolutionize data processing, from fundamental elements like weights, biases, and activation functions to intricate mechanics. Forward propagation, powered by activations, uncovers hidden patterns. Loss functions and gradient descent enhance accuracy. Backpropagation refines deep learning. Balanced learning rates unlock neural networks’ potential. Ultimately, this layered journey enables machines to understand and predict our complex world, guided by architectural principles.
