Understanding Backpropagation: The Engine Behind Neural Networks

Understanding the Basics of Backpropagation

Mohit Mishra
The Deep Hub
7 min read · Mar 31, 2024


Neural networks, which are inspired by the human brain, are extremely effective machine learning tools. They are made up of interconnected layers of “neurons” that process and transform data, eventually learning to perform complex tasks such as image recognition and translation. But how do these networks really learn? The answer lies in a critical algorithm known as backpropagation.

Training a neural network involves feeding it data and iteratively adjusting its internal parameters (weights and biases) to reduce the gap between its predictions and the desired outputs. This difference is quantified by a loss function, which serves as a compass, directing the network toward improved performance.

However, simply calculating the error at the output layer is insufficient. We need to understand how each neuron in the network adds to the overall error. This is where backpropagation comes in, allowing us to calculate each parameter’s contribution and adjust it as needed.

The Problem: Evaluating Errors

Imagine a neural network tasked with classifying images of cats and dogs. If the network misclassifies a cat as a dog, we need to understand which neurons in the network contributed to this error and how much. This is where the loss function plays a crucial role. It measures the discrepancy between the network’s prediction and the true label, providing a single number that quantifies the error.

However, just knowing the overall error at the output layer isn’t enough. We need to distribute this error back through the network to understand how each neuron contributed. This is where the limitations of local error become apparent. Evaluating the error only at the output layer doesn’t provide enough information to effectively update all the parameters in the network.

The Solution: Backpropagation Algorithm

A. Terminology

Before diving into the algorithm, let’s clarify some key terms:

  • Weights: These are the parameters that determine the strength of the connections between neurons. Each connection has a weight associated with it, influencing how much one neuron’s output affects another.
  • Biases: These are additional parameters that act as offsets, influencing the neuron’s activation.
  • Activation functions: These are mathematical functions applied to the weighted sum of inputs at each neuron, introducing non-linearity and allowing the network to learn complex relationships.
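Putting these three terms together, a single neuron's computation can be sketched in a few lines of numpy. This is only an illustration; the sigmoid activation and the specific numbers below are arbitrary choices:

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    """A single neuron: weighted sum of inputs plus bias, passed through sigmoid."""
    z = np.dot(weights, inputs) + bias   # connection strengths times inputs, offset by the bias
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation introduces non-linearity

x = np.array([0.5, -1.2])   # outputs from the previous layer
w = np.array([0.8, 0.3])    # one weight per incoming connection
b = 0.1                     # bias offset
print(neuron_output(x, w, b))
```

The bias shifts the point at which the neuron activates, while the weights scale how strongly each input matters.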

B. The Chain Rule: A Mathematical Ally

The chain rule is a fundamental concept in calculus that allows us to calculate the derivative of a composite function. In the context of neural networks, it helps us efficiently calculate the gradient of the loss function with respect to each weight and bias in the network. The gradient essentially tells us how much the error changes with respect to a small change in a particular weight or bias.
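The chain rule is easy to verify numerically. A small sketch, using f(u) = u² and g(x) = sin(x) purely as an example, checks the analytic chain-rule derivative against a finite-difference estimate:

```python
import math

# Composite function y = f(g(x)), with f(u) = u**2 and g(x) = sin(x)
def dy_dx(x):
    # Chain rule: f'(g(x)) * g'(x) = 2*sin(x) * cos(x)
    return 2.0 * math.sin(x) * math.cos(x)

# Finite-difference estimate of the same derivative
f = lambda u: u ** 2
g = math.sin
x, eps = 1.0, 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
print(dy_dx(x), numeric)  # the two values agree to several decimal places
```

Backpropagation applies exactly this idea, just with a longer chain of functions.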

To understand this better, let's consider a simple neural network with one hidden layer and one output neuron. We'll focus on calculating the gradient of the loss function with respect to a weight in the hidden layer.

(Figure: a simple network with one hidden-layer neuron feeding one output neuron. Image by author.)

Before diving into the mathematics, let's pause and establish some notation:

  • w: weight in the hidden layer
  • h(w): output of the hidden layer neuron (activated value)
  • o(h): output of the final neuron (predicted value)
  • L(o): loss function, measuring the error between predicted and actual values

Objective: Calculate ∂L/∂w, the gradient of the loss function with respect to the weight in the hidden layer.

Applying the Chain Rule:

The chain rule states that for a composite function y = f(g(x)), the derivative of y with respect to x is:

                     dy/dx = (dy/dg) * (dg/dx) = f'(g(x)) * g'(x)

In our neural network example, we have a chain of functions: the weight influences the hidden layer output (h), which in turn affects the final output (o), which finally determines the loss (L). So, we can apply the chain rule to find the gradient of the loss with respect to the weight:

                     ∂L/∂w = (∂L/∂o) * (∂o/∂h) * (∂h/∂w)

Breaking Down the Terms:

  • ∂L/∂o: This represents how much the loss changes with respect to the final output. It depends on the specific loss function used (e.g., mean squared error).
  • ∂o/∂h: This represents how much the final output changes with respect to the hidden layer output. It depends on the activation function used in the output neuron.
  • ∂h/∂w: This represents how much the hidden layer output changes with respect to the weight. It depends on the activation function used in the hidden layer neuron and the input to the neuron.

Step-by-Step Calculation:

  • Calculate the error term at the output layer (∂L/∂o) based on the chosen loss function and the difference between the predicted and actual values.
  • Propagate the error term backwards to the hidden layer by multiplying it with ∂o/∂h, which is calculated based on the output neuron’s activation function.
  • Calculate ∂h/∂w based on the hidden layer neuron’s activation function and input.
  • Multiply all three terms to obtain the final gradient ∂L/∂w.

I’ve also included simplified Python code below, which assumes the predicted value, actual value, and hidden-layer output have already been calculated during the forward pass.
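Here is a minimal sketch of that calculation. It assumes sigmoid activations in both neurons, a squared-error loss L = 0.5 * (o - y)², and a hypothetical weight v from the hidden neuron to the output neuron; none of these choices are fixed by the algorithm itself:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_hidden_weight(x, h, o, y, v):
    """Chain rule for dL/dw with sigmoid activations and squared-error loss.

    x: input to the hidden neuron       h: hidden-layer output, sigmoid(w*x)
    o: predicted output, sigmoid(v*h)   y: true target value
    v: weight from the hidden neuron to the output neuron
    """
    dL_do = o - y                 # dL/do for L = 0.5*(o - y)**2
    do_dh = o * (1 - o) * v       # sigmoid'(v*h) * v
    dh_dw = h * (1 - h) * x       # sigmoid'(w*x) * x
    return dL_do * do_dh * dh_dw  # multiply the three terms

# Forward-pass values for one training example
x, w, v, y = 0.5, 0.8, 1.5, 1.0
h = sigmoid(w * x)
o = sigmoid(v * h)
g = grad_hidden_weight(x, h, o, y, v)
print(g)  # negative here: increasing w would reduce the loss
```

Swapping in a different loss or activation only changes the corresponding factor in the product; the structure of the chain stays the same.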

The Backpropagation Process: A Step-by-Step Breakdown

A. Forward Pass: Data Processing Flow

During the forward pass, data flows through the network layer by layer. Each neuron receives a weighted sum of the outputs from the previous layer, applies its activation function, and passes its own output to the next layer. This process continues until the final output is produced.
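This data-processing flow can be sketched as a short loop over layers, assuming sigmoid activations and numpy arrays for the weights and biases (the layer sizes below are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass: propagate x through the network layer by layer.

    weights: list of weight matrices, biases: list of bias vectors.
    Returns every layer's activations, which backprop will need later.
    """
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b      # weighted sum of the previous layer's outputs
        activations.append(sigmoid(z))   # apply the activation function
    return activations

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]  # 2 inputs -> 3 hidden -> 1 output
biases = [np.zeros(3), np.zeros(1)]
acts = forward(np.array([0.5, -0.2]), weights, biases)
print(acts[-1])  # the network's final prediction
```

Keeping every layer's activations around is deliberate: the backward pass reuses them when computing gradients.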

B. Backward Pass: Error Propagation

  • Error at the Output Layer: After the forward pass, the error is calculated at the output layer based on the chosen loss function. This error represents the discrepancy between the network’s prediction and the true label.
  • Propagating the Error Backwards: This is where the chain rule comes into play. We use it to calculate how much each neuron in the previous layer contributed to the error at the output layer. This process is repeated layer by layer, propagating the error signal backwards through the network.
  • Calculating Gradients: At each layer, we use the propagated error to calculate the gradients of the loss function with respect to the weights and biases of the neurons in that layer. These gradients tell us how much each parameter contributed to the overall error.

Updating the Network: The Power of Gradients

The calculated gradients are used to update the weights and biases of the network. We adjust each parameter in the direction that minimizes the loss function, essentially nudging the network towards making better predictions. The learning rate controls the size of these updates, ensuring that the network doesn’t overshoot the optimal values.
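A minimal sketch of this update rule, with hypothetical gradient values standing in for the output of backpropagation:

```python
# One gradient-descent step: move each parameter against its gradient,
# scaled by the learning rate so the network does not overshoot.
def gradient_step(params, grads, lr=0.1):
    return [p - lr * g for p, g in zip(params, grads)]

weights = [0.8, -0.3]
grads = [0.25, -0.5]   # hypothetical gradients from backpropagation
new_weights = gradient_step(weights, grads)
print(new_weights)  # roughly [0.775, -0.25]
```

Note the sign: a positive gradient means increasing the parameter increases the loss, so we subtract.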

In the next post, we will go into greater detail about how network updates work and the true power of gradients. To fully comprehend this, we need to go deeper and build some basic functions from scratch, so I’ll be writing backpropagation code that uses no libraries besides numpy, as this will greatly assist in understanding how all of these things work under the hood.

Addressing Challenges

While backpropagation is a powerful tool, it faces some challenges:

  • Vanishing/Exploding Gradients: In deep networks, gradients can become very small or very large during backpropagation, hindering the learning process. Techniques like ReLU activation functions and careful initialization help mitigate these issues.
  • Optimization Algorithms: Several advanced optimization algorithms, such as Adam and RMSprop, build upon backpropagation and improve its efficiency by dynamically adjusting the learning rate for different parameters.
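To see why gradients vanish, note that the sigmoid's derivative never exceeds 0.25, so backpropagating through many sigmoid layers multiplies the error signal by a factor of at most 0.25 per layer. A small illustration:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1 - s)  # peaks at 0.25 when z = 0

# Best case for sigmoid: each layer contributes a factor of 0.25,
# so after 20 layers the error signal has all but vanished.
signal = 1.0
for _ in range(20):
    signal *= sigmoid_grad(0.0)
print(signal)  # roughly 9.1e-13

# ReLU's derivative is 1 for positive inputs, letting the signal pass intact.
relu_grad = lambda z: 1.0 if z > 0 else 0.0
```

This is one reason ReLU activations became the default in deep networks.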

We will discuss this further once we have a better understanding of how the network updates. This backpropagation series will span 3–4 articles, since cramming all of the information into one post would hurt both comprehension and readability.

My name is Mohit Mishra, and I’m a blogger who creates intriguing content that leaves readers wanting more. Anyone interested in machine learning and data science should check out my blog. My writing is designed to keep you engaged and intrigued, with a new piece published every two days. Follow along for in-depth content that will leave you wanting more!

If you liked the article, please clap and follow me, since that pushes me to write more and better content. I have also linked my GitHub account and portfolio at the bottom of the blog.

You can follow me on Twitter to learn more about this. Every day, I tweet about a variety of subjects here, such as software engineering, system design, deep learning, machine learning, and more.

All images and formulas attached were created with AlexNail and the CodeCogs site, and I do not claim ownership of them.

Thank you for reading my blog post on Understanding Backpropagation: The Engine Behind Neural Networks. I hope you find it informative and helpful. If you have any questions or feedback, please feel free to leave a comment below.

I also encourage you to check out my portfolio and GitHub. You can find links to both in the description below.

I am always working on new and exciting projects, so be sure to subscribe to my blog so you don’t miss a thing!

Thanks again for reading, and I hope to see you next time!

[Portfolio Link] [Github Link]
