Backpropagation algorithm part 3

Herman Van Haagen
5 min read · Jun 19, 2023


The Derivative, Partial Derivative, and Chain Rule

In Gradient Descent, we used the derivative of a function. In this lesson, we will look at a few more important derivatives and also discuss the partial derivative. We will introduce some mathematical notation that you will come across in books on neural networks. Don’t be discouraged by this notation; it is just a way of writing things. Lastly, we will cover the chain rule, which is the heart of backpropagation.

The Derivative

The derivative of a function gives the slope of the tangent line to its graph at a specific point. This is illustrated in the figure below.

a quadratic function f(x)=x² with derivative f’(x)=2x

For the example above, the derivative at the point x = -1.5 yields the following tangent line:

y = -3x - 2.25

We can do this for any function. We will explore more examples to become more familiar with the derivative, as it is the core concept behind backpropagation.
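You can verify this with a few lines of Python (a quick sketch, not the GitHub example):

```python
# Tangent line to f(x) = x² at x = a: y = f(a) + f'(a)(x - a)
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

a = -1.5
slope = f_prime(a)             # -3.0
intercept = f(a) - slope * a   # -2.25
print(f"y = {slope}x + ({intercept})")  # y = -3.0x + (-2.25)
```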

The sine wave function

The sine function is defined as:

f(x) = sin(x)

And the derivative of the sine function is the cosine:

f’(x) = cos(x)

On my GitHub page, you can find an example in Python of how to calculate the tangent line. Try varying the value of x to see how the tangent line changes.

The sine wave function f(x)=sin(x) and its derivative f’(x)=cos(x). The tangent line is shown in red.
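A minimal sketch of such an example (assuming matplotlib is available; the GitHub version may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

a = 1.0  # point of tangency; try varying this value
xs = np.linspace(-2 * np.pi, 2 * np.pi, 200)

# Tangent line at x = a: y = sin(a) + cos(a) * (x - a)
tangent = np.sin(a) + np.cos(a) * (xs - a)

plt.plot(xs, np.sin(xs), label="f(x) = sin(x)")
plt.plot(xs, tangent, "r", label=f"tangent at x = {a}")
plt.legend()
plt.show()
```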

The exponential function

The exponential function is described as follows:

f(x) = exp(x)

The derivative of the exponential function is equal to the function itself, so:

f’(x) = exp(x)

The Python implementation can be found here.

The exponential function. The red line is the tangent line at point x=0
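You can check this property numerically with a central finite difference (a minimal sketch):

```python
import math

def numerical_derivative(f, x, h=1e-5):
    # Central finite difference: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3
print(numerical_derivative(math.exp, x))  # ≈ 3.66930
print(math.exp(x))                        # 3.66930...
```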

The sigmoid activation function

Now, let’s see the derivative of a well-known activation function, the sigmoid. The sigmoid is defined as:

f(x) = σ(x) = 1 / (1 + exp(-x))

The derivative of the sigmoid is:

f’(x) = σ(x)(1 - σ(x))

The sigmoid activation function with the tangent line at x=2.5
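In code, the derivative conveniently reuses the sigmoid itself (a minimal sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # σ(x)(1 - σ(x))

x = 2.5
print(sigmoid_prime(x))  # ≈ 0.0701, the slope of the tangent at x = 2.5
```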

For an overview of all functions and derivatives, I refer you to the following link.

Notation

Often, in books, you may come across the derivative written as a fraction:

f’(x) = df/dx

Don’t be discouraged by this notation; it is just another way of writing the same derivative. Later, we will see that this notation is convenient and intuitive.

The Partial Derivative

The functions we have seen so far are functions of a single variable. However, we can also have functions of multiple variables. A function of two variables can be plotted as a surface in 3-dimensional space, and functions of more variables live in even higher dimensions. In that case, you can calculate a derivative for each variable separately. This is done by treating all other variables as constants and differentiating with respect to the variable of interest. Let’s illustrate this with an example:

f(x, y) = x² + 2x + y² + 3y

This is a function of two variables, x and y. When we take the derivative with respect to x, we hold y constant. We use the special symbol ∂ for the partial derivative:

∂f/∂x = 2x + 2

The y terms disappear because they are constants. We can also take the derivative with respect to y. It looks as follows:

∂f/∂y = 2y + 3

Both derivatives can now be used in Gradient Descent (more on this in future lessons).
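You can verify both partial derivatives symbolically, for example with sympy (a minimal sketch):

```python
import sympy as sp

x, y = sp.symbols("x y")
f = x**2 + 2*x + y**2 + 3*y

print(sp.diff(f, x))  # 2*x + 2  (y is treated as a constant)
print(sp.diff(f, y))  # 2*y + 3  (x is treated as a constant)
```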

The Chain Rule

The last rule we will cover is the chain rule. The chain rule reminds me of the movie “Inception.” In the film, there is a dream within a dream within a dream. The chain rule has the same nested structure, because now we have a function within a function. Let’s look at an example. Consider the following function:

f(x) = (sin(x))²

Here, we have a quadratic (squaring) function. Inside it, there is a sine function. The chain rule states that you first take the derivative of the outer (quadratic) function, evaluated at the inner function, and then multiply it by the derivative of the inner (sine) function. This results in the following derivative:

f’(x) = 2sin(x)cos(x)

Notation:

You can define one function within another. It looks like this:

f(x) = h(g(x))

So, the outer function h() is the squaring function h(y) = y², and the inner function g(x) is sin(x). The chain rule now states:

f’(x) = h’(g(x)) * g’(x)

where

h’(g(x)) = 2sin(x)

and

g’(x) = cos(x)

This function is worked out on my GitHub page.
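Not the GitHub version, but a quick numerical check of this derivative:

```python
import math

def f(x):
    return math.sin(x) ** 2

def f_prime(x):
    # Chain rule: h'(g(x)) * g'(x) = 2*sin(x) * cos(x)
    return 2 * math.sin(x) * math.cos(x)

x, h = 0.8, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # central finite difference
print(numeric, f_prime(x))                 # both ≈ 0.99957
```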

We also had the fraction notation. Let’s introduce some new variables.

z = h(y) = y²

y = g(x) = sin(x)

Then, the chain rule becomes:

dz/dx = dz/dy · dy/dx = 2y · cos(x) = 2sin(x)cos(x)

You can see how the derivative with respect to x is calculated indirectly via the intermediate variable y. The fraction notation will be useful when formulating the backpropagation rules.

Let’s look at another example of the chain rule. Consider the following function:

f(x) = sin(x²)

The outer function is the sine, with its derivative being the cosine. The inner function is x², with its derivative being 2x. The result is:

f’(x) = cos(x²) · 2x

An example of this in Python can be found on my GitHub page.
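As a sketch, sympy confirms both chain rule examples in a few lines:

```python
import sympy as sp

x = sp.symbols("x")
print(sp.diff(sp.sin(x**2), x))  # 2*x*cos(x**2)
print(sp.diff(sp.sin(x)**2, x))  # 2*sin(x)*cos(x), the first example
```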

With this, we conclude this lesson. You have now learned about the derivative, the partial derivative, and the chain rule. In the next lesson, we will work through a practical example using backpropagation: linear regression.
