# Gradient Descent Visualization

## Visualize SGD optimization algorithm with Python & Jupyter

# Introduction

In this article I aim to provide a good visual perspective on the Stochastic Gradient Descent (SGD for short) algorithm, which will hopefully give you a good understanding of how it works.

Some of the math is presented in this article, but I advise you to read further in other materials for a deeper understanding.

Visualizations were made in **Python**, with the help of **Jupyter Notebook** and the **Matplotlib** and **NumPy** libraries.

Have fun!

## Use cases

SGD is mainly used in mathematics and the engineering sciences. I will approach it with a Computer Science (Machine Learning) mindset.

The most common use of SGD is in Machine Learning (or Deep Learning), where it is used to search for the global minimum of a function, namely the global minimum of the cost function.

Note: *The algorithm does not always find the global minimum, but that is not in the scope of this article.*

# Gradient Descent

The simplest case of SGD:

- We have a function **C**. **C** takes one input **x**, as in **C(x)**.
- We can change only **x**, and we know the derivative **C' = dC/dx**.
- The task of the algorithm is to find the point **p** at which **C** outputs its smallest value (minimum). So, what is **p** in **argmin(C) = p**?

## Visualization

Let’s say we have the simple function **C(x) = x²**; our derivative would then be **dC/dx = 2x**. We call this function ‘cost’:
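As a minimal sketch (the notebook’s exact code may differ), the cost function and its derivative can be written as:

```python
def cost(x):
    """The cost function C(x) = x^2."""
    return x ** 2


def d_cost(x):
    """Its derivative dC/dx = 2x."""
    return 2 * x
```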

Then we choose a random point **p**, called the gradient point, which the algorithm will use as a starting point. On this plot, **p** can be seen as a red triangle, and I denoted the global minimum (the objective) with a green dot at the bottom of the function:

Then we must determine some hyperparameters of the algorithm before running it.

- How many epochs to take? (The number of optimization runs.)
- What should be the learning rate? **η** (eta) is a small number in **[0, 1]** that determines how big a step our algorithm takes in the direction of the found minimum. A too-large **η** can cause SGD to step over the minimum.

And that’s it! We are ready to see some code!

So, we first choose the point **p** to start from.
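A sketch of picking the starting point (the domain **[-10, 10]** and the seed are my assumptions, not taken from the notebook):

```python
import random

random.seed(42)  # for reproducibility


def cost(x):
    """The cost function C(x) = x^2."""
    return x ** 2


# choose a random starting point p = (x, C(x))
x = random.uniform(-10, 10)
y = cost(x)
```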

Then we define the optimization step:
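The optimization step itself can be sketched like this (a minimal version, assuming the derivative of **C(x) = x²** from above):

```python
def d_cost(x):
    """Derivative of C(x) = x^2."""
    return 2 * x


def optimization_step(x, eta=0.1):
    """One gradient descent step: move x against the gradient, scaled by eta."""
    dx = -eta * d_cost(x)
    return x + dx
```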

Mathematically the following happens:

- Let’s say that we have a random number from the domain of **C**, called **x**.
- Then we define **y = C(x)**. Together they represent the point **p(x, y)**.
- Then for each epoch:

3.1. The change of **x** is **dx = -η·dC(x)/dx**.

3.2. Update **x** for the next loop: **x = x + dx**.

3.3. The new **y** is **y = C(x)**.

- After the last epoch, the algorithm has arrived at the point **(x, y)**.
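The steps above can be put together into a small loop (a sketch, not the notebook’s exact code):

```python
def cost(x):
    """The cost function C(x) = x^2."""
    return x ** 2


def d_cost(x):
    """Its derivative dC/dx = 2x."""
    return 2 * x


def gradient_descent(x, eta=0.1, epochs=15):
    """Run the optimization steps and return the final point (x, y)."""
    for _ in range(epochs):
        dx = -eta * d_cost(x)  # 3.1: the change of x
        x = x + dx             # 3.2: update x for the next loop
    y = cost(x)                # 3.3: the new y
    return x, y
```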

If we run the algorithm with **epoch=15** and **η=0.1**, then on this plot you can see the change of **x** (red triangles) over time:
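A minimal Matplotlib sketch that reproduces such a plot (the headless backend, the starting point, the output filename, and the styling are my assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np


def cost(x):
    """The cost function C(x) = x^2."""
    return x ** 2


eta, epochs = 0.1, 15
x = 8.0      # starting gradient point
xs = [x]
for _ in range(epochs):
    x = x - eta * 2 * x  # dC/dx = 2x
    xs.append(x)

grid = np.linspace(-10, 10, 200)
plt.plot(grid, cost(grid), label="C(x) = x^2")
plt.scatter(xs, [cost(v) for v in xs], marker="^", color="red", label="gradient point")
plt.scatter([0], [0], color="green", label="global minimum")
plt.legend()
plt.savefig("gradient-descent-steps.png")
```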

As you can see, the gradient point was trying to get to the minimum point of the function (green dot), so the algorithm works!

# Gradient Descent with two Variables

*What if the function has more than one input variable? What if the function has a vector input?*

Then we must change SGD according to the number of variables. The algorithms I chose to show in this article cover only the one- and two-variable cases, so the general algorithm is not in the scope of this article.

The general algorithm involves linear algebra and is mostly based on matrix operations.

Then the function and its two partial derivatives would look like this:
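For example, with the surface **C(x1, x2) = x1² + x2²** (an assumed example; the notebook’s surface may differ), the function and its two partial derivatives can be written as:

```python
def cost(x1, x2):
    """The cost function C(x1, x2) = x1^2 + x2^2."""
    return x1 ** 2 + x2 ** 2


def d_cost_dx1(x1, x2):
    """Partial derivative dC/dx1 = 2*x1."""
    return 2 * x1


def d_cost_dx2(x1, x2):
    """Partial derivative dC/dx2 = 2*x2."""
    return 2 * x2
```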

## Differences between the two- and one-variable cases

The starting case for SGD is mostly the same, the changes can be seen here:

- The cost function **C(x)** changes to **C(x1, x2)**.
- The domain changes from a line of **x** to a surface of **x1** and **x2**.
- We must know two partial derivatives: **dC/dx1** and **dC/dx2**.

Those are all the changes we needed to make; here is the code:
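A sketch of the two-variable version (using the assumed example surface **C(x1, x2) = x1² + x2²** and an assumed domain of **[-10, 10]** for both inputs; the notebook’s code may differ):

```python
import random


def cost(x1, x2):
    """The cost function C(x1, x2) = x1^2 + x2^2."""
    return x1 ** 2 + x2 ** 2


def d_cost_dx1(x1, x2):
    """Partial derivative dC/dx1 = 2*x1."""
    return 2 * x1


def d_cost_dx2(x1, x2):
    """Partial derivative dC/dx2 = 2*x2."""
    return 2 * x2


def gradient_descent_2d(eta=0.1, epochs=15, seed=42):
    """Run gradient descent on two inputs and return (x1, x2, C(x1, x2))."""
    rng = random.Random(seed)
    # initialize both inputs randomly from the assumed domain
    x1 = rng.uniform(-10, 10)
    x2 = rng.uniform(-10, 10)
    for _ in range(epochs):
        # one change per input of C
        dx1 = -eta * d_cost_dx1(x1, x2)
        dx2 = -eta * d_cost_dx2(x1, x2)
        x1, x2 = x1 + dx1, x2 + dx2
    return x1, x2, cost(x1, x2)
```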

Changes compared to the one variable case:

- We initialize two random variables from the **x1** and **x2** domains of **C**.
- Then, in the optimization step, instead of just getting **dx**, we must get **dx1** and **dx2**, one change for each input **C** has: **dx1 = -η·dC(x1, x2)/dx1** and **dx2 = -η·dC(x1, x2)/dx2**.

## Visualization

When placing down the random starting point:

Where again, the green dot represents the minimum, and the red triangle represents the gradient point’s starting position.

Then after applying **epoch=15** and **η=0.1**:

From above the function:

As you can see, the algorithm for two variables also works; we just needed to make a few changes in order to optimize two inputs instead of one.

# Epilogue

I hope I helped you understand SGD better, or made you interested in the whole area of mathematics and computer science.

All the code in this article can be found here: https://github.com/MartinKondor/Notebooks

Notebook of the one variable case: https://github.com/MartinKondor/Notebooks/blob/main/gradient-descent-2d.ipynb

Notebook for the two variable case: https://github.com/MartinKondor/Notebooks/blob/main/gradient-descent-3d.ipynb

**Author**: Martin Kondor.

## Sources

- Notebooks & Visualizations — https://github.com/MartinKondor/Notebooks
- Machine Learning — https://github.com/MartinKondor/MachineLearning