Gradient Descent Visualization

Visualize SGD optimization algorithm with Python & Jupyter


In this article I aim to provide a good visual perspective on the Stochastic Gradient Descent (SGD for short) algorithm, which will hopefully give you a solid understanding of it.

Some of the math is presented in this article, but I advise you to read other materials for a deeper understanding.

Visualizations were made in Python, with the help of Jupyter Notebook and the Matplotlib and NumPy libraries.

Have fun!

Use cases

SGD is mainly used in mathematics and the engineering sciences. I plan to approach SGD with a Computer Science (Machine Learning) mindset.

The most common use of SGD is in Machine Learning (or Deep Learning), where it is used to search for a function's global minimum: namely, the global minimum of the cost function.

Note: The algorithm does not always find the global minimum, but that is beyond the scope of this article.

Gradient Descent

The simplest case of SGD:

  • We have a function C.
  • C takes one input x, as in C(x).
  • We can change only x, and we know the derivative C’ = dC/dx.
  • The task of the algorithm is to find the point p at which C outputs its smallest value (its minimum). In other words, find p = argmin(C).


Let’s say we have the simple function C(x) = x²; our derivative is then dC/dx = 2x. We call this function the ‘cost’:

Cost function & its derivative plotted
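The original notebook code is embedded as a gist; a minimal sketch of this setup looks like the following (the names cost and d_cost are my own illustrative choices, not necessarily those used in the notebook):

```python
import numpy as np

# Cost function C(x) = x^2 and its derivative dC/dx = 2x
def cost(x):
    return x ** 2

def d_cost(x):
    return 2 * x

# Sample the domain to plot both curves (the Matplotlib calls are omitted;
# xs/ys would be passed to plt.plot)
xs = np.linspace(-3.0, 3.0, 100)
ys = cost(xs)
dys = d_cost(xs)
```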

Then we choose a random point p, called the gradient point, which the algorithm will use as a starting point. On this plot, p is shown as a red triangle, and the global minimum (the objective) is marked with a green dot at the bottom of the function:

Cost function, its gradient point and its global minimum

Before running the algorithm, we must determine some of its hyperparameters:

  • How many epochs to take? (The number of optimization steps.)
  • What should be the learning rate? (η (eta) is a small number in [0, 1] that determines how big a step the algorithm takes in the direction of the found minimum. Too large an η can cause SGD to step over the minimum.)

And that’s it! We are ready to see some code!

So, we first choose the point p to start from.

Then we define the optimization step:

Mathematically the following happens:

  1. Say that we have a random number x from the domain of C.
  2. We define y = C(x). Together they represent the point p(x, y).
  3. Then for each epoch:
    3.1. The change of x is dx = -η·dC(x)/dx.
    3.2. The new y is y = C(x + dx).
    3.3. Update x for the next loop: x = x + dx.
  4. After all epochs have finished, the algorithm has arrived at the point (x, y).
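The steps above can be sketched as a short Python loop (the function name sgd and the starting point x = 2.0 are illustrative assumptions, not taken from the notebook):

```python
def cost(x):
    return x ** 2          # C(x) = x^2

def d_cost(x):
    return 2 * x           # dC/dx = 2x

def sgd(x, epochs=15, eta=0.1):
    """Repeat the update x <- x - eta * dC/dx, recording each point."""
    history = [(x, cost(x))]              # step 2: starting point p(x, y)
    for _ in range(epochs):
        dx = -eta * d_cost(x)             # step 3.1: scaled negative gradient
        x = x + dx                        # step 3.3: move toward the minimum
        history.append((x, cost(x)))      # step 3.2: the new y = C(x)
    return x, history

x_final, points = sgd(x=2.0)
```

Plotting the x values stored in history against cost reproduces the trail of red triangles shown below.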

If we run the algorithm with epochs=15 and η=0.1, this plot shows the change of x (red triangles) over time:

Change of the gradient values over time

As you can see, the gradient point moved toward the minimum point of the function (green dot), so the algorithm worked correctly!

Gradient Descent with two Variables

What if the function has more than one input variable? What if the function has a vector input?

Then we must change SGD according to the number of variables. The algorithms I chose to show in this article cover only the one- and two-variable cases, so the general algorithm is beyond the scope of this article.

The general algorithm involves linear algebra and is mostly based on matrix operations.

Then the function and its two partial derivatives would look like this:

3D Cost function with the two partial derivative functions
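The plotted surface is not stated explicitly in the text; assuming a simple paraboloid C(x1, x2) = x1² + x2² (my choice for illustration, matching the bowl shape in the figure), the function and its two partial derivatives can be written as:

```python
# Assumed cost C(x1, x2) = x1^2 + x2^2 (illustrative choice)
def cost(x1, x2):
    return x1 ** 2 + x2 ** 2

# Partial derivative dC/dx1 = 2*x1
def d_cost_x1(x1, x2):
    return 2 * x1

# Partial derivative dC/dx2 = 2*x2
def d_cost_x2(x1, x2):
    return 2 * x2
```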

Differences between the two- and one-variable cases

The starting setup for SGD is mostly the same; the changes are:

  • The cost function C(x) changed to C(x1, x2).
  • The domain is no longer a line of x, but a surface of x1 and x2.
  • We must know two partial derivatives: dC/dx1 and dC/dx2.

That’s all the changes we needed to make, then here is the code:

SGD with two variables

Changes compared to the one-variable case:

  • We initialize two random variables from the x1 and x2 domains of C.
  • Then in the optimization step:
    - Instead of just computing dx, we must compute dx1 and dx2, one change for each input C has. Then dx1 = -η·dC(x1, x2)/dx1 and dx2 = -η·dC(x1, x2)/dx2.
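Putting these changes together, a two-variable version of the loop might look like this (the paraboloid cost and the starting point (2.0, -1.5) are my assumptions for illustration):

```python
def cost(x1, x2):
    return x1 ** 2 + x2 ** 2   # assumed cost C(x1, x2) = x1^2 + x2^2

def d_cost_x1(x1, x2):
    return 2 * x1              # dC/dx1

def d_cost_x2(x1, x2):
    return 2 * x2              # dC/dx2

def sgd_2d(x1, x2, epochs=15, eta=0.1):
    """One gradient step per input variable, each epoch."""
    history = [(x1, x2, cost(x1, x2))]
    for _ in range(epochs):
        dx1 = -eta * d_cost_x1(x1, x2)    # step along x1
        dx2 = -eta * d_cost_x2(x1, x2)    # step along x2
        x1, x2 = x1 + dx1, x2 + dx2
        history.append((x1, x2, cost(x1, x2)))
    return (x1, x2), history

(p1, p2), points = sgd_2d(2.0, -1.5)
```

The recorded points trace the descent path that the 3D plots below visualize.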


When placing down the first random point (the starting point):

SGD first step

Where again, the green dot represents the minimum, and the red triangle represents the gradient’s first point.

Then after applying epochs=15 and η=0.1:

Change of gradient point

From above the function:

Gradient descent from above the function

As you can see, the algorithm also works for two variables; we just needed to make a few changes in order to optimize two inputs instead of one.


I hope I helped you understand SGD better, or made you interested in the broader area of mathematics and computer science.

All the code in this article can be found here:

Notebook of the one variable case:

Notebook for the two variable case:

Author: Martin Kondor.



