Mastering Gradient Descent: Optimizing Neural Networks with Precision.

om pramod
9 min read · Mar 10, 2024


Part 2: Diving Deeper into Optimization Landscapes.

In the context of gradient descent, moving in the direction of the negative gradient (i.e., away from the gradient) at the current point leads toward a local minimum, while moving in the direction of the positive gradient (i.e., along the gradient) leads toward a local maximum. Let’s break it down:

In terms of the gradient, a local minimum is a point where the gradient is zero and the function increases as we move away from the point in any direction. The gradient measures the slope of the function: at a local minimum the slope is flat, and the surrounding slopes point uphill. By repeatedly moving in the direction of the negative gradient, we move downhill and eventually settle at such a flat point. If the function is convex, moving along the negative gradient will lead the algorithm to a local minimum that is also the global minimum, because in convex functions any local minimum is the global minimum.

In terms of the gradient, a local maximum is a point where the gradient is zero and the function decreases as we move away from the point in any direction. At a local maximum the slope is flat, and the surrounding slopes point downhill. By repeatedly moving in the direction of the positive gradient (a procedure known as gradient ascent), we move uphill and eventually settle at such a point.

Let’s consider a simple function, f(x) = x², which is a convex function.

  1. Moving Towards a Negative Gradient: Let’s say we start at x = 2. The derivative of f(x) at x = 2 is f’(2) = 2*2 = 4. This is the slope of the function at x = 2, and it tells us that the function is increasing at this point. If we move in the direction of the negative gradient (i.e., we subtract a small amount from x), we move towards x = 0, which is the local (and global) minimum of the function.
  2. Moving Towards a Positive Gradient: Now let’s say we start at x = -2. The derivative of f(x) at x = -2 is f’(-2) = 2*(-2) = -4. This tells us that the function is decreasing at this point. If we move in the direction of the gradient (i.e., we subtract a small amount from x, since the gradient here is negative), we move away from x = 0 and the function value increases. Because f(x) = x² is convex, it has no maximum, so gradient ascent climbs without bound; on a function that does have a peak, the same uphill steps would carry us to a local maximum.

To emphasize, the best way to define the local minimum or local maximum of a function using gradient descent is as follows:

  • Moving in the direction of the negative gradient (i.e., away from the gradient) at the current point leads to a local minimum of the function.
  • Moving in the direction of the positive gradient (i.e., along the gradient) at the current point leads to a local maximum of the function.
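The two cases above can be sketched in a few lines of Python (an illustrative sketch; the learning rate and step counts are arbitrary choices, not from the article):

```python
# Illustrative: one gradient-descent run and one gradient-ascent run on
# f(x) = x**2, whose derivative is 2*x.

def f_prime(x):
    return 2 * x  # derivative of f(x) = x**2

lr = 0.1  # learning rate (step size)

# Gradient descent from x = 2: subtract the gradient, move toward the minimum at 0.
x = 2.0
for _ in range(100):
    x = x - lr * f_prime(x)
print(round(x, 4))  # close to 0, the global minimum

# Gradient ascent from x = -2: add the gradient. Because f(x) = x**2 has no
# maximum, x drifts away from 0 and f(x) grows without bound.
x = -2.0
for _ in range(10):
    x = x + lr * f_prime(x)
print(x)  # further from 0 than where we started
```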
Working of Gradient Descent in Machine Learning. Reference

At the start of the algorithm, we typically initialize the parameters at some arbitrary point. Far from the minimum, the cost function is usually steep, meaning it has a large slope. As we update the parameters and move down the cost function in the direction of steepest descent (i.e., the negative gradient), the slope gradually becomes less steep.

Eventually, as we get closer to the minimum of the cost function, the slope approaches zero. This is because at the minimum point, the cost function is flat, and so the slope or gradient is zero. This point is known as the point of convergence, and it represents the optimal parameters that minimize the cost function.
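In code, this point of convergence is usually detected by stopping once the gradient is nearly zero. A minimal sketch (the tolerance and learning rate here are illustrative choices):

```python
# Illustrative convergence loop: iterate until the slope is nearly zero,
# which signals we have reached the flat region around the minimum.

def grad(x):
    return 2 * x  # slope of f(x) = x**2

x, lr, tol = 5.0, 0.2, 1e-6
steps = 0
while abs(grad(x)) > tol:
    x -= lr * grad(x)  # step opposite the gradient
    steps += 1
print(steps, x)  # at termination, the gradient is below the tolerance
```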

The Gradient Descent algorithm relies on two main properties of the function it optimizes: differentiability and convexity. Differentiability is required for the gradient to exist at all, and convexity guarantees that the algorithm converges to the global minimum rather than a merely local one.

A function is said to be differentiable at a point if it has a derivative at that point. If a function is differentiable, it has a derivative at every point in its domain. In other words, at every point in the domain there exists a well-defined tangent line, and the derivative gives the slope of that tangent line. This means the function must be smooth (i.e., have no holes, jumps, or sharp turns) around that point. If a function is not differentiable at a point, the derivative does not exist there, and traditional Gradient Descent cannot be applied directly.

To learn more, consider checking the information available in the provided links: Differentiable

Examples of differentiable functions. Reference

The first graph represents a quadratic function f(x) = x². The derivative of this function is f’(x) = 2x, which means the slope of the function at any point x is 2x. This function is differentiable because it’s smooth everywhere and its derivative exists for all x. The second graph depicts a trigonometric function f(x) = 3sin(x). The derivative of this function is f’(x) = 3cos(x), which represents the rate of change of the sine wave. This function is differentiable because it’s smooth and continuous, and its derivative exists for all x. The third graph illustrates a cubic function f(x) = x³ - 5x. This curve crosses through the origin and has one peak and one trough within the visible range. The derivative of this function is f’(x) = 3x² - 5, which gives the slope of the function at any point x. This function is differentiable because it’s smooth everywhere and its derivative exists for all x.

Examples of non-differentiable functions. Reference

The first graph represents the function f(x) = x/|x|. This function is not differentiable at x = 0 because it has a jump discontinuity at this point. The function abruptly changes value from -1 to 1 as x crosses zero. Since the function isn’t smooth at x = 0, its derivative doesn’t exist at that point. The second graph depicts the function f(x) = sqrt(|x|). This function is not differentiable at x = 0 because it has a cusp, a point where the graph has a sharp turn, at this point. The derivative of a function at a certain point gives the slope of the function at that point, and in the case of a cusp, the slope isn’t defined because the function isn’t smooth at that point. The third graph illustrates the function f(x) = 1/x. This function is not differentiable at x = 0 because it approaches infinity as x approaches zero from both sides, resulting in an infinite discontinuity. Since the function isn’t defined at x = 0, its derivative doesn’t exist at that point.
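One way to see non-differentiability numerically is to compare one-sided difference quotients: at a point where a function is differentiable they agree, while at a cusp they do not. A hedged sketch, using abs(x) as a stand-in for the cusped functions above:

```python
# Illustrative: one-sided difference quotients approximate the left- and
# right-hand derivatives; a mismatch indicates a cusp or jump.

def left_slope(f, x, h=1e-6):
    return (f(x) - f(x - h)) / h

def right_slope(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

smooth = lambda x: x ** 2   # differentiable everywhere
cusp = lambda x: abs(x)     # sharp turn at x = 0, like sqrt(|x|) above

print(left_slope(smooth, 0.0), right_slope(smooth, 0.0))  # both near 0: differentiable
print(left_slope(cusp, 0.0), right_slope(cusp, 0.0))      # -1.0 vs 1.0: not differentiable
```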

The next requirement is that the function has to be convex. Roughly speaking, a function is convex if the line segment between any two points on its graph lies above or on the graph (it never crosses below it). In a convex function, any local minimum is also a global minimum: because the function curves upwards, any point where it stops decreasing and starts increasing must be the lowest point overall. For example, in the function f(x) = x², the point x = 0 is both a local and global minimum. Non-convex functions, by contrast, can have multiple local minima and maxima.

Examples of Convex Functions.
Examples of Non-Convex Functions.
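The "line segment lies above or on the graph" condition can be checked numerically at a midpoint. An illustrative sketch (the non-convex example function is chosen for this demo, not taken from the figures):

```python
# Illustrative chord test: for a convex function, the value at the midpoint
# of any two inputs lies at or below the chord connecting them.

def chord_above(f, a, b):
    mid = (a + b) / 2
    return f(mid) <= (f(a) + f(b)) / 2

convex = lambda x: x ** 2
nonconvex = lambda x: x ** 4 - 3 * x ** 2  # dips on both sides of a hump at 0

print(chord_above(convex, -3, 5))          # True: chord stays above the graph
print(chord_above(nonconvex, -0.5, 0.5))   # False: the hump rises above the chord
```

Note that passing this test for one pair of points does not prove convexity; failing it for any pair disproves it.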

However, not all functions are both differentiable and convex. For example, the function f(x) = |x| is convex, but it is not differentiable at x = 0 because the left-hand derivative (-1) and the right-hand derivative (+1) at this point are different.

Please refer to the provided links for more in-depth information: Concave Upward and Downward

To understand the difference between local minima and global minima, take a look at the figure below. In a graphical representation of a function, the global minimum is the lowest point overall, while local minima may appear in various regions of the graph.

Reference

The global minimum of a function is the absolute lowest value that the function takes over its entire domain. It represents the lowest point of the entire function, considering all possible input values. The global minimum is the point where the function has the smallest output compared to all other points in its domain. A continuous function has only one global minimum value, although that value may be attained at more than one point.

However, gradient descent does not guarantee finding the global minimum and can get stuck at a local minimum. In optimization problems, finding the global minimum is often the primary objective, as it corresponds to the best possible solution in the entire solution space.

A local minimum is the smallest value of a function in a specific neighborhood around a particular point. It is a point where the function is lower than its nearby points but not necessarily lower than all points in the entire domain. A function may have multiple local minima, each corresponding to a minimum within a specific range of input values. Local minima are optimal within the specific neighborhoods in which they occur. However, they may not be optimal globally.
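To see local versus global minima in action, here is a small sketch (the function and learning rate are illustrative choices, not from the article): the same gradient-descent loop lands in different minima depending on the starting point.

```python
# Illustrative: on a non-convex curve, gradient descent converges to
# whichever minimum lies downhill from its starting point.

def f(x):
    return x ** 4 - 3 * x ** 2 + x  # two minima; the deeper one is near x ≈ -1.3

def grad(x):
    return 4 * x ** 3 - 6 * x + 1   # derivative of f

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(round(descend(2.0), 3))   # settles near the shallower local minimum (~1.13)
print(round(descend(-2.0), 3))  # settles near the deeper global minimum (~-1.30)
```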

For a more comprehensive understanding, you can explore the details available in the following links.

Second Derivative

Finding Maxima and Minima using Derivatives

Let’s consider a different example to illustrate the concept of Gradient Descent using the analogy of finding the lowest point in a valley when you are in a dense fog.

Imagine you’re standing in a valley surrounded by hills and you are tasked with finding the lowest point. Due to the dense fog, you can’t see the entire landscape, so you don’t know where the absolute lowest point is from just looking around. However, you can feel the slope of the ground beneath your feet.

Reference

You start by randomly choosing a direction to move in. Here’s how you might proceed:

  1. Feeling the Slope: You feel the ground to determine the direction in which the ground slopes downward.
  2. Taking a Step: You take a step in that direction because it will lead you downwards.
  3. Reassess and Repeat: After each step, you feel for the slope again and take another step in the direction that leads downwards.
  4. Reaching a Flat Point: Eventually, you find a spot where no matter which direction you step, the ground slopes upwards. This suggests you are at a local minimum.

Now suppose that, after a while, every direction you test leads back uphill. You may think you’ve found the lowest point, but you might actually be in a small dip (a local minimum) rather than the valley’s global minimum. In the dense fog, without being able to see the big picture, it’s hard to tell.

To ensure you find the global minimum, you might:

  • Explore More Broadly: Instead of stopping at the first flat spot, you explore further in various directions to see if there’s an even lower point beyond the small dip you found.
  • Start from Different Places: You might start your “walk” from different points in the valley to see if you end up at a different low point.
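The "start from different places" idea corresponds to random restarts, sketched below (illustrative; the example function, seed, and number of restarts are arbitrary choices):

```python
# Illustrative random-restart strategy: run gradient descent from several
# random starting points and keep the lowest result found.

import random

def f(x):
    return x ** 4 - 3 * x ** 2 + x  # non-convex: two minima of different depths

def grad(x):
    return 4 * x ** 3 - 6 * x + 1

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

random.seed(0)
starts = [random.uniform(-2, 2) for _ in range(5)]
best = min((descend(x0) for x0 in starts), key=f)
print(round(best, 2))  # the deepest minimum found across all restarts
```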

This process is akin to what happens during gradient descent in machine learning:

  • You start by randomly choosing a direction to move in. This is equivalent to initializing the parameters of a machine learning model with random values.
  • The current position is the current values of the parameters of the model.
  • The slope of the ground is the gradient of the loss function with respect to the parameters.
  • The steps you take are the updates to the parameters in the opposite direction of the gradient, scaled by the learning rate.
  • The local minimum represents a point where the model’s performance is better than in the immediate vicinity, but it might not be the best possible model.
  • The global minimum represents the best possible model with the lowest loss on the entire landscape.
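The mapping above can be condensed into a tiny sketch, assuming a one-parameter linear model with squared-error loss (the data and learning rate are made up for illustration):

```python
# Illustrative: each update moves the parameter against the gradient of the
# loss, scaled by the learning rate, until the loss stops decreasing.

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated by y = 2x, so the best w is 2

w, lr = 0.0, 0.05             # arbitrary starting point and small learning rate
for _ in range(200):
    # gradient of mean squared error (1/n) * sum((w*x - y)**2) w.r.t. w
    g = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * g               # step opposite the gradient
print(round(w, 3))  # approaches 2.0, the loss-minimizing parameter
```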

Just like in the foggy valley, in a complex loss function landscape it’s possible to get stuck in a local minimum and mistake it for the global minimum. At the end of this article, we’ll see how to solve this problem.

Closing note — As we conclude Part 2, you’ve ventured deeper into the intricate world of optimization. Let these insights kindle your enthusiasm for mastering machine learning’s nuanced techniques. Stay engaged, stay inquisitive. Part 3 beckons with more revelations. Until then, keep exploring, keep evolving, and let’s unravel the mysteries of machine learning together!

