Optimization in Machine Learning — Part 1

No matter what kind of Machine Learning model you’re working on, you need to optimize it, and in this blog, we’ll learn how exactly optimization works.

Abhishek Chatterjee
The Startup
6 min read · Sep 13, 2020


Optimization is one of the most important steps in Machine Learning, and possibly also the hardest to learn. The optimizer is a function that optimizes a Machine Learning model using training data. The optimizer uses a Loss Function to calculate the loss of the model and then, based on that, tries to optimize it. So without an optimizer, a Machine Learning model can’t do anything amazing.

In this blog, my aim is to explain how optimization works: the logic behind it and the math behind it. The focus is on the mathematical/logical explanation rather than code, though I’ll include a few tiny code sketches along the way to make the ideas concrete.

This is the first part of this series of blogs on Optimization in Machine Learning. In this blog, I’ll explain optimization in an ultra-simple way with a stupid example. This is especially helpful for absolute beginners who have no idea how optimization works.

As I mentioned earlier, the optimizer uses a Loss Function to calculate the loss of the model, and then based on that the optimizer updates the model to achieve a better score. So let’s understand the Loss Function first.

So what the hell is a Loss Function?

A Loss Function (also known as an Error Function, Cost Function, or Energy Function) is a function that calculates how good/bad a Machine Learning model is. So if you train a model, you can use a loss function to calculate its error rate. If the error is 0, then your model is perfect.

In real-world projects, it is impossible to achieve an error of 0, so the aim is always to get as close to 0 as possible.

How to calculate it?

There are several ways to calculate the loss of a model using different Loss Functions. As this blog is about Optimizers, I don't want to spend too much time on Loss Functions, but basically, a Loss Function takes the predicted values from the model and the actual values for those inputs, and then performs some calculation to find the error rate.

Mean Squared Error

One popular way to calculate the error rate of a model is called MSE or Mean Squared Error. In Mean Squared Error, we calculate the mean of the squared differences between the predicted values and the actual values over all inputs. The mathematical formula is given below.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

The formula for Mean Squared Error, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of inputs.

MSE outputs a single number that is always greater than or equal to zero; the lowest value we can achieve is 0.
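To make this concrete, here is a minimal sketch of MSE in plain Python (the function name and the data are mine, just for illustration):

```python
def mean_squared_error(actual, predicted):
    # Mean of the squared differences between actual and predicted values.
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0 (a perfect model)
print(mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.5, 2.0]))  # 0.5 (some error)
```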

Now that we have some understanding of Loss Functions, let’s jump into optimization.

Optimization Begins

Once we have the Loss Function, we have a measure of how good/bad a model is, and the optimizer can now reduce the error reported by the Loss Function, and hence optimize the model.

This part of the blog is heavily inspired by a YouTube video by Brandon Rohrer. Please check out his video on YouTube for a better explanation.

Let’s introduce a refreshing problem

Let’s say we’re making tea. The recipe for making tea is very simple: we just need to boil some water, add some tea, some milk, and some sugar, and that’s it, we have some tea.

Now let’s focus on the sugar part. If we add too much sugar to the tea, then it tastes bad, and on the other hand, if we add very little sugar to it, it also tastes bad. Only if we add the perfect amount of sugar does it taste perfect👌.

Based on this, if we plot a graph where the X-axis represents the amount of sugar and the Y-axis represents how bad the taste of the tea is, then it looks something like this. Here in the graph, the lowest point is the sweet spot.

[Figure: a curve whose lowest point is the sweet spot. The X-axis represents the amount of sugar and the Y-axis represents how bad the taste is.]

Now, as I’m stupid, I’m not going to ask my mom how much sugar to add to tea; instead, I decided to write a Machine Learning-based solution that can tell me how much sugar to add for that “perfect tea”.

Let’s optimize it

Here the goal is basically to find the amount of sugar needed to make the perfect tea. So the job of the optimizer here is to find the optimal amount of sugar needed for the “perfect tea”.

There are different ways to optimize a problem. The first, and possibly the most inefficient, way to do this is by using something called “Exhaustive Search”.

In the Exhaustive Search algorithm, we basically just look for the lowest point on the graph.

However, in real-world situations, we don’t have a graph like that. So that means we need some data, or in other words, I need to make a lot of tea, measure the amount of sugar that I added, ask my virtual girlfriend😭 to taste it, and then store her feedback.

For real-world problems, Loss Functions are used here. As I explained earlier, a Loss Function is a function that tells how good/bad our model is. So in this case, it can tell how good/bad the tea is.

So after repeating this step hundreds of times, I have the amount of sugar in one column and the feedback (some kind of score for the tea; for some weird reason, a higher score means the tea is more awful) in a second column.

Once I have the data, the Exhaustive Search algorithm can scan through it and find the optimal solution, or in simple words, how much sugar to add to the tea.

The Exhaustive Search algorithm is very simple and robust, but on the other hand, it is computationally very expensive, as we have to check all possible solutions to find the optimal one. The complexity is so high that for many real-world problems we just cannot use this algorithm.
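Here is a minimal sketch of that scan in Python (the measurements are made up for illustration):

```python
# Hypothetical measurements: (grams of sugar, badness score from the taster).
# Remember: a higher score means the tea is more awful.
measurements = [(0, 9.0), (5, 6.5), (10, 3.2), (15, 1.1), (20, 2.8), (25, 7.4)]

# Exhaustive Search: check every recorded solution, keep the one with the lowest score.
best_sugar, best_score = min(measurements, key=lambda pair: pair[1])
print(f"Optimal amount of sugar: {best_sugar}g (badness score: {best_score})")
```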

The other possible way to optimize is by using Gradient Descent. Gradient Descent uses partial derivatives to calculate the slope at any point on the curve, and then, based on that, it changes the amount of sugar to find the optimal solution (the best tea).
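In symbols, one step of Gradient Descent looks something like this (η is the step size, a symbol I’m introducing here; we’ll meet it again as the Learning Rate in the next part):

$$\text{sugar}_{t+1} = \text{sugar}_{t} - \eta \cdot \frac{\partial\, \text{Loss}}{\partial\, \text{sugar}}$$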

Currently, Gradient Descent is the technique used most often in the industry, so to understand it properly, let’s dive a little deeper.

If you want to learn Gradient Descent with some mathematical details, then please jump to the second part of this blog; here, I’ll explain it in an ultra-simple way without too much math.

Gradient Descent in simple words

In the graph above, let’s pick a random point. To make the tea, I’ll add the amount of sugar that this point corresponds to (remember, the X-axis represents the amount of sugar).

[Figure: the same curve with one randomly picked point. The X-axis represents the amount of sugar and the Y-axis represents how bad the taste is.]

Once I make the tea, I can ask my virtual girlfriend😭 to taste it and store her feedback.

Now let’s pick a few more random points around the old one. Again, for each point, I’ll make the tea with that amount of sugar and check its quality.

[Figure: the same curve with a few more points picked around the first one. The X-axis represents the amount of sugar and the Y-axis represents how bad the taste is.]

Now, based on the data, I can check which tea is best, and I can figure out in which direction, relative to my first selected point, I should move to make better tea.

Clearly, based on the graph, we need to go down (remember the lowest point is the sweet spot).

Now we can repeat this several times to get very close to that sweet spot and hence make the perfect tea.
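Here is a toy sketch of this whole loop in Python. The badness function is a made-up stand-in for my virtual girlfriend’s feedback, and the step size is a knob I’m introducing here (it will show up as the Learning Rate in the next part):

```python
def badness(sugar):
    # Made-up stand-in for the taster's feedback: lowest (best) around 15g of sugar.
    return (sugar - 15) ** 2 / 10

sugar = 2.0        # a random starting point
step_size = 0.5    # how far we move each round

for _ in range(100):
    # "Taste" a nearby point to estimate the slope of the curve at the current point.
    slope = (badness(sugar + 0.1) - badness(sugar)) / 0.1
    # Move downhill: against the slope.
    sugar -= step_size * slope

print(f"Estimated sweet spot: {sugar:.2f}g of sugar")  # ends up very close to 15g
```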

Happy Ending

The explanation of Gradient Descent here is extremely simplified; to make it easy to understand, I’ve skipped several important concepts.

This is part one of the series of blogs on Optimization in Machine Learning. In the next part, I’ll explain Gradient Descent with a real-world example, with the complete derivation and all the important topics like Learning Rate, Momentum, etc.

Currently, there is no second part of this blog; I’m still writing it, so once it’s ready, I’ll put a link here.
