An Intuitive Approach to Linear Regression
Note: this article focuses on the mathematics behind linear regression. You will need to understand the basics of partial derivatives and some linear algebra. If you are not too comfortable with these topics, bookmark this article, watch some Khan Academy videos and then have a read. You’ll thank me later.
Have you ever felt like you never properly understood gradient descent, regression or loss functions? Did you brush these concepts aside and focus entirely on the coding? Yeah, me too.
To understand what’s going on behind the algorithms, we must consider the math. I’ll be honest; machine learning mathematics is difficult. However, once we break down each concept and take a step-by-step approach, it will feel like a new world of understanding just emerged!
Today we will look at a simple linear regression model to grasp all of the math going on behind the scenes. Let’s get started, shall we?
What is Linear Regression?
Regression is a form of predictive analysis that examines the relationship between one or more independent variables and a dependent variable. It’s basically like a function: some value of x goes in, that value is transformed using coefficients like a slope and an intercept, and finally another value comes out.
Linear regression is a technique that is used when the shape of the dataset best resembles a straight line.
There are several use cases for linear regression. For instance, it can be used to predict the sales of a product, pricing, performance, or the salaries of prospective employees.
In fact, we will learn about one of its applications today with a challenge proposed by Toby Flenderson of Dunder Mifflin.
Toby is having a hard time deciding on the perfect employee to hire after meeting several people at a job fair. To gain some more insight, Toby asked us to use machine learning to help him predict the salaries of various employees. He has provided us with a dataset consisting of the years of experience and salary for previous employees at Dunder Mifflin.
Our goal for this challenge is to find the equation of a line that best fits the data points so that we can make accurate predictions of the salaries for new employees. Let’s take a step-by-step approach and see how we can solve this problem.
1. Examine the Dataset
Since we are only dealing with one feature (years of experience), we can visualize the data on a two-dimensional plane.
We can clearly see that the data points follow a roughly linear trend. For this reason, linear regression is a good fit for this dataset.
Unfortunately, most datasets won’t consist of only one or two features. Instead, they will include several features (x1, x2, x3…xn), turning the problem into a multi-dimensional one that is difficult to visualize. You will find yourself skipping this step most of the time since we cannot visualize anything past 3D.
2. Define Your Model
Remember y = mx + b? Yes, this equation will be the basis for our model with only a minor change in the name of the variables: the slope coefficient is usually captured in the matrix/vector called W and the intercept as the bias.
Why is W a vector, you might ask? If our dataset has multiple features, then each feature is assigned its own weight. For this reason, all of the weights are usually captured in a vector to make the equation more concise. For example, the linear equation for a dataset with two features in long form will look like this:
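Assuming the two features are called x₁ and x₂ and their weights w₁ and w₂ (names chosen here purely for illustration), the long form and its vector shorthand are usually written as:

```latex
\hat{y} = w_1 x_1 + w_2 x_2 + b
\qquad\Longleftrightarrow\qquad
\hat{y} = W \cdot X + b,
\quad W = \begin{bmatrix} w_1 & w_2 \end{bmatrix},
\quad X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
```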
But because Toby gave us only one feature (years of experience), we will leave w as a scalar value since there is only one weight value to be multiplied.
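To make the model concrete, here is a minimal Python sketch of both forms. The function names `predict` and `predict_multi` and the sample numbers are assumptions made for illustration; they are not part of Toby’s dataset.

```python
def predict(x, w, b):
    """Single-feature model: y_hat = w * x + b."""
    return w * x + b


def predict_multi(features, weights, b):
    """Multi-feature model: y_hat = w1*x1 + w2*x2 + ... + b."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(weights, features)) + b


# Hypothetical values, chosen only to show the call pattern.
print(predict(5, w=3, b=4))                          # 3*5 + 4
print(predict_multi([5, 2], weights=[3, 10], b=4))   # 3*5 + 10*2 + 4
```

With one feature, the weight vector collapses to a single number, which is why a scalar w is enough for Toby’s problem.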
3. Define a Loss Function (MSE)
OK, we now know the general equation for a line. But we have no idea what weight and bias will give us the best-fitting line. How can we find that?
What if we tried setting a completely random weight and bias for our line? What if there was a way to calculate the difference between the actual point on our dataset and the predicted points on our line? Well, there is! Behold the loss function:
A loss function allows us to calculate the error between our current model’s predictions and the actual points in our dataset. In other words, it tells us how poorly our model is performing.
For linear regression, the Mean Squared Error (MSE) loss function is commonly used. Essentially, the MSE measures the average of the squared residuals (the differences between actual and predicted values). To get a more intuitive understanding, let’s dive deeper into what each variable means.
- Y = the actual data point
- ŷ (pronounced y hat) = the predicted data point
- n = the total number of data points
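Written out with the three symbols above, the standard form of the MSE is the following (the index i, which runs over the data points, is notation added here):

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{y}_i \right)^2
```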
The MSE function operates as follows:
- takes the difference between each actual value (Y) and its predicted value (ŷ)
- squares each difference in order to get rid of any negative signs
- adds up the squared differences and divides the sum by the number of data points (n), which gives the average error per point
The key idea to note here is that the difference between Y and ŷ gives us a quantitative measure of how badly our current model is performing.
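The three steps above translate almost line for line into code. This is a minimal sketch, not the code behind the article’s plots: the `years` and `salaries` lists are made-up numbers standing in for Toby’s dataset, and the weight 3 and bias 4 are simply an arbitrary starting guess.

```python
def predict(x, w, b):
    """Linear model: y_hat = w * x + b."""
    return w * x + b


def mse(xs, ys, w, b):
    """Mean squared error of the line (w, b) over all data points."""
    n = len(xs)
    squared_errors = [(y - predict(x, w, b)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / n


# Made-up (years of experience, salary) pairs; NOT Toby's dataset.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salaries = [40000.0, 50000.0, 60000.0, 70000.0, 80000.0]

# A poor first guess produces a very large error.
print(mse(years, salaries, w=3, b=4))

# A line that passes through every point produces zero error.
print(mse(years, salaries, w=10000, b=30000))
```

Because the errors are squared, the MSE is measured in squared salary units, which is why the numbers look so large.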
Let’s see our MSE function in action! I’ll pick 3 for our random weight value and 4 for our random bias value to calculate the MSE.
Yikes! Toby will not be happy with this model. As you can see, the error we received is extremely large (6.5 * 10⁹). That makes sense, since the points on the line are very far from the actual points. But this is a decent start.
The next question we should ask is: how can we lower our error value? It would make sense to choose the weight and bias that give us the least error possible, right?
Some calculus minds might be jumping to the first derivative test, but hold your horses. Let’s take this idea a bit further in the next step. Let’s talk about optimization.
4. Apply Gradient Descent (GD)
Ah! A concept that seems incredibly daunting to the untrained eye. But don’t worry, we will go through each step slowly to ensure we get maximum understanding.
Gradient Descent is an iterative optimization algorithm that searches for a local minimum of a differentiable function. In other words, Gradient Descent will find a local minimum in our MSE Loss Function.
Why do we need to find a local minimum? That’s because it will tell us the optimal weight and bias used to acquire a low error value. We will then use that weight and bias in our final equation for our line of best fit.
Right now, let’s focus again on our error function. Remember quadratics? Let’s say we have this function:
Doesn’t this look a lot like our MSE function?
Aha! The MSE function indeed has a quadratic shape. This will allow us to perfectly visualize the process of gradient descent.
Think of a ball rolling down a hill. After the ball reaches the valley, it stops moving. Gradient descent is all about finding this sweet spot in the valley, which gives us the minimum MSE value. But what if we couldn’t see the graph due to high dimensionality? Then all you have to consider is the sign of the gradient.
- If the gradient is positive at the point where the ball lies, then it must move left to get closer to a valley
- If the gradient is negative, then the ball must move right to get closer to the valley
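The two rules above can be checked on any bowl-shaped curve. The sketch below uses a hypothetical quadratic, f(x) = (x - 3)², chosen only because its valley sits at a known place (x = 3); it is not the article’s MSE curve.

```python
def f(x):
    """Hypothetical bowl-shaped curve with its valley at x = 3."""
    return (x - 3) ** 2


def slope(x):
    """Derivative of f: d/dx (x - 3)^2 = 2 * (x - 3)."""
    return 2 * (x - 3)


def direction(x):
    """Which way the ball should roll, judged only by the sign of the slope."""
    g = slope(x)
    if g > 0:
        return "left"
    if g < 0:
        return "right"
    return "stay"


for x in (5, 1, 3):
    print(f"x={x}: slope={slope(x)}, move {direction(x)}")
```

Notice that the code never looks at the graph itself. The sign of the slope at the current point is all it needs, which is what makes the idea work in higher dimensions.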
Do you remember how we find the slope of a non-linear graph?
For those who are not familiar with derivatives yet, all you need to know for right now is that derivatives allow us to find the slope of a non-linear graph at any instantaneous point. After finding the derivative, we will check for where the slope is zero and that will signify that we reached a local max/min. In our case we are searching for a local minimum in our MSE function since we are trying to minimize our error.
You might be asking, why can’t we directly find the global minimum through the first derivative test? What if we get stuck in a valley that is not the global minimum? Why move inch by inch through gradient descent?
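For a single weight and bias, we actually could: the MSE of a linear model is a bowl-shaped (convex) function of w and b, so it has only one valley, and setting both partial derivatives to zero gives the minimum directly. Gradient descent earns its place because the same inch-by-inch procedure still works when a model has many weights or a loss function with no neat closed-form solution. In those settings there can be several valleys, and a good local minimum often does the job well.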
To find a local minimum, we must first differentiate our MSE function with respect to the weight and the bias.
Differentiate with respect to w:
Hint — use the chain rule and constant rule.
Next, apply the same process to differentiate with respect to b:
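Carrying out both differentiations, and assuming the MSE is written with the residual in the order Y minus ŷ and with ŷ = wx + b (the index i over data points is added notation), the steps work out as:

```latex
\begin{aligned}
\mathrm{MSE} &= \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{y}_i \right)^2,
  \qquad \hat{y}_i = w x_i + b \\
\frac{\partial\, \mathrm{MSE}}{\partial w}
  &= \frac{1}{n} \sum_{i=1}^{n} 2 \left( Y_i - (w x_i + b) \right) \cdot (-x_i)
   = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( Y_i - \hat{y}_i \right) \\
\frac{\partial\, \mathrm{MSE}}{\partial b}
  &= \frac{1}{n} \sum_{i=1}^{n} 2 \left( Y_i - (w x_i + b) \right) \cdot (-1)
   = -\frac{2}{n} \sum_{i=1}^{n} \left( Y_i - \hat{y}_i \right)
\end{aligned}
```

The factor of 2 comes from the power rule on the square, and the factors of -x and -1 come from the chain rule applied to the inner term Y - (wx + b).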
Great, now we can compute the gradients by plugging in the values for Y and ŷ!
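Plugging in values is easy to do in code. The sketch below assumes the standard MSE gradients, namely -(2/n) Σ x(Y - ŷ) for the weight and -(2/n) Σ (Y - ŷ) for the bias, and again uses made-up data rather than Toby’s dataset.

```python
def gradients(xs, ys, w, b):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    # Residual for each point: actual value minus predicted value.
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_w = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))
    grad_b = (-2 / n) * sum(residuals)
    return grad_w, grad_b


# Made-up (years of experience, salary) pairs; NOT Toby's dataset.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salaries = [40000.0, 50000.0, 60000.0, 70000.0, 80000.0]

grad_w, grad_b = gradients(years, salaries, w=3, b=4)
print(grad_w, grad_b)  # both negative: the line sits far below the data
```

Both gradients come out negative here because the starting line predicts salaries that are far too low, so increasing either w or b would reduce the error.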
5. Update the Weight and Bias (Still GD)
After finding the gradients, we will use the following two equations to update our weight and bias:
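Written with the symbols used in this article (w, b, and the learning rate L), the standard gradient descent update rules are:

```latex
\begin{aligned}
w_{\text{new}} &= w - L \cdot \frac{\partial\, \mathrm{MSE}}{\partial w} \\
b_{\text{new}} &= b - L \cdot \frac{\partial\, \mathrm{MSE}}{\partial b}
\end{aligned}
```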
1. Why do we subtract?
Because subtracting moves the weight and bias in the direction of the local minimum.
- Remember, if the gradient is (+), we go left (-)
- If the gradient is (-), we go right (+)
2. L refers to the learning rate.
To understand the learning rate, let’s analyze a man climbing down a mountain.
When the slope is steep, the man takes bigger steps to reach the bottom faster. However, as he gets closer to the valley, he decreases his step size to ensure that he doesn’t overstep and end up on the other side of the mountain.
What we basically observed is that the size of the man’s steps is proportional to the steepness of the mountain.
This part is key for understanding the learning rate.
Usually the learning rate is set to a small number (e.g. 0.001) to ensure that the regression model doesn’t overstep and create more error when adjusting the weights and biases. And since L is multiplied by the gradients, the adjustments to the weight and bias behave much like the man climbing down the mountain: large when the slope is steep, small when it is nearly flat.
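To see both behaviours, here is a small sketch that runs gradient descent on a hypothetical curve, f(x) = x², whose valley is at x = 0. The two learning rates (0.1 and 1.1) are assumptions picked to contrast a sensible step size with one that is too large.

```python
def descend(start, learning_rate, steps):
    """Run gradient descent on f(x) = x^2 and return every visited x."""
    x = start
    path = [x]
    for _ in range(steps):
        gradient = 2 * x                   # derivative of x^2
        x = x - learning_rate * gradient   # step against the slope
        path.append(x)
    return path


small = descend(start=10.0, learning_rate=0.1, steps=5)
large = descend(start=10.0, learning_rate=1.1, steps=5)

print("small learning rate:", small)  # steps shrink as the slope flattens
print("large learning rate:", large)  # jumps across the valley and moves away
```

With the small learning rate, each step is shorter than the last because the gradient itself shrinks near the valley. With the large one, every step lands on the opposite side of the valley and farther from it than before.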
As you can see, because the gradients for both w and b were negative, our adjusted weight and bias are greater than the original ones (shifting right). This is a clear indicator that our model is headed in the right direction.
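The same effect can be reproduced on toy numbers. This sketch performs a single update step, assuming the standard MSE gradients, a learning rate of 0.001, and made-up data in place of Toby’s dataset; the helper name `update` is hypothetical.

```python
def gradients(xs, ys, w, b):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_w = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))
    grad_b = (-2 / n) * sum(residuals)
    return grad_w, grad_b


def update(w, b, grad_w, grad_b, learning_rate):
    """One gradient descent step: move against the gradient."""
    return w - learning_rate * grad_w, b - learning_rate * grad_b


# Made-up (years of experience, salary) pairs; NOT Toby's dataset.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salaries = [40000.0, 50000.0, 60000.0, 70000.0, 80000.0]

w, b = 3.0, 4.0
grad_w, grad_b = gradients(years, salaries, w, b)
new_w, new_b = update(w, b, grad_w, grad_b, learning_rate=0.001)

print("gradients:", grad_w, grad_b)
print("before:", w, b)
print("after: ", new_w, new_b)
```

Subtracting a negative gradient adds to the parameter, so both the weight and the bias grow after the step.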
This process was one iteration of gradient descent. However, we are not done yet. Our model only improved slightly, and it’s still garbage. We want to make Toby proud, not more depressed. There is one more thing we must do.
6. Perform Multiple Epochs
In machine learning, an epoch is one full pass through the whole dataset: calculating the loss function and performing gradient descent.
Since gradient descent works in small stages, we will need multiple epochs to start seeing noticeable improvement in our linear regression model.
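Putting the pieces together, a training loop is just the update step repeated once per epoch. The sketch below is one possible arrangement under stated assumptions: made-up data, a learning rate of 0.05 that suits this toy data only, and a hypothetical function name `train`. A real dataset would likely need a different learning rate.

```python
def mse(xs, ys, w, b):
    """Mean squared error of the line (w, b)."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(xs, ys, w, b):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_w = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))
    grad_b = (-2 / n) * sum(residuals)
    return grad_w, grad_b


def train(xs, ys, w, b, learning_rate, epochs):
    """Repeat the gradient descent update once per epoch."""
    history = []
    for _ in range(epochs):
        grad_w, grad_b = gradients(xs, ys, w, b)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append(mse(xs, ys, w, b))  # error after this epoch
    return w, b, history


# Made-up data lying exactly on salary = 10000 * years + 30000.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salaries = [40000.0, 50000.0, 60000.0, 70000.0, 80000.0]

w, b, history = train(years, salaries, w=3.0, b=4.0,
                      learning_rate=0.05, epochs=300)
print("first epoch error:", history[0])
print("last epoch error: ", history[-1])
print("learned w, b:", w, b)
```

Keeping the error after every epoch in `history` makes it easy to confirm that the model keeps improving and to see when the improvement starts to level off.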
Let’s apply gradient descent 10 more times!
Nice! Looking better. How about 15 more times?
Wow! OK, let’s bump it up to 300 epochs. (By the way, don’t try this by hand. Use code.)
Jackpot! This looks like the line of best fit.
To ensure we have the least error, let’s use our MSE function again.
Our error decreased from the order of 10⁹ to 3,127,096. That is a decrease of more than 99%! Even though this value might still look large, keep in mind the scale of the salary axis and the overall spread of the points. If the actual points were packed even closer to each other, we would receive an even smaller error.
You might ask, what if we keep increasing the number of epochs? This is possible, but after a certain point the decrease in error becomes extremely small and insignificant. In more complex models, training for too long can also lead to overfitting, and a learning rate that is too large can even increase the error (overstepping). For our model, the error we received is completely acceptable. Our model has reached a local minimum.
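One common way to act on this is to stop training once an epoch improves the error by less than some small amount. The sketch below illustrates that idea and is not the procedure used for the plots in this article: the function name `train_until_flat`, the tolerance of 1.0, the learning rate, and the data are all assumptions.

```python
def mse(xs, ys, w, b):
    """Mean squared error of the line (w, b)."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(xs, ys, w, b):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_w = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))
    grad_b = (-2 / n) * sum(residuals)
    return grad_w, grad_b


def train_until_flat(xs, ys, w, b, learning_rate, tolerance, max_epochs):
    """Stop as soon as one epoch lowers the error by less than `tolerance`."""
    previous = mse(xs, ys, w, b)
    for epoch in range(1, max_epochs + 1):
        grad_w, grad_b = gradients(xs, ys, w, b)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        current = mse(xs, ys, w, b)
        if previous - current < tolerance:
            return w, b, epoch  # further epochs would barely help
        previous = current
    return w, b, max_epochs


# Made-up data; NOT Toby's dataset.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salaries = [40000.0, 50000.0, 60000.0, 70000.0, 80000.0]

w, b, epochs_used = train_until_flat(years, salaries, w=3.0, b=4.0,
                                     learning_rate=0.05, tolerance=1.0,
                                     max_epochs=10000)
print("stopped after", epochs_used, "epochs with error", mse(years, salaries, w, b))
```

The loop ends well before the epoch limit because each extra epoch eventually buys almost no reduction in error.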
Here we visually observe the process of running multiple epochs:
7. Choose the Optimum Weight and Bias
After running multiple epochs and reaching a low error value, we can be reasonably confident that the current weight and bias give us the line of best fit for our model. Let’s find out what they are!
- w = 9449.9623
- b = 25792.2002
Therefore, the final equation for our model is ŷ = 9449.9623x + 25792.2002, where x is the years of experience.
There you have it! Toby can now use this linear regression model to accurately predict the salary of a prospective employee based on their years of work experience.
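As a quick illustration, the learned weight and bias listed above can be wrapped in a small function. The function name `predict_salary` and the sample inputs are hypothetical; only the two constants come from the article.

```python
W = 9449.9623    # learned weight from the article (salary per year of experience)
B = 25792.2002   # learned bias from the article (predicted salary at zero years)


def predict_salary(years_of_experience):
    """Predict a salary from years of experience using the fitted line."""
    return W * years_of_experience + B


# Hypothetical candidates with 0, 5, and 10 years of experience.
for years in (0, 5, 10):
    print(years, "years ->", round(predict_salary(years), 2))
```

The bias is the predicted starting salary with no experience, and the weight is how much the prediction rises with each additional year.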
Toby will be extremely proud!
- Linear Regression is all about finding the equation of a line that will best fit the given linearly shaped dataset
- We start with random values for our weight and bias, use the loss function to measure the error, and use gradient descent to reduce the error
- Linear Regression helps us understand two of the most fundamental concepts for the majority of algorithms in machine/deep learning: the loss function and gradient descent
- Machine learning isn’t a miracle after all. It’s a series of complicated math equations followed by finding the local minimum using some calculus
- Understanding the intuitive process of ML models allows us to use logic to arrive at solutions instead of just memorizing a bunch of code