Linear Regression — Everything you need to know

Rohit Modi · Published in Analytics Vidhya · 9 min read · Aug 11, 2021

Whenever we talk about Linear Regression, we talk about finding the best fit line for the data. That is exactly the objective of Linear Regression, but there is more to it than just fitting the line, so let's talk about why and how we find this best fit line.

Why do we need to find the best fit line?

As the name suggests, the algorithm works on data that follows a linear trend. If we can find a line that correctly captures this trend, we can use that same line to describe the whole dataset, and even to obtain values at points that are not present in the dataset. This is called prediction.

The next question is: how do we find this best fit line?

Take a look at the dataset below; it clearly follows a linear trend.
Just by looking at the dataset, can we tell what the output will be at any point? Yes, we can. But real-world data is rarely this easy to interpret. In those cases, we rely on the underlying mathematics of the algorithm to find a line that captures the trend in the data and adapts to it.

Figure 1

Let's take the data shown in Figure 1. There is a linear relationship between X and Y, and if the relationship is linear, there must be a line that defines it. As we all know, the equation of a line is Y = M*X + C. We can easily map this equation onto Figure 1: X is the input data and Y is the target variable (the value we have to find). But what are M and C? These are the parameters that define the line: M is the slope, i.e., the inclination of the line with respect to the X-axis, and C is the intercept, i.e., the point at which the line crosses the Y-axis.

Our objective is to find the line that best describes the data. We use the data on hand to find the trend, generalize this trend with a line, and then use that line to make predictions for unseen values of X.

Let's understand how the algorithm starts.
The first thing the algorithm does is generate a random line, i.e., pick some random values for M and C and use them to create a line.

Now we measure how well the line defines, or fits, the data.

It turns out that a random line will perform pretty badly in describing the relationship between X and Y. But how badly, or how well, does the line perform? To measure this we need a numerical value that tells us how good or bad the fit is.

Let's say that, for some value of X, we measure how far the line is from the actual data point; simply put, we calculate the distance between them. If the distance is small, we can say that the line defines the data pretty well.

But in what terms do we measure this distance? Our objective is to predict a value of Y for some value of X, so let's use exactly that: take the Y coordinate of the line and the Y coordinate of the data point at the same X, and find the distance between the two. Based on this logic, we can write the equation as follows:
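
Error = Y - Yhat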

Here Yhat is the point on the line and Y is the actual data point for some value of X. This quantity is termed the Error.

But this describes just one data point; the line might be very far from some other data point. So let's compute the same error for all the data points, sum up the results, and take the mean. This gives us a single number describing the combined error of the system, leading to the equation below, where n is the number of data points:
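
Loss = (1/n) * Σ (Y - Yhat)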

Now, since the objective is to fit the line to the data, the line will pass through the data, and many of the data points will lie below it. These points give a negative value for the (Y - Yhat) calculation, so when we sum up all the positive and negative values, there is a chance that the result will be very close to 0. Wait, what! That would mean there is very little error and the line is pretty good, which is totally not the case. We need to handle this.

Let's do one thing: before adding up the individual error values, let's square them. Our equation then becomes:
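
Loss = (1/n) * Σ (Y - Yhat)²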

Now we will not get any negative values. But why did we take the square and not just the modulus (absolute value) of the error? Because squaring produces a much larger error term for data points that are very far from the line, adding a bigger penalty when the line is far from a point.

So, we now have a term that describes the combined error of the system. We can call this term a Loss Function; this particular one is known as the Mean Squared Error.
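
To make this concrete, here is a minimal sketch of the loss computation in Python using NumPy; the function name mse_loss and the sample data and parameter values are purely illustrative:

```python
import numpy as np

def mse_loss(x, y, m, c):
    """Mean squared error of the line Yhat = M*X + C against the data."""
    y_hat = m * x + c                  # point on the line for each X
    return np.mean((y - y_hat) ** 2)   # mean of the squared errors

# Hypothetical data that roughly follows Y = 2X
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(mse_loss(x, y, m=2.0, c=0.0))   # small value: this line fits well
```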

Now, it is very clear that if the line fits the data properly, the error term will be very small. So if we find a way to decrease the error, we can find the perfect line. That is exactly our next objective.

But how does the error term relate to the line? The line depends entirely on M and C: if we increase or decrease M or C, the line changes. If that is the case, there must exist some values of M and C for which we get the line with the lowest error. But how do we know which values of M and C are optimal? To answer this, let us plot the loss function with respect to M.

Since the loss is a squared function of M, the plot is a parabola, and we can clearly see that at its lowest point, point A, the error will be minimum for some value of M. So we need some way to find this particular value of M. But how?

Here Differentiation comes into the picture. Our loss is a curve of degree 2, and differentiating it at any point gives us the slope of the tangent to the curve at that point. Using this slope we can tell whether the curve is increasing or decreasing there: if the tangent has a negative slope, the curve is going downwards at that point, and if the tangent has a positive slope, the curve is going upwards.

Our aim is to move towards the minimum point of this curve, which will eventually lead us to the optimal value of M. In other words, if we change the value of M appropriately, based on where we are on the curve, we will gradually move towards this point of minimum error.

Great, we have found the relationship between the error and the slope M.

So let's find the partial derivative of the error equation with respect to M. Before that, let's write Yhat in terms of M and C:
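
Yhat = M*X + C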

Substituting this value of Yhat into the Loss function gives:
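
Loss = (1/n) * Σ (Y - (M*X + C))²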

If we take the partial derivative of the above equation with respect to M, we get:
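
∂Loss/∂M = (1/n) * Σ 2 * (Y - (M*X + C)) * (-X)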

Which can be re-written as:
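
∂Loss/∂M = (-2/n) * Σ X * (Y - Yhat)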

Tada!! We found the derivative. Now let's use this derivative value to increase or decrease the value of M.

We know that a negative derivative (tangent slope) means the curve is going downwards at that point, so the minimum lies to our right, while a positive derivative means the curve is going upwards, so the minimum lies to our left. This means that if we have generated a line using some value of M and the derivative of the loss at that point is negative, we need to increase M so as to move towards the right and downwards on the curve. Similarly, if the derivative of the loss is positive, we need to decrease M so as to move towards the left and downwards on the curve. In both cases, M moves in the direction opposite to the sign of the derivative, so subtracting the derivative does exactly the right thing. Keeping this in mind, we can write the following equation to update the value of M:
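
M = M - Alpha * ∂Loss/∂M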

Here you can see an extra term, Alpha. Why do we need it? It turns out that the derivative value can be considerably large, so if we subtract it from (or add it to) M directly, we might get a value of M that is again far away from the optimal one; the change in M would be too big. Instead, let's change the value of M very slowly, restricting rapid changes by multiplying the derivative with some very small value, Alpha.

The Alpha here is referred to as the Learning Rate.

Similarly, we will find the update for C as well; there is only a slight change in the derivative equation, as shown below:
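
∂Loss/∂C = (-2/n) * Σ (Y - Yhat)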

The value of C is updated in the same way we updated the value of M, i.e., using the equation below:
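
C = C - Alpha * ∂Loss/∂C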

This process of finding the derivative of the error function and changing the values of the parameters based on it is called Gradient Descent.

If we repeat the above process several times, we slowly change the values of M and C and eventually obtain the optimal values. The line created using these values has the lowest error and is called the best fit line. Now we can use this line to make predictions.

Best Fit Line

Great, we now know how the linear regression algorithm works.

To summarize the algorithm:
1. Randomly initialize M and C.
2. Create a random line using these values of M and C.
3. Calculate the Error (Loss) for the line.
4. Compute the derivative of the error term w.r.t. M and C and subtract it from M and C respectively, using Alpha (the learning rate) to control the size of the change.
5. Create a new line using new values of M and C.
6. Repeat steps 3 through 5 until you get a minimum error.

Check out the code to implement the Linear Regression Algorithm in Python.
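
Below is a minimal sketch of such an implementation using NumPy, following the six steps above. The sample data, the learning rate of 0.01, and the iteration count are illustrative assumptions, not values from the original article.

```python
import numpy as np

def linear_regression(x, y, alpha=0.01, iterations=1000):
    """Fit the line Y = M*X + C to the data using gradient descent."""
    m, c = np.random.randn(), np.random.randn()   # step 1: random M and C
    n = len(x)
    for _ in range(iterations):                   # step 6: repeat many times
        y_hat = m * x + c                         # steps 2 and 5: line for current M and C
        dm = (-2 / n) * np.sum(x * (y - y_hat))   # derivative of the Loss w.r.t. M
        dc = (-2 / n) * np.sum(y - y_hat)         # derivative of the Loss w.r.t. C
        m -= alpha * dm                           # step 4: update M ...
        c -= alpha * dc                           # ... and C, scaled by the learning rate
    return m, c

# Hypothetical data that roughly follows Y = 2X + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
m, c = linear_regression(x, y)
print(f"M = {m:.2f}, C = {c:.2f}")   # should end up close to M = 2, C = 1
```

Note that step 3 (computing the loss) is not strictly needed to perform the updates; in practice you would track it each iteration, for example with the mse_loss function shown earlier, to monitor how the error decreases.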
