The “Best” Linear Regression Model

Talha Saygili
Published in CITS Tech
Jun 11, 2021 · 9 min read


Linear Regression is one of the most widely used predictive analysis methods. It has made a name for itself both because it is simple and because it can be applied easily in many fields. The main purpose of this study is a numerical journey for those who have touched on linear regression before and want to go over its important points. The journey makes the necessary stops for building a Linear Regression Model and, most importantly, it reveals what “best” means, for whom, and for what.

Photo by Clemens van Lay on Unsplash

What is Linear Regression?

Linear Regression, which we will examine under the umbrella of Machine Learning, is a type of predictive analysis used to model the relationship between variables. The main purpose of this structure is to obtain a line that represents the actual data through a function, and to use this line to make predictions about new data.

So what does it mean to represent the data? And what does “best” mean in the line that will “best” represent the data points?

GIF by primo.ai

What is the “Best” ?

Suppose we have data points with x and y values. Let’s examine some sample models by drawing different lines through this dataset, one after another. At what stage do you think the data is represented “best”, or close to “best”?

It’s time to dismantle the relativism of “best”. When choosing an effective linear regression model, the secret of the model’s fit to the data is to pick the model with the least error, which we call the “cost”.

Cost is obtained by summing the squares of the distances between the linear model and the available data points. In a way, it is a numerical measure of how well our model fits the data. Therefore, the farther a line is from the available data points, the greater the cost value. We calculate this total, called the “sum of squared residuals”, for different candidate functions and choose the one with the smallest value.
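As a minimal sketch of this idea in Python (the data points and candidate lines below are made up for illustration), the cost of a line y = ax + b is just the sum of squared differences between the line’s predictions and the observed y values:

```python
import numpy as np

# Toy data points (made-up values for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

def cost(a, b):
    """Sum of squared residuals of the line y = a*x + b."""
    residuals = y - (a * x + b)
    return np.sum(residuals ** 2)

# Compare a few candidate lines; the smallest cost fits the data best.
for a, b in [(0.5, 1.0), (1.0, 0.0), (1.5, -0.5)]:
    print(f"a={a}, b={b} -> cost={cost(a, b):.3f}")
```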

Creating the “Best” Model

Let’s review some terms over y = ax + b, the general form of the line.

In this structure, y is the dependent variable, x is the independent variable, a is the coefficient of x, and b is the intercept.

Put another way, ax + b is the prediction function, the model. The sum of the squares of the differences between the predicted results and the actual results gives the sum of squared residuals.

Now that we know the main purpose, let’s move toward the goal with some general concepts.

Photo by Andrew Ng

The linear regression model to be obtained is also called the Hypothesis, written hθ(x) = θ0 + θ1·x. The coefficients required in the model are called parameters. By comparing the created model with the current data points, we obtain the cost function value J(θ0, θ1) = (1/2m) · Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², where m is the number of available data points. In the selection of the “best” model, the aim is to reach the hypothesis function formed by the parameters that minimize this cost function value.

Initially, θ0 and θ1 are assigned random numbers and the cost function value is calculated. Then these numbers are changed, the new cost function value is found, and the iteration continues. In this way, the minimum point on the resulting cost function graph is reached.

As can be seen, we computed new values over different models and graphed these values as J(θ). The x-axis value where J(θ) is minimal tells us what θ1 should be.
Although this structure is easy to observe in models that need a single parameter, it becomes difficult, almost impossible, to observe when many parameters are needed. But if we grasp the logic here and do the same in multidimensional structures without visualizing, it is still possible to find the optimal parameter values.
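Here is a minimal sketch of that search, again with made-up data: we fix θ0 at 0, sweep θ1 over a grid, and read off the value where J(θ) is smallest:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
m = len(x)

def J(theta0, theta1):
    """Cost function: half the mean of squared errors."""
    errors = (theta0 + theta1 * x) - y
    return np.sum(errors ** 2) / (2 * m)

# Sweep theta1 and pick the value that minimizes J(theta).
theta1_grid = np.linspace(-1.0, 3.0, 401)
costs = [J(0.0, t1) for t1 in theta1_grid]
best_theta1 = theta1_grid[int(np.argmin(costs))]
print(f"theta1 that minimizes J on this grid: {best_theta1:.2f}")
```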

The Secret Hero: Gradient Descent

GIF by 3Blue1Brown

Gradient Descent is, in a way, an optimization algorithm. It uses the derivative to find the minimum point: it updates the parameter values by stepping in the positive or negative direction according to the slope. Depending on the initial values, it can end up in different local minima.

No matter how many parameters we need, this iteration, in which all parameter values are updated simultaneously, aims to find the parameter values that lead to the minimum cost function value. The iteration is complete when the derivative is zero or close to zero.
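A minimal sketch of gradient descent for our one-feature line, with the same made-up data as above (the learning rate α and step count are arbitrary choices):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
m = len(x)

alpha = 0.05                 # learning rate (arbitrary choice)
theta0, theta1 = 0.0, 0.0    # initial parameter values

for _ in range(2000):
    error = (theta0 + theta1 * x) - y
    # Partial derivatives of J with respect to each parameter.
    grad0 = np.sum(error) / m
    grad1 = np.sum(error * x) / m
    # Simultaneous update: both gradients use the old parameter values.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")
```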

Photo by Andrew Ng

To sum it up: we have data points, and we need the cost function results in order to see how well the parameters fit the data. Among these results, we want the parameter values with the least cost. While this is comfortable to see visually and logically with one parameter, visualization becomes difficult in multi-parameter, that is, multidimensional model representations. So in multidimensional structures, we update the parameters iteratively until we find the minimum point, making use of the derivative at each step of the loop.

Let’s look at a dataset example where we need more than one parameter and observe some terms.

Our hypothesis function will be more crowded: hθ(x) = θ0 + θ1·x1 + θ2·x2 + … + θn·xn.
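In code, the crowded hypothesis stays compact if we stack the features into a matrix (the feature values and parameters below are made up); a single matrix-vector product then predicts all examples at once:

```python
import numpy as np

# Each row is one example: a leading 1 (for theta0) plus two features.
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0]])
theta = np.array([50.0, 0.1, 20.0])  # [theta0, theta1, theta2]

# Vectorized hypothesis: h(x) = theta0 + theta1*x1 + theta2*x2 per row.
predictions = X @ theta
print(predictions)
```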

Untie the Knot with Feature Scaling

This is one of the important steps for gradient descent in terms of time and optimization. The feature values are converted to similar ranges, which speeds up the gradient descent steps. There are different approaches to scaling.

First, scaling can be done by dividing each feature by its maximum value. Placing values between −1 and +1 is also an option, but the most common choice is to place them between 0 and 1. Another scaling method is mean normalization, in the form (value − mean) / (max − min). All three are sketched below.
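A quick sketch of these options on a single made-up feature:

```python
import numpy as np

feature = np.array([852.0, 1416.0, 1534.0, 2104.0])  # made-up values

# Divide by the maximum: results fall in (0, 1].
by_max = feature / feature.max()

# Min-max scaling: results fall exactly in [0, 1].
min_max = (feature - feature.min()) / (feature.max() - feature.min())

# Mean normalization: results are roughly centered on 0.
mean_norm = (feature - feature.mean()) / (feature.max() - feature.min())

print(by_max, min_max, mean_norm, sep="\n")
```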

Jump with Learning Rate (α)

This is the α value in the Gradient Descent algorithm, and it is important for the speed of the iteration. If it is too large, the steps overshoot and the cost can bounce around or even diverge.
If it is too small, the desired gain may not be achieved in a reasonable time. Smaller values are worth testing as long as the J(θ) value is observed to decrease at each iteration.
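The effect is easy to see in a small experiment (same made-up data; the three α values are arbitrary): a tiny α barely makes progress, a moderate one converges, and a large one makes the cost explode:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
m = len(x)

def cost_after(alpha, steps=50):
    """Run gradient descent and report J after a fixed number of steps."""
    t0 = t1 = 0.0
    for _ in range(steps):
        err = (t0 + t1 * x) - y
        t0, t1 = t0 - alpha * err.sum() / m, t1 - alpha * (err * x).sum() / m
    return np.sum(((t0 + t1 * x) - y) ** 2) / (2 * m)

for alpha in (0.001, 0.05, 0.3):
    print(f"alpha={alpha}: J after 50 steps = {cost_after(alpha):.3g}")
```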

Magic Hat: Feature Extraction

Deriving new features from existing features can also contribute to model building. Here, the chosen model should both avoid overfitting and minimize the cost function value.
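A minimal sketch: from a single made-up feature x, we can derive polynomial terms as new features and feed them into the same linear machinery:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # single original feature

# Derive new features: the squares and cubes of the original one.
X = np.column_stack([x, x ** 2, x ** 3])
print(X)
```

Note that the derived columns have very different ranges (1 to 5 versus 1 to 125), which is exactly where the feature scaling above pays off.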

Model Evaluation

How successful is our model? What are our success criteria?

Let’s continue with the assumption that we have found the parameters and have a model. At this point, let’s introduce the terms R², and the p-value for R², which reveal the performance of the model. Then we will explain the values these terms should take.

The R² formula is R² = (SS(mean) − SS(fit)) / SS(mean). Let’s take a step-by-step look at how to find the terms in it.

The mean of the data points’ y values is the mean-y value. Summing the squared distances between each data point’s y value and this mean gives us SS(mean).

We apply the same logic to the fitted line we have created: summing the squared distances between each data point and the line gives us SS(fit).

The R² value is the power of our features to explain the dependent variable. It indicates what percentage of the variance of the dependent variable the model can explain.
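Putting the pieces together in a short sketch (same made-up data; numpy’s polyfit stands in for the gradient descent fit):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Least-squares fit line (degree-1 polynomial).
a, b = np.polyfit(x, y, 1)
fit = a * x + b

ss_mean = np.sum((y - y.mean()) ** 2)  # distances to the mean-y line
ss_fit = np.sum((y - fit) ** 2)        # distances to the fit line
r_squared = (ss_mean - ss_fit) / ss_mean
print(f"R^2 = {r_squared:.4f}")
```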

In addition, it is necessary to calculate the p-value for R² in order to check whether our model’s fit is simply a matter of luck. We reach this value through the F value.

The F value is F = ((SS(mean) − SS(fit)) / (p_fit − p_mean)) / (SS(fit) / (n − p_fit)). We already know how the first part is calculated. The terms in the second part are:

p_fit is the number of parameters in the fit line,

p_mean is the number of parameters in the mean line,

and n, the number of items in our dataset, sheds light on the last unknown term in the formula.

If our model is good enough, R² and the F value should be large, and the p-value should be small. So how do we get from the F value to the p-value? In short, through a distribution: the p-value is the area under the F-distribution beyond our F value. That’s it!
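A sketch of the full calculation, with scipy’s F-distribution doing the area-under-the-curve work (again on the made-up data):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
n = len(y)

a, b = np.polyfit(x, y, 1)
fit = a * x + b
ss_mean = np.sum((y - y.mean()) ** 2)
ss_fit = np.sum((y - fit) ** 2)

p_fit, p_mean = 2, 1  # parameters in the fit line vs. the mean line
F = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# p-value: area under the F-distribution to the right of our F value.
p_value = stats.f.sf(F, p_fit - p_mean, n - p_fit)
print(f"F = {F:.2f}, p-value = {p_value:.4f}")
```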


Behind every effective model is a gradient descent...

A cup of coffee behind every productive day…


Summary

We are approaching the last stop of our journey. Looking back at the stops we passed: we began with the definition of the Linear Regression Model, which relates variables and produces fast, practical solutions in many areas.

Then we examined the solution steps of this structure. In this section, we accepted that the relativity of “best” ends in mathematics: the model with the least cost gives us the “best” solution.

While looking closely at the concept of “cost”, we came across gradient descent, the brain of the model-building process. We discovered that this structure, which reaches the optimum through derivatives, proceeds in a loop where the parameters are updated simultaneously.

To make gradient descent progress more effectively, we touched on concepts such as feature scaling and the learning rate, and underlined their purposes. Once our model was formed, we showed how well it fits the data. In this part, which serves as a kind of success criterion, we completed the R² and F value calculations. Finally, we learned to interpret whether the model is usable by addressing the values a successful model should have.

I would like to express my gratitude to StatQuest with Josh Starmer and Andrew Ng, who inspired me to prepare this study.


It will be my pleasure to receive your feedback.

See you on our next journey…

Enjoy your coffee!
