Linear Regression and Fitting a Line to Data

Asitdubey
Published in Analytics Vidhya · 14 min read · Aug 17, 2020

Linear Regression is a supervised machine learning algorithm that predicts continuous output values. In linear regression we generally follow three steps to predict the output:

1. Use least squares to fit a line to the data

2. Calculate R-Squared

3. Calculate p-value

Fitting a line to the data

Many lines can be fitted to the data, but we have to pick the one with the least error.

(Plot from StatQuest)

Let's say the bold line ('b') represents the average Y value; the distance between b and each datapoint is known as a residual.

(b-Y1) is the distance between b and the first datapoint. Similarly, (b-Y2) and (b-Y3) are the distances to the second and third datapoints, and so on.

Note: some of the datapoints lie below b and some above, so on adding the residuals they cancel each other out; therefore we take the sum of the squared residuals.

SSR = (b-Y1)² + (b-Y2)² + (b-Y3)² + … + (b-Yn)², where n is the number of datapoints.

The line with the smallest SSR is considered the best fit line. To find this best fit line we need the equation of a straight line:

Y = mX+c

where m is the slope and c is the intercept on the y-axis. The values of m and c should be chosen so that SSR is as small as possible.

SSR = ((mX1+c)-Y1)² + ((mX2+c)-Y2)² + … + ((mXn+c)-Yn)²

where Y1, Y2, …, Yn are the observed/actual values, and

(mX1+c), (mX2+c), … are the values on the line, i.e. the predicted values.

Since we want the line that gives the smallest SSR, this method of finding the optimal values of m and c is called least squares.

(Plot from StatQuest)

This is the plot of SSR versus rotation of the line. SSR goes down as we start rotating the line, and after a minimum it starts increasing on further rotation. The rotation for which SSR is minimal gives the best fitted line. We can use differentiation to find this line: taking the derivative of SSR gives the slope of the function at every point, and where that slope is zero, the model selects that line.
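Setting those derivatives to zero gives closed-form formulas for m and c. Here is a minimal sketch in Python with made-up datapoints (the X and Y values are hypothetical, chosen only for illustration); NumPy's `polyfit` solves the same minimization:

```python
import numpy as np

# Hypothetical datapoints
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least squares: setting the derivatives of SSR
# with respect to m and c to zero yields these formulas.
m = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
c = Y.mean() - m * X.mean()

# SSR for the fitted line
ssr = np.sum((m * X + c - Y) ** 2)

# NumPy's built-in line fitter solves the same problem
m_np, c_np = np.polyfit(X, Y, 1)
print(m, c, ssr)
```

Both approaches give the same slope and intercept; any other (m, c) pair would produce a larger SSR.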

R-Squared

R-Squared is a goodness-of-fit measure for the linear regression model. It tells us the percentage of variance in the dependent variable explained by the independent variables; it measures the strength of the relationship between our model and the dependent variable on a 0 to 100% scale. If the R-Squared of a model is 0.5, then half of the observed variation can be explained by the model's inputs. R-Squared ranges from 0 to 1, or 0 to 100%: the higher the R², the more of the variation is explained by the independent variables.

R² = Variance explained by model / Total variance

The R² for the model on the left is much lower than that of the one on the right.

But it has its limitations:

· R² tells us the variance in the dependent variable explained by the independent ones, but it does not tell whether the model is good or bad, nor whether the data and predictions are biased. A high R² value doesn't mean the model is good, and a low R² value doesn't mean it is bad. Some fields of study have an inherently greater amount of unexplained variation, and in those areas R² is bound to be lower; e.g., a study that tries to predict human behavior generally has a lower R² value.

· If we keep adding independent variables to our model, R² tends to increase. In house cost prediction, for example, the number of doors and windows are unnecessary variables that don't contribute much to the cost, yet they can still increase the R² value. R-Squared has no way to express the effect of a bad or insignificant independent variable on the regression, so even if the model includes an irrelevant variable, say a person's name when predicting salary, R² will increase, suggesting the model is better. Multiple linear regression thus tempts us to add more variables in return for a higher R², and this causes overfitting of the model.

Due to these limitations we use Adjusted R-Squared or Predicted R-Squared.

Calculation of R-Squared

Project all the datapoints onto the Y-axis and calculate the mean value. Just like SSR, the sum of squared distances between each datapoint's Y value and the Y mean is known as SS(mean).

Note: (I am not trying to explain it with the full mathematical formulas; Wikipedia and many other places give the mathematical approach. This is the intuitive way, and the easiest way I understood it, from StatQuest. Before following the mathematical approach, we should know the concept behind it.)

SS(mean) = Σ (Y-actual − Y-mean)²

Var(mean) = SS(mean) / n, where n is the number of datapoints.

The sum of squares around the best fit line is known as SS(fit).

SS(fit) = Σ (Y-actual − Y-predict)², where Y-predict is the point on the fit line.

Var(fit) = SS(fit) / n

R² = (Var(mean) − Var(fit)) / Var(mean)

R² = (SS(mean) − SS(fit)) / SS(mean)

R² = 1 − SS(fit) / SS(mean)
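The three forms above are equivalent (the n in each variance cancels). A minimal sketch, with hypothetical observed and predicted values chosen only for illustration:

```python
import numpy as np

# Hypothetical observed values and the line's predictions
y_actual = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_pred   = np.array([2.1, 4.06, 6.02, 7.98, 9.94])  # points on the fit line

# SS(mean): squared distances from the mean of the observed values
ss_mean = np.sum((y_actual - y_actual.mean()) ** 2)

# SS(fit): squared distances from the fit line
ss_fit = np.sum((y_actual - y_pred) ** 2)

r2 = 1 - ss_fit / ss_mean
print(r2)
```

Here SS(fit) is far smaller than SS(mean), so R² comes out close to 1: almost all the variation around the mean is explained by the line.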

Mathematical approach:

Here, SS(total) is the same as SS(mean): SST (Total Sum of Squares) is the sum of the squares of the difference between the actual observed values y and the average of the observed values (y-mean).

Here, SSR is the same as SS(fit): SSR (Sum of Squares of Residuals) is the sum of the squares of the difference between the actual observed values y and the predicted values (ŷ).

Adjusted R-Squared:

Adjusted R-Squared adjusts for the number of independent variables in the model. Its value increases only when a new term improves the model fit more than expected by chance alone, and decreases when a term doesn't improve the fit by a sufficient amount. It requires a minimum number of datapoints or observations to generate a valid regression model.

Adjusted R-Squared use Degrees of Freedom in its equation. In statistics, the degrees of freedom (DF) indicate the number of independent values that can vary in an analysis without breaking any constraints.

Suppose you have seven pairs of shoes and wear one pair each day without repeating. On Monday you have 7 different pairs to choose from; on Tuesday the choices decrease to 6; by Sunday you have no choice at all, you are stuck with the last remaining pair. We have no freedom on Sunday. Degrees of freedom, then, is how much an independent value can freely vary while estimating the parameters.

Every time you add an independent variable to a model, R-squared increases, even if the variable is insignificant; it never declines. Adjusted R-squared, in contrast, increases only when the independent variable is significant and affects the dependent variable. It penalizes you for adding independent variables that do not help in predicting the dependent variable.
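The standard formula, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), makes this penalty concrete. In the sketch below the R² values and sample size are hypothetical, picked to show a useless variable nudging R² up while the adjusted value goes down:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical model with 30 observations and 2 useful variables
before = adjusted_r2(0.900, n=30, p=2)

# Adding a near-useless third variable: R² creeps up to 0.901,
# but the degrees-of-freedom penalty pulls the adjusted value down.
after = adjusted_r2(0.901, n=30, p=3)

print(before, after)
```

So while plain R² rewarded the junk variable, adjusted R² correctly reports a worse model.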

For a detailed understanding, watch Krish Naik.

P-Value

Suppose in a 2-D coordinate plane there lie two datapoints, anywhere in the plane. If we draw a line joining them, it will be the best fit line. If we change the positions of those two points and join them again, that line will also be the best fit. No matter where the two datapoints lie, the line joining them is always the best fit and the variance around it is zero, which gives an R² value of 100%. But that doesn't mean those two datapoints are statistically significant, i.e. that they give an exact prediction of the target variable. To find the statistically significant independent variables behind a good R² value, we calculate the P-value.

Big Question: what is a P-value?

We still don't know anything about the P-value. The P-value is like Thanos, and to defeat Thanos we have to deal with the Infinity Stones first. The P-value has its own infinity stones: alpha (α), the F-score, the z-score, the Null Hypothesis, Hypothesis Testing, the T-test and the Z-test. Let's first deal with the F-score.

(Plots from StatQuest)

The fit line represents the variation explained by the extra parameters. The distances between the fit line and the actual datapoints are known as residuals; these residuals are the variation not explained by the extra parameters in the fit.

For different random sets of datapoints (samples) there will be different calculated values of F: for a thousand samples there will be a thousand F values. If we plot all the F values on a histogram, it looks something like this.

If we draw a line connecting the tops of all the F bars, we get something like this:

The shape of the line is determined by the degrees of freedom.

For the red line the sample size is smaller than for the blue line; the blue line, with the larger sample size, narrows towards the x-axis faster than the red one. If the sample size is large relative to the number of parameters in the fit line, the P-value will be smaller.
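The F score itself can be sketched from the sums of squares introduced earlier: variation explained per extra parameter, divided by variation left unexplained per remaining degree of freedom. The SS values and sample size below are hypothetical, carried over from the earlier R² sketch:

```python
def f_statistic(ss_mean, ss_fit, n, p_fit=2, p_mean=1):
    """F = (variation explained by the extra parameters, per extra parameter)
         / (variation not explained, per remaining degree of freedom).
    p_fit: parameters in the fit line (slope and intercept -> 2).
    p_mean: parameters in the mean-only model (just the mean -> 1)."""
    explained = (ss_mean - ss_fit) / (p_fit - p_mean)
    unexplained = ss_fit / (n - p_fit)
    return explained / unexplained

# Hypothetical values: a line that explains almost all the variation
F = f_statistic(ss_mean=38.51, ss_fit=0.092, n=5)
print(F)
```

A large F means the line explains far more variation than it leaves behind; an F near 1 means the extra parameters did essentially nothing.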

For a clearer understanding of the P-value, we first need to understand Hypothesis Testing.

Hypothesis Testing –

What is a hypothesis? Any guess we make is a hypothesis. E.g.

1. On Sundays, Peter always plays basketball.

2. A new vaccine for corona will work great.

3. Sachin always scores 100 on Eden.

4. NASA may have detected a new species.

5. I can eat a dozen eggs at a time. Etc.

If we put all the above guesses to a test, that is known as hypothesis testing.

1. If tomorrow is Sunday, then Peter will be found playing Basketball.

2. If this vaccine is made for corona, then it will work on corona patients.

3. If the match is at Eden, then Sachin will score 100.

4. If any new species has come to Earth, then it would have been detected by NASA.

5. If I had taken part in an egg-eating competition, then I could have eaten a dozen eggs at a time and might have won.

6. If I regularly water a plant, it will grow nice and strong.

7. If I have a good coffee in the morning, then I'll work all day without getting tired. Etc.

You make a guess (a hypothesis) and put it to the test (hypothesis testing). According to the University of California, a good hypothesis should include both an "if" and a "then" statement, should involve an independent and a dependent variable, and should be testable.

Null Hypothesis –

The Null Hypothesis is the default assumption we start from; any accepted fact can serve as a Null Hypothesis. For example: our solar system has eight planets (excluding Pluto); buffalo milk has more fat than cow milk; a ball and a feather dropped freely from the same height in a vacuum will land at the same time.

Now here's a catch: we can either accept the Null Hypothesis or reject it. We test the Null Hypothesis against observations or data; if the hypothesis holds, we accept it, or else we reject it.

Big question: how is this test done?

We evaluate two mutually exclusive statements about a population (millions of records containing independent and dependent variables) using sample data (a small quantity of data chosen randomly from the big dataset). To test any hypothesis we follow a few steps:

1. Make an assumption.

For example, suppose a principal at a certain school claims that the students in his school are of above-average intelligence. A random sample of thirty students' IQ scores has a mean of 112. Is there sufficient evidence to support the principal's claim? The mean population IQ is 100, with a standard deviation of 15.

Here the Null Hypothesis is the accepted fact that the average IQ is 100, i.e. H0: μ = 100, and the Alternate Hypothesis is the principal's claim, H1: μ > 100.

Suppose that after testing, the Null Hypothesis turns out to be true, i.e. the principal's claim that the average IQ is above 100 is wrong: we choose different sets of 30 students, average their IQs, and in most cases find the average is not more than 100. Then the Null Hypothesis is true and we reject the Alternate Hypothesis. But suppose instead that by mischance (say two or three exceptionally brilliant students with very high IQs land in our sample) we calculate an average IQ above 100 and reject the Null Hypothesis, even though the true average really is 100. Rejecting a true Null Hypothesis is a Type 1 error.

Now suppose the Null Hypothesis is actually false: the students' average IQ really is above 100. But our particular sample happens to average close to 100, so we fail to reject the Null Hypothesis. Failing to reject a false Null Hypothesis is a Type 2 error.

It's confusing, though. Okay, let's take another example.

Suppose a person is innocent; he was just accidentally found near a dead body. Here the Null Hypothesis is that the person is innocent, and the Alternate Hypothesis is that he is guilty. If, despite his innocence, the evidence leads the court to convict and punish him, a true Null Hypothesis has been rejected: that is a Type 1 error. But what if the person is actually guilty? He claims he is innocent, and due to lack of evidence he is bailed and set free: the court failed to reject a false Null Hypothesis, which is a Type 2 error.

2. Choose the alpha (α). α is the significance level: the probability of making the wrong decision when the Null Hypothesis is true, i.e. the probability of making a Type 1 error. Generally we choose α = 0.05: if the Null Hypothesis fails in fewer than 5% of cases, we still keep it, but if it fails in more than 5% of cases we reject it and accept the Alternate Hypothesis. For important decisions, as in medical cases or the share market, we do not take α above 0.03, since even a minute error could be a serious risk in these cases.

3. Perform the test.

Z-test

Z = (X̄ − μ₀) / (σ / √n)

Here, X̄ is the sample mean, i.e. the average IQ of the randomly chosen 30 students, which is 112.

μ₀ is the population mean, i.e. the average IQ of all students, which is 100.

σ is the standard deviation, i.e. how much the data varies from the population mean, which is 15.

n is the sample size, which is 30.
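Plugging the example's numbers into the Z formula is a one-liner:

```python
import math

x_bar = 112    # sample mean: average IQ of the 30 sampled students
mu0   = 100    # population mean under the Null Hypothesis
sigma = 15     # population standard deviation
n     = 30     # sample size

# Z = (X-bar - mu0) / (sigma / sqrt(n))
z = (x_bar - mu0) / (sigma / math.sqrt(n))
print(round(z, 2))  # → 4.38
```

The sample mean sits well over four standard errors above the hypothesized population mean, which already hints at how the test will come out below.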

Let's discuss the normal distribution and the Z-score before performing the test.

Normal Distribution

Properties:

· Bell shaped curve

· No skewness

· Symmetrical

· In a normal distribution, mean = median = mode

· The area under the curve is 100%, or 1

· For the standard normal distribution, mean = 0 and standard deviation σ = 1

Z — score

The Z-score tells us how far a data value, a score, or a sample deviates from the mean, measured in standard deviations. With the help of the Z-score we can convert any score or sample from a distribution with some other mean and standard deviation into the standard normal scale, with mean equal to zero and deviation equal to one.
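A minimal sketch of that standardization, with a hypothetical sample:

```python
import numpy as np

# A hypothetical sample with its own mean and spread
data = np.array([55.0, 60.0, 62.0, 70.0, 95.0])

# Z-score each value: subtract the mean, divide by the standard deviation
z_scores = (data - data.mean()) / data.std()

# The rescaled values now have mean 0 and standard deviation 1
print(z_scores.mean(), z_scores.std())
```

Each z-score says how many standard deviations that observation sits from the sample mean, which lets values from very different scales be compared directly.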

From the Z-score table, an area of α = 0.05 corresponds to a Z-score of 1.645, which is smaller than the Z value we got (about 4.38). So we reject the Null Hypothesis in this case.

Now that we have the Z-score, we can calculate the P-value with the help of the normal distribution table:

Looking at the normal distribution table, the tail area beyond a Z value of 3 is already only about 0.001, so the P-value for our Z is far smaller still. Since the P-value < 0.05, we reject the Null Hypothesis.
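Instead of a printed table, the one-tailed tail area can be computed directly; a minimal sketch using the standard library's complementary error function (the identity P(Z > z) = ½·erfc(z/√2) for a standard normal):

```python
import math

def p_value_one_tailed(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Z from the IQ example
p = p_value_one_tailed(4.38)
print(p)

# Sanity check: the table's critical value
print(p_value_one_tailed(1.645))  # ≈ 0.05
```

The P-value for Z = 4.38 comes out in the millionths, vastly below α = 0.05, matching the table-based conclusion above.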

Big, Big Confusion –

We often confuse probability with the P-value, but there is a big difference between them. Let's take an example.

On flipping a coin, the chance of getting a head is 50%. If we flip another coin, again the chance of getting a head is 50%.

Now, what's the probability of getting two heads in a row, and what's the P-value of getting two heads in a row?

On flipping two coins simultaneously:

Total outcomes = HH, HT, TH, TT = 4

Favorable outcome = HH = 1

P(HH) = 1/4 = 0.25, and P(TT) = 1/4 = 0.25

P(HH) + P(TT) = 0.25 + 0.25 = 0.5

The P-value is the probability that random chance generated the data, or something else that is equally rare or rarer.

Therefore, the probability of getting two heads in a row is 0.25, and the P-value for getting two heads in a row is 0.5: the 0.25 for HH plus the 0.25 for TT, which is equally rare.
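The distinction can be checked by brute-force enumeration of the sample space:

```python
from itertools import product

# Enumerate every outcome of two coin flips: HH, HT, TH, TT
outcomes = [''.join(flips) for flips in product('HT', repeat=2)]

# Probability of the specific outcome HH
prob_two_heads = outcomes.count('HH') / len(outcomes)

# P-value: probability of HH plus anything equally rare or rarer (TT)
p_value = (outcomes.count('HH') + outcomes.count('TT')) / len(outcomes)

print(prob_two_heads, p_value)
```

The probability asks only about HH; the P-value also counts the equally surprising TT, which is why it comes out twice as large.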

All the graphical plots are taken from StatQuest. For this article I have followed StatQuest and Krish Naik.

If something is missing here or explained wrongly, please comment and guide me through it.
