For people who don’t like to memorize formulas, this article provides a solid, intuitive description of
- Linear Regression Model
- R²
- p-value
In order to explain these, we also need to clearly understand some fundamental statistical terminology:
- residual
- sum of squared residuals
- least squares
- the line best-fits-all
Use-case for Linear Regression: In supervised learning, trying to predict a numerical value based on observations is known as regression. Let’s assume that we have a dataset of n students’ weights and heights, which are scattered in the figure below. The goal is to train a linear regression model that takes a student’s weight and predicts his/her height.
The core idea is to predict the heights accurately. The linear regression model performs prediction based on a line that best-fits-all samples.
Q1: What does the best-fits-all mean?
Q2: How to find the line that best-fits-all?
1- Residuals and Sum of Squared Residuals:
As illustrated in the figure, a residual is the vertical distance between a data sample point A(x1,y1) and the vertically closest point L(x2,y2) on the line.
The residual between A(x1,y1) and L(x2,y2) is simply
residual_A_L = y1-y2
The sum of squared residuals (SSR): as the formula in the figure shows, it is calculated by squaring every residual (the vertical distance between each data sample and the line) and summing them.
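As a concrete sketch, the residuals and the SSR for one candidate line can be computed like this; the weight/height numbers and the line’s coefficients are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical weight (x) and height (y) samples
x = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
y = np.array([160.0, 166.0, 169.0, 175.0, 178.0])

# A candidate line y = intercept + slope * x (coefficients are made up)
slope, intercept = 0.9, 106.0

# Residual for each sample: the vertical distance y1 - y2
predicted = intercept + slope * x
residuals = y - predicted

# Sum of squared residuals (SSR)
ssr = np.sum(residuals ** 2)
```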
2- Line Best-fits-all and Least Squared
- Initially, we can draw a random line l as illustrated in the figure. First, we calculate the residual (the vertical distance) for each sample. Second, we calculate the sum of squared residuals and name it SSR_1.
- Then we rotate the line a bit, repeat the first and second steps, and name the result SSR_2.
- Iteratively we calculate the SSR values:
- SSR_3
- SSR_4,
- …
- SSR_n
The line which gives the minimum SSR value is the line best-fits-all, and this minimum SSR is known as the least squares.
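The iterative search above can be sketched as a brute-force scan over candidate lines, compared against NumPy’s closed-form least-squares fit; the dataset is hypothetical:

```python
import numpy as np

x = np.array([60.0, 65.0, 70.0, 75.0, 80.0])     # weights
y = np.array([160.0, 166.0, 169.0, 175.0, 178.0])  # heights

def ssr(slope, intercept):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return np.sum((y - (intercept + slope * x)) ** 2)

# Brute force: try many candidate lines, keep the one with the smallest SSR
candidates = [(s, i) for s in np.arange(0.0, 2.0, 0.01)
              for i in np.arange(100.0, 120.0, 0.5)]
best_slope, best_intercept = min(candidates, key=lambda p: ssr(*p))

# Closed-form least-squares answer, for comparison
exact_slope, exact_intercept = np.polyfit(x, y, 1)
```

In practice nobody scans lines one by one; `np.polyfit` (or calculus) finds the minimum-SSR line directly, but the scan makes the idea of “rotate, recompute SSR, keep the smallest” concrete.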
3- R²
In this use-case, the linear regression model takes an input (a student’s weight) and predicts the student’s height. R² measures how good or bad the prediction is.
Let’s assume that this line is mathematically represented as below:
y= 0.15+0.81x
The critical point is that the slope (the coefficient of x) is 0.81. Since the slope is not zero, we accept that this line best-fits-all will be statistically useful when guessing a particular student’s height based on the student’s weight. This assumption also raises another question:
Q3: How good/bad is that guess?
How good or bad the guess is gets measured by R²! To be able to describe R², let’s introduce a few new terminologies:
- Sum of the Squares Around the Mean — SS(mean)
- Variance Around the Mean — Var(mean)
- Sum of Squares Around the Least-Squares Fit — SS(fit)
- Variance Around the Least Squares Fit — Var(fit)
3.1 Sum of the Squares Around The Mean
The Sum of the Squares around the height mean, aka SS(mean), can be easily calculated as follows:
- Calculate the mean of the heights,
- Then calculate the height residuals to the mean; in other words, calculate the vertical distance between height and the mean. This is known as residual around the mean.
- Finally, as illustrated in the figure, the SS(mean) equals the sum of the squared residuals around the mean.
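The three steps above can be sketched as follows, assuming a small hypothetical set of heights:

```python
import numpy as np

# Hypothetical student heights
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])

# Step 1: calculate the mean of the heights
mean_height = heights.mean()

# Step 2: residuals around the mean (vertical distance of each height to the mean)
residuals_around_mean = heights - mean_height

# Step 3: SS(mean) = sum of the squared residuals around the mean
ss_mean = np.sum(residuals_around_mean ** 2)
```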
3.2 Variance Around The Mean -Var(mean)
In general, the variance is the average of the sum of squares, so we can also calculate the variance around the mean, aka Var(mean), by dividing SS(mean) by the number of samples.
3.3 Sum of Squares Around Least-Squares Fit — SS(fit)
Now let’s go back to the line best-fits-all and calculate the sum of squared residuals for this line again. This is known as the sum of squares around the least-squares fit and is represented as SS(fit).
3.4 Variance Around the Least Squares Fit — Var(fit)
In general, the variance is the average of the sum of squares. Thus, the variance around the least-squares fit is as follows:
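Putting sections 3.1 through 3.4 together, here is a minimal sketch of all four quantities on the same hypothetical data:

```python
import numpy as np

weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])
n = len(heights)

# SS(mean): squared residuals around the mean line
ss_mean = np.sum((heights - heights.mean()) ** 2)

# SS(fit): squared residuals around the least-squares line
slope, intercept = np.polyfit(weights, heights, 1)
ss_fit = np.sum((heights - (intercept + slope * weights)) ** 2)

# Variance = average of the sum of squares
var_mean = ss_mean / n
var_fit = ss_fit / n
```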
3.5 R²
In the Linear Regression Model, the variation in the heights is explained by taking the weights into account; in other words, the heavier students are taller, and the lighter students are shorter.
R² tells us how much of the variation in the heights can be explained by taking the weights into account.
3.5.1 What R² indicates/means?
Finally, the R² value is between 0 and 1; for example, R² equal to 0.7 means that the student weights explain 70% of the variation in the student heights.
Furthermore, R² can also be calculated as follows:
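One way to compute R² following this definition, again with hypothetical data: R² = (SS(mean) − SS(fit)) / SS(mean), which equals (Var(mean) − Var(fit)) / Var(mean) since both variances share the same divisor.

```python
import numpy as np

weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])

ss_mean = np.sum((heights - heights.mean()) ** 2)

slope, intercept = np.polyfit(weights, heights, 1)
ss_fit = np.sum((heights - (intercept + slope * weights)) ** 2)

# R²: the fraction of the variation in the heights explained
# by taking the weights into account
r_squared = (ss_mean - ss_fit) / ss_mean
```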
3.6 R² in 3-Dimensional Space
If we had the students’ age as an additional parameter in the dataset,
- then the linear regression algorithm finds a plane that best-fits-all samples,
y = a + bx + cz (where x is the weight and z is the age)
- calculates the residuals between the plane and the data samples
- selects the minimum sum of squared residuals as the least squares, which is SS(fit). Note that in this example the additional age dimension is useless since it doesn’t make SS(fit) smaller; thus, age has no effect when predicting the students’ heights. Still, an equation with more parameters will never make SS(fit) larger than an equation with fewer parameters.
- As a consequence, when we have more parameters, R² can only stay the same or increase; it never gets worse.
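The claim that an extra parameter can never make SS(fit) larger can be checked numerically; the age column below is hypothetical, added only to compare a line fit against a plane fit:

```python
import numpy as np

weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
ages = np.array([15.0, 16.0, 15.0, 17.0, 16.0])  # hypothetical extra feature
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])

def least_squares_ss_fit(X, y):
    """Fit y = X @ coeffs by least squares and return SS(fit)."""
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coeffs) ** 2)

ones = np.ones_like(weights)
# Line: height = a + b * weight
ss_fit_line = least_squares_ss_fit(np.column_stack([ones, weights]), heights)
# Plane: height = a + b * weight + c * age
ss_fit_plane = least_squares_ss_fit(np.column_stack([ones, weights, ages]), heights)
```

The least-squares plane can always reproduce the best line by setting the age coefficient to zero, so its SS(fit) is at most the line’s SS(fit).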
In some cases, especially where there is a lack of samples, R² can be misleading:
- for instance, when there are only two samples, R² will be equal to 1 regardless of what those samples are.
Hence, in some situations, we need to make sure that R² is statistically significant before we can lean on the R² value.
Q4: Can I always trust R² results?
R² is powerful; however, in some situations it is unreliable.
As given in the figure, for instance, suppose the dataset contains only two samples A(x1,y1) and B(x2,y2). Since you can always draw a straight line connecting any two points, these two points already define the line best-fits-all. Thus:
SS(fit) = 0
and regardless of the value of SS(mean)
R²=1
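A quick numerical check of this two-sample case (the points are hypothetical):

```python
import numpy as np

# Only two samples: a straight line through them fits perfectly
weights = np.array([60.0, 80.0])
heights = np.array([160.0, 178.0])

slope, intercept = np.polyfit(weights, heights, 1)
ss_fit = np.sum((heights - (intercept + slope * weights)) ** 2)  # zero
ss_mean = np.sum((heights - heights.mean()) ** 2)

# With SS(fit) = 0, R² is 1 no matter what SS(mean) is
r_squared = (ss_mean - ss_fit) / ss_mean
```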
Q5: How to make sure that the R² value is statistically significant?
The answer is: the p-value.
4. p-Value
Recall the formula of R²:
What this means is that R² equals the variation in the student heights explained by the weights, divided by the total variation in the student heights (without considering the weights). Since we have clarified the R² equation, let’s talk about the p-value and then see the relation between the p-value and R². In order to understand what the p-value is, we first need to talk about the F-distribution.
4.1 F-Distribution
Here is the formula of the F distribution :
First, we focus on the part with which we are already familiar:
This part looks similar to the calculation of R², right? The numerators are the same, and the denominators are different. Let’s now focus on the second part of the F-distribution formula and try to understand what it represents.
- p_fit: number of parameters in the line-best-fits-all
In general, a line is mathematically represented as follows
y = ax+b
That means we have two coefficients (a and b) in the line, so for this example p_fit = 2. [Note that if we had a best-fits-all plane, since the plane formula, y = a + bx + cz, is slightly different from the best-fits-all line, then p_fit = 3.]
- p_mean: this is the number of parameters in the mean line, which is a horizontal line at the mean, mathematically
y = a
That means we have only one coefficient (a) in the mean line,
then (p_fit - p_mean) = (2 - 1) = 1
Finally, the numerator of the F-distribution below represents the variance explained by the extra parameter; in our example, that’s the variance in the students’ heights explained by the students’ weights.
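Putting the pieces together, the F statistic and its p-value can be sketched with SciPy; the data are hypothetical, with p_fit = 2 and p_mean = 1 as above:

```python
import numpy as np
from scipy.stats import f as f_dist

weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])
n = len(heights)
p_fit, p_mean = 2, 1  # parameters in the fitted line vs. the mean line

ss_mean = np.sum((heights - heights.mean()) ** 2)
slope, intercept = np.polyfit(weights, heights, 1)
ss_fit = np.sum((heights - (intercept + slope * weights)) ** 2)

# F = (explained variation per extra parameter) /
#     (unexplained variation per remaining degree of freedom)
F = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# p-value: probability of seeing an F at least this large by chance
p_value = f_dist.sf(F, p_fit - p_mean, n - p_fit)
```

A small p-value here says the weight really does explain a significant part of the height variation, which is exactly the reassurance a suspiciously high R² needs.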
…coming soon…!