Linear Regression, R², and p-value

Caner
7 min read · Dec 4, 2019

For people who don’t like to memorize formulas, this article provides a solid description of

  • Linear Regression Model
  • R²
  • p-value

To explain these, we also need a clear understanding of a few fundamental statistical terms:

  • residual
  • sum of squared residuals
  • least squares
  • the line that best-fits-all

Use-case for Linear Regression: In supervised learning, predicting a numerical value from observations is known as regression. Let’s assume we have a dataset containing the weights and heights of n students, shown as a scatter plot in the figure below. The goal is to train a linear regression model that takes a student’s weight and predicts that student’s height.

The core idea is to predict the heights accurately. The linear regression model performs prediction based on a line that best-fits-all samples.

Q1: What does best-fits-all mean?

Q2: How do we find the line that best-fits-all?

1- Residuals and Sum of Squared Residuals:

As illustrated in the figure, a residual is the vertical distance between a data sample A(x1,y1) and the point L(x2,y2) on the line directly above or below it, so x2 = x1.

The residual between A(x1,y1) and L(x2,y2) is simply

residual_A_L = y1-y2

The sum of squared residuals (SSR) is calculated by squaring every residual (the vertical distance between each data sample and the line) and adding the squares together:

SSR = (residual_1)² + (residual_2)² + ... + (residual_n)²
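The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the five weight/height pairs and the candidate line y = 110 + 0.8x are made-up values, and the function names are hypothetical.

```python
# Hypothetical sample data: weights (x) and heights (y) of five students.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]
heights = [150.0, 160.0, 165.0, 175.0, 180.0]


def residuals(xs, ys, intercept, slope):
    """Vertical distance from each sample to the line y = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, intercept, slope):
    """Square every residual and add the squares together."""
    return sum(r ** 2 for r in residuals(xs, ys, intercept, slope))


# An arbitrary candidate line (assumed values), not necessarily the best one.
candidate_residuals = residuals(weights, heights, 110.0, 0.8)
candidate_ssr = sum_squared_residuals(weights, heights, 110.0, 0.8)

print(candidate_residuals)
print(candidate_ssr)
```

Squaring keeps positive and negative residuals from cancelling each other out, which is why the sum of squares, rather than the plain sum, is used to score a line.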

2- Line Best-fits-all and Least Squared

  • Initially, we draw a random line l, as illustrated in the figure. First, we calculate the residual (the vertical distance) for each sample. Second, we calculate the sum of squared residuals and name it SSR_1.
  • Then we rotate the line a bit, repeat the first and second steps, and name the result SSR_2.
  • We keep going iteratively, calculating SSR_3, SSR_4, ..., SSR_n.

The line that gives the minimum SSR value is the line that best-fits-all, and choosing the line this way, by minimizing the sum of squared residuals, is known as least squares.
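The rotate-and-compare procedure describes the idea; for a single input variable, the minimum can also be found directly with the standard closed-form least-squares formulas, which is the technique swapped in below. The sketch assumes made-up sample data and hypothetical function names, and then checks that nudging the fitted line in any direction only increases SSR.

```python
# Hypothetical sample data: weights (x) and heights (y) of five students.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]
heights = [150.0, 160.0, 165.0, 175.0, 180.0]


def ssr(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = least_squares_line(weights, heights)
best_ssr = ssr(weights, heights, intercept, slope)

# Shift and/or rotate the fitted line slightly; every variant should score worse.
nudged_ssrs = [
    ssr(weights, heights, intercept + d_int, slope + d_slope)
    for d_int in (-1.0, 0.0, 1.0)
    for d_slope in (-0.05, 0.0, 0.05)
    if (d_int, d_slope) != (0.0, 0.0)
]

print(intercept, slope, best_ssr)
print(min(nudged_ssrs) > best_ssr)
```

Note that the search has two degrees of freedom: a line can be shifted up and down (intercept) as well as rotated (slope), so both are varied in the check.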

3- R²

In this use-case, the linear regression model takes an input, which is a student’s weight and predicts the student’s height. R² measures how good or bad the prediction is.

Let’s assume that this line is represented mathematically as below:

y = 0.15 + 0.81x

The critical point is that the slope (the coefficient of x) is 0.81. Since the slope is not zero, knowing a student’s weight changes our guess of the height, so we accept for now that the line that best-fits-all is useful for guessing a particular student’s height from the student’s weight. This assumption also raises another question:

Q3: How good/bad is that guess?

How good or bad the guess is gets measured by R²! To be able to describe R², let’s introduce a few new terms:

  • Sum of Squares Around the Mean, SS(mean)
  • Variance Around the Mean, Var(mean)
  • Sum of Squares Around the Least-Squares Fit, SS(fit)
  • Variance Around the Least-Squares Fit, Var(fit)

3.1 Sum of the Squares Around The Mean

The sum of squares around the height mean, aka SS(mean), can be calculated as follows:

  • Calculate the mean of the heights.
  • Then calculate each height’s residual to the mean; in other words, the vertical distance between the height and the mean. This is known as the residual around the mean.
  • Finally, as illustrated in the figure, SS(mean) equals the sum of the squared residuals around the mean: SS(mean) = (height_1 - mean)² + ... + (height_n - mean)²

3.2 Variance Around The Mean -Var(mean)

In general, a variance is an average of squares, so we can also calculate the variance around the mean, aka Var(mean), by dividing the sum of squares around the mean by the number of samples n:

Var(mean) = SS(mean) / n
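As a small worked example, SS(mean) and Var(mean) can be computed as below. The heights are made-up values, and the sketch divides by n, following the "average of squares" description here; a sample variance would divide by n - 1 instead.

```python
# Hypothetical heights of five students.
heights = [150.0, 160.0, 165.0, 175.0, 180.0]

n = len(heights)
mean_height = sum(heights) / n

# Residual around the mean: vertical distance between each height and the mean.
residuals_around_mean = [h - mean_height for h in heights]

# SS(mean): sum of the squared residuals around the mean.
ss_mean = sum(r ** 2 for r in residuals_around_mean)

# Var(mean): average of those squares (dividing by n, as assumed above).
var_mean = ss_mean / n

print(mean_height, ss_mean, var_mean)
```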

3.3 Sum of Squares Around Least-Squares Fit — SS(fit)

Now let’s go back to the line that best-fits-all and calculate the sum of squared residuals around this line. This is known as the sum of squares around the least-squares fit, SS(fit).

3.4 Variance Around the Least Squares Fit — Var(fit)

In general, a variance is an average of squares. Thus the variance around the least-squares fit is as follows:

Var(fit) = SS(fit) / n

3.5 R²

In the linear regression model, part of the variation in the heights is explained by taking the weights into account; in other words, heavier students tend to be taller and lighter students tend to be shorter.

R² tells us how much of the variation in the heights can be explained by taking the weights into account:

R² = (Var(mean) - Var(fit)) / Var(mean)

3.5.1 What does R² indicate?

R² takes a value between 0 and 1. For example, R² = 0.7 means that the student weights explain 70% of the variation in the student heights.

Furthermore, since the division by n cancels out, R² can also be calculated from the sums of squares:

R² = (SS(mean) - SS(fit)) / SS(mean)
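The sketch below computes R² both ways, from the variances and from the sums of squares, and shows that the two agree. The data are made up, and the line is fitted with the standard closed-form least-squares formulas for one predictor, which is an assumption of this example rather than something the article prescribes.

```python
# Hypothetical sample data: weights (x) and heights (y) of five students.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]
heights = [150.0, 160.0, 165.0, 175.0, 180.0]
n = len(heights)

# Closed-form least-squares line for a single predictor.
x_mean = sum(weights) / n
y_mean = sum(heights) / n
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(weights, heights))
sxx = sum((x - x_mean) ** 2 for x in weights)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Sums of squares around the mean and around the fitted line.
ss_mean = sum((y - y_mean) ** 2 for y in heights)
ss_fit = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(weights, heights))

# Variances are the sums of squares averaged over n.
var_mean = ss_mean / n
var_fit = ss_fit / n

r2_from_var = (var_mean - var_fit) / var_mean
r2_from_ss = (ss_mean - ss_fit) / ss_mean

print(r2_from_var, r2_from_ss)
```

With these particular values the weights explain almost all of the variation in the heights, which is a property of the made-up data, not a typical result.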

3.6 R² in 3-Dimensional Space

If we had the student’s age as an additional parameter in the dataset,

  • then the linear regression algorithm finds a plane that best-fits-all samples, y = a + bx + cz, where x is the weight and z is the age,
  • calculates the residuals between the plane and each data sample,
  • and selects the plane with the minimum sum of squared residuals as the least-squares fit, which gives SS(fit). Note that if the additional age dimension is useless, that is, it cannot make SS(fit) smaller, least squares simply sets its coefficient c to zero, and age then has no effect when predicting the student’s height. This means an equation with more parameters will never make SS(fit) worse than an equation with fewer parameters.
  • Consequently, when we have more parameters, R² can never get worse either; it can even improve purely by chance.
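The claim that extra parameters never make SS(fit) worse can be checked with a small experiment. This is a sketch under stated assumptions: the weights, heights, and ages are made-up values, the helper names are hypothetical, and the fit is obtained by solving the normal equations with a tiny Gauss-Jordan routine, which is fine for a toy example but is not how a numerical library would do it.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs stay untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(rows, ys):
    """Least-squares coefficients via the normal equations (X^T X) c = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def ss_fit(rows, ys, coeffs):
    """Sum of squared residuals between the samples and the fitted model."""
    return sum(
        (y - sum(c * v for c, v in zip(coeffs, r))) ** 2 for r, y in zip(rows, ys)
    )


# Hypothetical data; the ages are arbitrary and not meant to be informative.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]
ages = [19.0, 18.0, 21.0, 18.0, 20.0]
heights = [150.0, 160.0, 165.0, 175.0, 180.0]

# Each row starts with 1.0 so the first coefficient acts as the intercept.
line_rows = [[1.0, w] for w in weights]
plane_rows = [[1.0, w, a] for w, a in zip(weights, ages)]

ss_fit_line = ss_fit(line_rows, heights, fit_least_squares(line_rows, heights))
ss_fit_plane = ss_fit(plane_rows, heights, fit_least_squares(plane_rows, heights))

print(ss_fit_line, ss_fit_plane)
```

The plane can always reproduce the line by setting the age coefficient to zero, so its SS(fit) is at most that of the line. Any improvement seen with arbitrary ages is an example of R² rising by chance.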

This matters in some cases, especially when there are very few samples:

  • for instance, when there are only two samples, R² will be equal to 1 regardless of what the samples are.

Hence, in some situations, we need to check that R² is statistically significant before we lean on its value.

Q4: Can I always trust R² results?

R² is powerful; however, in some situations it is unreliable.

As shown in the figure, suppose the dataset contains only two samples, A(x1,y1) and B(x2,y2). Since you can always draw a straight line through any two points, the line connecting them is already the line that best-fits-all. Thus,

SS(fit) = 0

and, regardless of the value of SS(mean),

R² = 1
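A quick way to see this is to fit a line to several unrelated pairs of points and compute R² each time. The sketch below uses made-up pairs and a hypothetical r_squared helper based on the closed-form least-squares line; it assumes the two samples differ in both x and y, otherwise the slope or SS(mean) would involve a division by zero.

```python
def r_squared(xs, ys):
    """R² of the closed-form least-squares line through the samples."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_mean = sum((y - y_mean) ** 2 for y in ys)
    ss_fit = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return (ss_mean - ss_fit) / ss_mean


# Three unrelated, hypothetical two-sample datasets (weights, heights).
two_sample_datasets = [
    ([50.0, 90.0], [150.0, 180.0]),
    ([50.0, 90.0], [180.0, 150.0]),
    ([61.0, 62.0], [140.0, 199.0]),
]

r2_values = [r_squared(xs, ys) for xs, ys in two_sample_datasets]
print(r2_values)
```

Even the second dataset, where the heavier student is shorter, and the third, where one kilogram supposedly accounts for 59 centimetres, score a perfect fit, which is why R² alone cannot be trusted with so few samples.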

Q5: How do we make sure that the R² value is statistically significant?

The answer is the p-value.

4. p-Value

Recall the formula of R²:

R² = (SS(mean) - SS(fit)) / SS(mean)

This means that R² equals the variation in the student heights explained by weight, divided by the variation in the student heights without taking the weights into account. Now that we have clarified the R² equation, let’s talk about the p-value and then see how the p-value and R² are related. To understand what a p-value is, we first need to talk about the F-distribution.

4.1 F-Distribution

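Here is the formula of the F value, the statistic that follows the F-distribution:

F = [ (SS(mean) - SS(fit)) / (p_fit - p_mean) ] / [ SS(fit) / (n - p_fit) ]

where n is the number of samples. First, we focus on the part we are already familiar with:

(SS(mean) - SS(fit)) / SS(fit)

This part looks similar to the calculation of R², right? The numerators are the same and the denominators are different: R² divides by SS(mean), while here we divide by SS(fit). Let’s now focus on the second part of the formula and try to understand what it represents.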

  • p_fit: number of parameters in the line-best-fits-all

In general, a line is mathematically represented as follows

 y = ax+b

That means the line has two coefficients (a and b), so for this example p_fit = 2. [Note that if we had a best-fits-all plane instead, its formula would be slightly different from the line’s, y = ax + bz + c, and then p_fit = 3.]

  • p_mean: the number of parameters in the mean line, which is mathematically a horizontal line at the mean height:
y = a

That means we have only one coefficient (a) in the mean line,

then (p_fit - p_mean) = (2 - 1) = 1

Finally, the numerator of the F formula, (SS(mean) - SS(fit)) / (p_fit - p_mean), represents the variance explained by the extra parameter; in our example, that’s the variance in the students’ heights explained by the students’ weights.
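To make the pieces concrete, the sketch below computes the F value for a made-up dataset. The numerator follows the description above; the denominator SS(fit) / (n - p_fit) is taken from the standard form of the F statistic and is an assumption here, since the article has not described it yet. Turning F into a p-value is not covered by this sketch.

```python
# Hypothetical sample data: weights (x) and heights (y) of five students.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]
heights = [150.0, 160.0, 165.0, 175.0, 180.0]
n = len(heights)

# Closed-form least-squares line for a single predictor.
x_mean = sum(weights) / n
y_mean = sum(heights) / n
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(weights, heights))
sxx = sum((x - x_mean) ** 2 for x in weights)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

ss_mean = sum((y - y_mean) ** 2 for y in heights)
ss_fit = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(weights, heights))

p_fit = 2   # the fitted line has a slope and an intercept
p_mean = 1  # the mean line has only its height

# Variation explained by the extra parameter, per extra parameter.
explained = (ss_mean - ss_fit) / (p_fit - p_mean)

# Variation left unexplained by the fit, per remaining degree of freedom
# (standard F-statistic denominator, assumed here).
unexplained = ss_fit / (n - p_fit)

f_value = explained / unexplained
print(f_value)
```

A large F value means the fit explains much more variation than it leaves unexplained. With only two samples, n - p_fit would be zero, so F cannot be computed at all, which mirrors the two-point R² problem.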

…coming soon…!
