For people who don’t like to memorize formulas, this article provides a solid, intuitive description of
- Linear Regression Model
- R²
- p-value
In order to explain these, we also need to clearly understand some fundamental statistical terminology:
- residual
- sum of squared residuals
- least squares
- the line best-fits-all
Use-case for Linear Regression: In supervised learning, trying to predict a numerical value based on observations is known as regression. Let’s assume that we have a dataset of n students’ weights and heights, which are scattered in the figure below. The goal is to train a linear regression model that takes a student’s weight and predicts his/her height.
The core idea is to predict the heights accurately. The linear regression model performs prediction based on a line that best-fits-all samples.
Q1: What does the best-fits-all mean?
Q2: How to find the line that best-fits-all?
1- Residuals and Sum of Squared Residuals:
As illustrated in the figure, a residual is the vertical distance between a data sample point A(x1,y1) and the vertically closest point L(x2,y2) on the line.
The residual between A(x1,y1) and L(x2,y2) is simply
residual_A_L = y1-y2
The sum of squared residuals (SSR): as the formula in the figure shows, it is calculated by squaring every residual (the vertical distance between each data sample and the line) and summing them.
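As a concrete sketch, the residuals and the SSR for one candidate line can be computed like this; the weight/height numbers and the line’s coefficients are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical weight (x) and height (y) samples
x = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
y = np.array([160.0, 166.0, 169.0, 175.0, 178.0])

# A candidate line y = intercept + slope * x (coefficients are made up)
slope, intercept = 0.9, 106.0

# Residual for each sample: the vertical distance y1 - y2
predicted = intercept + slope * x
residuals = y - predicted

# Sum of squared residuals (SSR)
ssr = np.sum(residuals ** 2)
```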
2- Line Best-fits-all and Least Squared
- Initially, we can draw a random line l as illustrated in the figure. First, we calculate the residual (the vertical distance) for each sample. Second, we calculate the sum of squared residuals and name it SSR_1.
- Then we rotate the line a bit, repeat the first and second steps, and name the result SSR_2.
- Iteratively we calculate the SSR values:
- SSR_3
- SSR_4,
- …
- SSR_n
The line which gives the minimum SSR value is the line best-fits-all, and this minimum SSR is known as the least squares.
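The iterative search above can be sketched as a brute-force scan over candidate lines, compared against NumPy’s closed-form least-squares fit; the dataset is hypothetical:

```python
import numpy as np

x = np.array([60.0, 65.0, 70.0, 75.0, 80.0])     # weights
y = np.array([160.0, 166.0, 169.0, 175.0, 178.0])  # heights

def ssr(slope, intercept):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return np.sum((y - (intercept + slope * x)) ** 2)

# Brute force: try many candidate lines, keep the one with the smallest SSR
candidates = [(s, i) for s in np.arange(0.0, 2.0, 0.01)
              for i in np.arange(100.0, 120.0, 0.5)]
best_slope, best_intercept = min(candidates, key=lambda p: ssr(*p))

# Closed-form least-squares answer, for comparison
exact_slope, exact_intercept = np.polyfit(x, y, 1)
```

In practice nobody scans lines one by one; `np.polyfit` (or calculus) finds the minimum-SSR line directly, but the scan makes the idea of “rotate, recompute SSR, keep the smallest” concrete.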
3- R²
In this use-case, the linear regression model takes an input (a student’s weight) and predicts the student’s height. R² measures how good or bad the prediction is.
Let’s assume that this line is mathematically represented as below:
y= 0.15+0.81x
The critical point is that the slope (the coefficient of x) is 0.81. Since the slope is not zero, we accept that this line best-fits-all will be statistically useful when guessing a particular student’s height based on the student’s weight. This assumption also raises another question:
Q3: How good/bad is that guess?
How good or bad the guess is gets measured by R²! To be able to describe R², let’s introduce a few new terminologies:
- Sum of the Squares Around the Mean — SS(mean)
- Variance Around the Mean — Var(mean)
- Sum of Squares Around the Least-Squares Fit — SS(fit)
- Variance Around the Least Squares Fit — Var(fit)
3.1 Sum of the Squares Around The Mean
The Sum of the Squares around the height mean, aka SS(mean), can be easily calculated as follows:
- Calculate the mean of the heights,
- Then calculate the height residuals to the mean; in other words, calculate the vertical distance between height and the mean. This is known as residual around the mean.
- Finally, as illustrated in the figure, the SS(mean) equals the sum of the squared residuals around the mean.
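The three steps above can be sketched as follows, assuming a small hypothetical set of heights:

```python
import numpy as np

# Hypothetical student heights
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])

# Step 1: calculate the mean of the heights
mean_height = heights.mean()

# Step 2: residuals around the mean (vertical distance of each height to the mean)
residuals_around_mean = heights - mean_height

# Step 3: SS(mean) = sum of the squared residuals around the mean
ss_mean = np.sum(residuals_around_mean ** 2)
```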
3.2 Variance Around The Mean -Var(mean)
In general, the variance is the average of the sum of squares, so we can also calculate the variance around the mean, aka Var(mean), by dividing SS(mean) by the number of samples.
3.3 Sum of Squares Around Least-Squares Fit — SS(fit)
Now let’s go back to the line best-fits-all and calculate the sum of squared residuals for this line again. This is known as the sum of squares around the least-squares fit and is represented as SS(fit).
3.4 Variance Around the Least Squares Fit — Var(fit)
In general, the variance is the average of the sum of squares. Thus, the variance around the least-squares fit is as follows:
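Putting sections 3.1 through 3.4 together, here is a minimal sketch of all four quantities on the same hypothetical data:

```python
import numpy as np

weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])
n = len(heights)

# SS(mean): squared residuals around the mean line
ss_mean = np.sum((heights - heights.mean()) ** 2)

# SS(fit): squared residuals around the least-squares line
slope, intercept = np.polyfit(weights, heights, 1)
ss_fit = np.sum((heights - (intercept + slope * weights)) ** 2)

# Variance = average of the sum of squares
var_mean = ss_mean / n
var_fit = ss_fit / n
```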
3.5 R²
In the Linear Regression Model, the variation in the heights is explained by taking the weights into account; in other words, the heavier students are taller, and the lighter students are shorter.
R² tells us how much of the variation in the heights can be explained by taking the weights into account.
3.5.1 What R² indicates/means?
Finally, the R² value is between 0 and 1; for example, R² equal to 0.7 means that the student weights explain 70% of the variation in the student heights.
Furthermore, R² can also be calculated as follows:
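One way to compute R² following this definition, again with hypothetical data: R² = (SS(mean) − SS(fit)) / SS(mean), which equals (Var(mean) − Var(fit)) / Var(mean) since both variances share the same divisor.

```python
import numpy as np

weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])

ss_mean = np.sum((heights - heights.mean()) ** 2)

slope, intercept = np.polyfit(weights, heights, 1)
ss_fit = np.sum((heights - (intercept + slope * weights)) ** 2)

# R²: the fraction of the variation in the heights explained
# by taking the weights into account
r_squared = (ss_mean - ss_fit) / ss_mean
```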
3.6 R² in 3-Dimensional Space
If we had the students’ age as an additional parameter in the dataset,
- then the linear regression algorithm finds a plane that best-fits-all samples,
y = a + bx + cz (where x is the weight and z is the age)
- calculates the residuals between the plane and the data samples
- selects the minimum sum of squared residuals as the least squares, which is SS(fit). Note that in this example the additional age dimension is useless since it doesn’t make SS(fit) smaller; thus, age has no effect when predicting the students’ heights. Still, an equation with more parameters will never make SS(fit) larger than an equation with fewer parameters.
- As a consequence, when we have more parameters, R² can only stay the same or increase; it never gets worse.
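The claim that an extra parameter can never make SS(fit) larger can be checked numerically; the age column below is hypothetical, added only to compare a line fit against a plane fit:

```python
import numpy as np

weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
ages = np.array([15.0, 16.0, 15.0, 17.0, 16.0])  # hypothetical extra feature
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])

def least_squares_ss_fit(X, y):
    """Fit y = X @ coeffs by least squares and return SS(fit)."""
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coeffs) ** 2)

ones = np.ones_like(weights)
# Line: height = a + b * weight
ss_fit_line = least_squares_ss_fit(np.column_stack([ones, weights]), heights)
# Plane: height = a + b * weight + c * age
ss_fit_plane = least_squares_ss_fit(np.column_stack([ones, weights, ages]), heights)
```

The least-squares plane can always reproduce the best line by setting the age coefficient to zero, so its SS(fit) is at most the line’s SS(fit).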
In some cases, especially where there is a lack of samples, R² can be misleading:
- for instance, when there are only two samples, R² will be equal to 1 regardless of what those samples are.
Hence, in some situations, we need to make sure that R² is statistically significant before we can lean on the R² value.
Q4: Can I always trust R² results?
R² is powerful; however, in some situations it is unreliable.
As given in the figure, for instance, suppose the dataset contains only two samples A(x1,y1) and B(x2,y2). Since you can always draw a straight line connecting any two points, these two points already define the line best-fits-all. Thus:
SS(fit) = 0
and regardless of the value of SS(mean)
R²=1
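A quick numerical check of this two-sample case (the points are hypothetical):

```python
import numpy as np

# Only two samples: a straight line through them fits perfectly
weights = np.array([60.0, 80.0])
heights = np.array([160.0, 178.0])

slope, intercept = np.polyfit(weights, heights, 1)
ss_fit = np.sum((heights - (intercept + slope * weights)) ** 2)  # zero
ss_mean = np.sum((heights - heights.mean()) ** 2)

# With SS(fit) = 0, R² is 1 no matter what SS(mean) is
r_squared = (ss_mean - ss_fit) / ss_mean
```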
Q5: How to make sure that the R² value is statistically significant?
The answer is: the p-value.
4. p-Value
Recall the formula of R²:
What this means is that R² equals the variation in the student heights explained by the weights, divided by the total variation in the student heights (without considering the weights). Since we have clarified the R² equation, let’s talk about the p-value and then see the relation between the p-value and R². In order to understand what the p-value is, we first need to talk about the F-distribution.
4.1 F-Distribution
Here is the formula of the F distribution :
First, we focus on the part with which we are already familiar:
This part looks similar to the calculation of R², right? The numerators are the same, and the denominators are different. Let’s now focus on the second part of the F-distribution formula and try to understand what it represents.
- p_fit: number of parameters in the line-best-fits-all
In general, a line is mathematically represented as follows
y = ax+b
That means we have two coefficients (a and b) in the line, so for this example p_fit = 2. [Note that if we had a best-fits-all plane, since the plane formula, y = a + bx + cz, is slightly different from the best-fits-all line, then p_fit = 3.]
- p_mean: this is the number of parameters in the mean line, which is a horizontal line at the mean, mathematically
y = a
That means we have only one coefficient (a) in the mean line,
then (p_fit - p_mean) = (2 - 1) = 1
Finally, the numerator of the F-distribution below represents the variance explained by the extra parameter; in our example, that’s the variance in the students’ heights explained by the students’ weights.
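Putting the pieces together, the F statistic and its p-value can be sketched with SciPy; the data are hypothetical, with p_fit = 2 and p_mean = 1 as above:

```python
import numpy as np
from scipy.stats import f as f_dist

weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
heights = np.array([160.0, 166.0, 169.0, 175.0, 178.0])
n = len(heights)
p_fit, p_mean = 2, 1  # parameters in the fitted line vs. the mean line

ss_mean = np.sum((heights - heights.mean()) ** 2)
slope, intercept = np.polyfit(weights, heights, 1)
ss_fit = np.sum((heights - (intercept + slope * weights)) ** 2)

# F = (explained variation per extra parameter) /
#     (unexplained variation per remaining degree of freedom)
F = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# p-value: probability of seeing an F at least this large by chance
p_value = f_dist.sf(F, p_fit - p_mean, n - p_fit)
```

A small p-value here says the weight really does explain a significant part of the height variation, which is exactly the reassurance a suspiciously high R² needs.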
…coming soon…!