R2 Score: Linear Regression

Deependra Verma
May 28, 2023


The R2 score, also known as the coefficient of determination, is a statistical measure used to assess the goodness of fit of a regression model. It indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. In other words, it measures how well the regression model fits the observed data.

The R2 score typically ranges from 0 to 1, where 0 indicates that the model does not explain any of the variability in the dependent variable and 1 indicates that the model perfectly predicts the dependent variable. (It can even be negative when a model fits the data worse than simply predicting the mean, as discussed below.)
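As a quick illustration, here is a minimal sketch (using scikit-learn's r2_score and made-up numbers) of how the score is typically computed in code:

```python
from sklearn.metrics import r2_score

# Made-up observed values and model predictions, for illustration only
y_true = [30000, 35000, 40000, 45000, 50000]
y_pred = [31000, 34000, 41000, 44000, 52000]

print(r2_score(y_true, y_pred))  # close to 1, since the predictions track the data well
```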

Real understanding of R2 score:

We have all heard this definition many times, but do we really understand what the R2 score means? Many of us have come across the term while using linear regression, and if not, you will likely encounter it soon, whether in data science, industrial or organisational operations, agricultural science, medical research, or many other fields.

So, without losing a second, let us take a very simple example. Suppose you have data on the salary packages of all the candidates in a particular organisation along with their years of experience, and someone asks you what the salary of a new candidate will be this year. The first thing that comes to mind is the average of all the salaries. Let us plot the data along with the mean line:

Salary vs. Years of Experience (with Mean Value as a Predictor)

But is this correct?

As we know, the error is the difference between the actual value and the predicted value. Since this prediction depends only on the average of all past data, it will deviate from the actual values considerably, producing large errors, as we can also see from the graph.

But we have one more feature, "years of experience," and if we plot the relationship, salary turns out to depend on it roughly linearly. So we can build a linear model to find the best fit line (which will help us predict) from the input data. If we plot the graph again, now showing the best fit line along with the data, it looks like this:

Salary vs. Years of Experience (with the best fit line as a predictor)

Now the prediction is based on the best fit line (which comes from the linear equation where the independent variable is "years of experience" and the dependent variable is "salary"), and the error is noticeably reduced.
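The idea can be sketched in a few lines of Python (the salary figures below are made up for illustration; the point is only to compare the squared error of the mean predictor with that of the fitted line):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary (illustrative values only)
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
salary = np.array([35000, 40000, 48000, 52000, 60000, 65000, 72000, 78000], dtype=float)

# Baseline: predict the mean salary for every candidate
mean_pred = np.full_like(salary, salary.mean())
sse_mean = np.sum((salary - mean_pred) ** 2)

# Best fit line from simple linear regression
model = LinearRegression().fit(experience, salary)
line_pred = model.predict(experience)
sse_line = np.sum((salary - line_pred) ** 2)

print(f"Sum of squared error (mean line):       {sse_mean:.0f}")
print(f"Sum of squared error (regression line): {sse_line:.0f}")  # much smaller
```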

Here comes the role of the R2 score: the R2 score tells you how much better the regression line (the best fit line) is compared to the mean line.

For this reason, the R2 score is also called the "coefficient of determination" or "goodness of fit".

Calculation of the R2 score:

For the calculation of the R2 score, we need two quantities: the first is the sum of squared error by the regression line, and the second is the sum of squared error by the mean line.

The R2 score formula is:

R2 = 1 − (SSE_regression / SSE_mean)

where,

SSE_regression = the sum of squared error by the regression line
SSE_mean = the sum of squared error by the mean line

In other words, it compares the regression line with the mean line. So the R2 score varies from 0 to 1: it is 0 when the sum of squared error by the mean line equals the sum of squared error by the regression line, and 1 when the sum of squared error by the regression line is zero.
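Putting the formula into code, here is a small self-contained sketch (with illustrative numbers) showing that computing R2 from these two sums of squared errors gives the same result as scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative actual values and regression-line predictions only
y_true = np.array([35000, 40000, 48000, 52000, 60000], dtype=float)
y_pred = np.array([36000, 41000, 46000, 53000, 59000], dtype=float)

sse_regression = np.sum((y_true - y_pred) ** 2)      # error of the regression line
sse_mean = np.sum((y_true - y_true.mean()) ** 2)     # error of the mean line

r2_manual = 1 - sse_regression / sse_mean
print(r2_manual, r2_score(y_true, y_pred))  # the two values agree
```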

A few points to remember:

1. R2 = 0:

When the sum of squared error by the regression line is equal to the sum of squared error by the mean line, then R2 = 0. It means the regression model is not taking advantage of the input data and is behaving just like the mean line, so it adds no predictive value over simply using the mean.

2. R2 = 1:

When the sum of squared error by the regression line is zero, then R2 = 1. It means the regression line is not committing any error on the given data (which, in practice, can be a sign of overfitting). The more we reduce the error of the regression line, the closer the R2 score of the model gets to 1.

3. R2 < 0:

When the sum of squared error by the regression line is greater than the sum of squared error by the mean line, R2 becomes negative. This is a rare case and indicates the worst kind of model: one that fits the data worse than simply predicting the mean. It generally happens when we apply linear regression to highly nonlinear data.
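These three cases can be reproduced with a short sketch (toy numbers only, using scikit-learn's r2_score, which follows the same formula):

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Case 1: predicting the mean everywhere -> R2 = 0
print(r2_score(y, np.full_like(y, y.mean())))                 # 0.0

# Case 2: perfect predictions -> R2 = 1
print(r2_score(y, y))                                         # 1.0

# Case 3: predictions worse than the mean line -> R2 < 0
print(r2_score(y, np.array([50.0, 40.0, 30.0, 20.0, 10.0])))  # negative
```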

Significance of the R2 score:

  1. As we know, we have to evaluate a regression model after building it. Common metrics for this are mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE), and the R2 score is used in the same way. In the above example, suppose the permissible error is 10,000: after calculating MSE, MAE, and RMSE, we can say the model is working well if their values are below that permissible error. In other words, MAE, MSE, and RMSE need a reference point to validate the model. The R2 score, by contrast, is unitless and scale-free, so it does not require any reference point to judge the model. As a rough rule of thumb, an R2 score greater than 0.75 is generally considered good (see the sketch after this list).
  2. We also know that a higher R2 score means a better regression model. Suppose the R2 score is 0.80; what does that mean? It means that 80% of the variation in the output of the regression model is explained by the input values, while the remaining 20% is due to factors the model does not capture. So in the above example, an R2 score of 0.80 means that years of experience explain 80% of the variation in the candidates' salaries.
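To make the contrast concrete, here is a small sketch (with made-up salary data) computing MAE, RMSE, and the R2 score side by side; MAE and RMSE come out in salary units and need a reference point such as the permissible error of 10,000, while R2 is unitless:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative salaries and model predictions only
y_true = np.array([35000, 40000, 48000, 52000, 60000, 65000], dtype=float)
y_pred = np.array([36000, 39000, 47000, 54000, 58000, 66000], dtype=float)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# MAE and RMSE are in the same units as salary and must be compared against a
# permissible error to judge the model; R2 can be interpreted on its own scale.
print(f"MAE:  {mae:.0f}")
print(f"RMSE: {rmse:.0f}")
print(f"R2:   {r2:.3f}")
```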

Conclusion:

In conclusion, the R2 score is a statistical measure used to evaluate the goodness of fit of a regression model. It indicates the proportion of the variance in the dependent variable that can be explained by the independent variables. A higher R2 score suggests a better fit, while a lower score indicates less explanatory power. The R2 score is valuable for comparing models, assessing performance, and understanding the relationship between variables.
