R-Square(R²) and Adjusted R-Square

Ujjawal Verma · Published in Analytics Vidhya · Jan 2, 2020 · 6 min read


Hi everyone! Today we will be talking about R-Square and Adjusted R-Square, so if you want to learn more about the goodness of fit of your linear model, stay with me. The objective of any regression exercise is to explain the variation in the dependent variable Y. Once a regression model is built, the next step is to evaluate its performance and understand how good it is against a benchmark model.

In this blog we will discuss the topics mentioned below:

  • What is R²?
  • How to calculate R²?
  • Range of R²
  • What is a good R² value?
  • Limitations of R²
  • Adjusted R²

What is R²?

R-square (R²), also known as the coefficient of determination, is the proportion of variation in Y explained by the independent variables X. It is a measure of the goodness of fit of the model.

If R² is 0.8, it means 80% of the variation in the output can be explained by the input variables. So, in simple terms, the higher the R², the more variation is explained by your input variables, and hence the better your model.

How to Calculate R-square (R²) ?

R² is computed from the ratio between the residual sum of squares and the total sum of squares: it equals one minus that ratio.

  • SSR (Sum of Squares of Residuals) is the sum of the squares of the differences between the actual observed values (y) and the predicted values (ŷ)
  • SST (Total Sum of Squares) is the sum of the squares of the differences between the actual observed values (y) and the average of the observed y values (ȳ)

Let us understand these terms with the help of an example. Consider a simple example where we have some observations on how the experience of a person affects the salary.

We have the black line, which is the regression line depicting where the predicted values of Salary lie with respect to experience along the x-axis. The stars represent the actual salaries, i.e., the observed y values, for each level of experience. The cross marks represent the predicted salary (ŷ) for each observed value of experience.

SSR = Σᵢ (yᵢ − ŷᵢ)² and SST = Σᵢ (yᵢ − ȳ)², where the sums run over i = 1, …, n and n is the number of observations.

Note: minimizing SSR is the fitting criterion for a regression line. That is, conceptually the regression algorithm chooses the best regression line for a given set of observations by comparing the SSR of candidate lines; the line with the least SSR is the best-fitting line.

The black line in the image above denotes where the average Salary (ȳ) lies with respect to experience.

R-squared can now be calculated as:

R² = 1 − SSR / SST
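To make this concrete, here is a minimal sketch in Python (the experience/salary numbers below are invented for illustration) that fits a line, computes SSR and SST by hand, and checks the result against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented observations: years of experience vs. salary (in $1000s)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([35, 40, 48, 55, 59, 68, 74, 80], dtype=float)

# Fit the regression line y_hat = b0 + b1 * x by ordinary least squares
b1, b0 = np.polyfit(x, y, deg=1)   # polyfit returns highest degree first
y_hat = b0 + b1 * x

ssr = np.sum((y - y_hat) ** 2)       # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares

print(1 - ssr / sst)        # R² computed by hand
print(r2_score(y, y_hat))   # same value from scikit-learn
```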

Range of R-square (R²)

Generally, it is said that the range of R² is 0 to 1, but it actually runs from −∞ to 1.

R² = 0 indicates a poor fit of the regression line to the data, i.e., no linear relationship between X and Y.

R² = 1 indicates a perfect fit.

R² is negative when the prediction is so bad that the residual sum of squares becomes greater than the total sum of squares.

And what does a negative R-square mean?

It means that the model is performing worse than a horizontal line that predicts the mean value every time.
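As a quick illustration (with made-up numbers): a model that always predicts the mean gets R² = 0, while a model whose predictions sit far from the data gets a negative R².

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([10.0, 12.0, 11.0, 13.0, 12.5])

# Predicting the mean every time gives SSR == SST, so R² = 0
print(r2_score(y, np.full_like(y, y.mean())))   # 0.0

# A terrible model (always predicts 100) gives SSR >> SST, so R² < 0
print(r2_score(y, np.full_like(y, 100.0)))      # large negative number
```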

What is a good R² value?

A value of 0 indicates that the dependent variable cannot be explained by the independent variables at all.

A value of 1 indicates that the dependent variable can be explained perfectly, without error, by the independent variables.

(Figure: example fits with R² increasing from left to right.)

Now consider a hypothetical situation where all the predicted values exactly match the actual observations in the dataset. In this case, y equals ŷ for every observation, so SSR is zero and R² = 1.

In another scenario, if the predicted values lie far away from the actual observations, SSR grows very large. This increases the ratio SSR/SST and hence decreases R²; once SSR exceeds SST, R² becomes negative.

Thus R² helps us determine how well a model fits: the closer R² is to one, the better the regression.

When is R-square negative?

Appearances can be deceptive: despite the name, R² is not actually the square of R. So while it is surprising to see something called "squared" take a negative value, it is not impossible.

R² will be negative when the best-fit line or curve does an awful job of fitting the data. This can only happen when you fit a poorly chosen model (perhaps by mistake), or you apply constraints to the model that don't make sense (perhaps you entered a positive number when you intended a negative one).

Example

Above is a simple example. The blue line is a straight-line fit constrained to intercept the Y axis at Y = 150 when X = 0. SSR is the sum of the squares of the distances of the red points from this blue line. SST is the sum of the squares of the distances of the red points from the green horizontal line. Since SSR is much larger than SST, the R² (for the fit of the blue line) is negative.

If R² is negative, check that you picked an appropriate model, and set any constraints correctly.
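Here is a hedged sketch of that scenario (the data points below are invented): forcing the line through an intercept of 150 when the data sit around 30-50 makes SSR far exceed SST, so R² comes out negative.

```python
import numpy as np

# Invented data that sit around y = 30..50
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([50, 45, 42, 38, 33, 30], dtype=float)

# Constrain the intercept to 150: fit y - 150 = b1 * x with no free intercept
b1 = np.sum(x * (y - 150)) / np.sum(x ** 2)   # least-squares slope under the constraint
y_hat = 150 + b1 * x

ssr = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print(1 - ssr / sst)   # negative: the constrained line is worse than the mean
```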

Limitations of R²

As above, we considered a simple example with two variables, Experience and Salary, where we predict salary based on an employee's experience. The regression can be written as:

Salary = b₀ + b₁ · Experience

The problem arises if we add another variable to this equation, say:

Salary = b₀ + b₁ · Experience + b₂ · X₂

Once we add a new variable to the model, SSR can only decrease (or stay the same) while SST is unaffected, so the ratio SSR/SST decreases and the value of R² increases. This is the limitation of R²: when we add variables, R² never decreases.

So after adding a variable, you cannot tell whether it actually improved your model, because R² never decreases; it always rises (or stays the same) when a variable is added, as the sketch below illustrates. To overcome this problem, we use Adjusted R².
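A small sketch of this effect (synthetic data, scikit-learn): even a column of pure random noise nudges R² upward, although it carries no information about salary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: salary depends only on experience
n = 50
experience = rng.uniform(1, 10, size=n)
salary = 30 + 5 * experience + rng.normal(0, 3, size=n)

X_one = experience.reshape(-1, 1)                          # the real predictor
X_two = np.column_stack([experience, rng.normal(size=n)])  # plus a junk predictor

r2_one = LinearRegression().fit(X_one, salary).score(X_one, salary)
r2_two = LinearRegression().fit(X_two, salary).score(X_two, salary)

print(r2_one, r2_two)   # r2_two >= r2_one even though the extra column is noise
```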

Adjusted R²

Like R², Adjusted R² measures the variation in the dependent variable explained by the model, but it also accounts for the number of predictors.

The formula for Adjusted R-square:

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

where n is the number of observations and p is the number of predictors.

While R² never decreases as variables are added, the fraction (n − 1) / (n − p − 1) increases as p grows, which inflates the penalty term (1 − R²) · (n − 1) / (n − p − 1).

Thus Adjusted R² imposes a cost on adding variables to the regression, so it can decrease when variables are added.
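Putting the formula into a small helper function (the R² values below are hypothetical): with n = 50 observations, a negligible bump in R² from a second predictor is outweighed by the penalty, so Adjusted R² drops.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical: adding a junk predictor nudges R² from 0.900 to 0.901
print(adjusted_r2(0.900, n=50, p=1))   # ~0.8979
print(adjusted_r2(0.901, n=50, p=2))   # ~0.8968 -> lower despite the higher R²
```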

Hence, Adjusted R² will only increase when the added variable improves the fit by more than the penalty it incurs.

Note that Adjusted R² is always less than or equal to R².

Therefore, it is recommended to use Adjusted R² over R² when measuring the goodness of fit of the model.

So this is all about R² and Adjusted R². I hope you enjoyed it! 👍

If you have any questions or suggestions, please let me know!

Thank You! 😊
