# Measuring the Goodness of Fit: R² versus Adjusted R²

**Linear Regression** deals with modeling a linear relationship between a dependent variable and one or more explanatory variables. Essentially, we want to fit a line through our data points.

But there can be multiple candidate lines. To choose the best one, we need a measure of goodness of fit: the **Coefficient of Determination, R².**

*Before you proceed, here is a list of **5 must-read articles** on Regression to brush up on your basics:*

- *A Complete Tutorial to learn Data Science in R from scratch*
- *5 Questions which can teach you Multiple Regression (with Python and R)*
- *7 Types of Regression Techniques you should know!*
- *A Complete Tutorial on Ridge and Lasso Regression in Python*
- *45 Questions to test a Data Scientist on Regression*

You can also check out (and participate in) a detailed discussion on this question with answers from other data science experts here:

*https://discuss.analyticsvidhya.com/t/difference-between-r-square-and-adjusted-r-square/264/15*

# Coefficient of Determination — R²

Consider the case of **simple regression**, often pictured as a Venn diagram of two overlapping circles.

Circle Y represents the **variation** in our dependent variable, and circle X the variation in our independent variable.

The overlap of the circles represents the **extent to which the variation in Y is explained by the variation in X.**

The coefficient *r²* is the numerical measure of this overlap and is called the **Coefficient of Determination**. This becomes the **Multiple Coefficient of Determination** *R²* in the case of Multiple Regression.

To state formally,

R² measures the proportion of the total variation in Y explained by the regression model.

This variation is measured in terms of the sum of squared errors from the mean.

If we have **n** observations *(X_i, Y_i)* and *f(.)* is our predictor function, then

*R² = 1 − Σ (Yᵢ − f(Xᵢ))² / Σ (Yᵢ − Ȳ)²*

where *Ȳ* is the mean of the observed *Yᵢ*.

Here are some key points about R²:

- It is a non-negative quantity with range *0 ≤ R² ≤ 1*.
- R² = 0 implies that the regression line does not fit the data at all.
- R² = 1 implies that the regression line is a perfect fit.
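These properties can be checked with a short computation. Below is a minimal sketch on made-up toy data (the numbers are illustrative, not from the article):

```python
import numpy as np

# Toy data: y is roughly linear in x, with some noise
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(0, 2, size=20)

# Fit a simple least-squares line
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation about the mean
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))                    # close to 1 for a near-linear trend
```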

## Problem with R² — Value increases with the number of explanatory variables

Think about it: R² is the ratio of the explained variance to the total variance. On adding a new variable, the explained variance, and hence the value of R², **will increase, or at the very least, not decrease**.

However, this **does not** mean that the model with the added variable is better than the model without it. R² can therefore be misleading when used to compare models with different numbers of predictors.
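This non-decreasing behaviour is easy to demonstrate. The sketch below (hypothetical data) fits nested least-squares models with and without a pure-noise predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)      # a predictor unrelated to y

def r2(X, y):
    # Least-squares fit with an intercept column
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_one = r2(x1.reshape(-1, 1), y)
r2_two = r2(np.column_stack([x1, noise]), y)
print(r2_one, r2_two)           # the second is never smaller than the first
```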

# Adjusted R²

Adjusted R² is a modified version of R² that adjusts for the number of predictors. It penalizes the addition of unnecessary features and allows comparison of regression models with different numbers of predictors.

It is defined as

*Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)*

where **k** is the number of explanatory variables in the model and **n** is the number of observations.

**The value of adjusted R² is never greater than that of R².**

The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance.

Also, note that the value of adjusted R² can be negative.

Obtaining a negative value for Adjusted R² can indicate one or more of the following:

- The linear model is a poor fit for the data
- The number of predictors is large
- The number of samples is small
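Since adjusted R² only needs R², n, and k, it is simple to compute by hand. A small sketch with illustrative numbers:

```python
def adjusted_r2(r2, n, k):
    # Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A weak fit with many predictors and few samples goes negative
print(adjusted_r2(0.10, n=12, k=5))   # -0.65
# A strong fit with a reasonable sample size stays close to R²
print(adjusted_r2(0.81, n=32, k=5))   # ≈ 0.77
```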

# R² and Adjusted R² in Python

Generate a random dataset first.
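The original snippet is not reproduced in the text; a minimal sketch of such a dataset (the coefficients and seed are assumptions for illustration) might look like:

```python
import numpy as np

# Hypothetical random dataset: 50 samples, 6 features.
# y is made to depend (noisily) on X[:, 2] and X[:, 4] only,
# mirroring the relationships described in the text.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 6))
y = 1.5 * X[:, 2] + 2.0 * X[:, 4] + rng.normal(scale=0.5, size=50)
```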

**X** has 6 features. Plotting the dependent variable **y** against each of them suggests that *y versus X[2]* or *y versus X[4]* could plausibly be linear relationships.

Let us fit a line using **Ordinary Least Squares** regression between **y** and **X[4]** using the *statsmodels* library in Python. The summary() function reports both the R² and Adjusted R² coefficients.

Now add another explanatory variable, **X[3]**, to see the effect on both coefficients.

Interesting! The value of R² increased while the value of adjusted R² dropped. **X[3]** is an insignificant feature to add to the linear relationship. This is further confirmed on finding the adjusted R² for the OLS regression of *y* versus *X[3]* individually: we observe that it is negative.

You can also use the **r2_score** function from the metrics module in **sklearn**. However, there is no such function for the adjusted R² score; we need to compute it from its formula.
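A minimal sketch of this approach, on assumed synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Assumed synthetic data, as in the earlier sketches
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 6))
y = 1.5 * X[:, 2] + 2.0 * X[:, 4] + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))

# sklearn has no adjusted-R² helper, so apply the formula directly
n, k = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)
```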

As another example, running this regression analysis on the famous **mtcars** dataset (dropping the categorical variables) in sklearn yields an R² score of about 0.81 and an adjusted R² score of about 0.77.

# End Notes

Although both R² and adjusted R² are measures of goodness of fit, linear regression models should not be built with the sole goal of maximizing these coefficients. This will eventually tempt us to introduce many insignificant predictors in our model. As a result, our model will be inaccurate and unreliable.