Multicollinearity — How to fix it?

FAHAD ANWAR
Published in Analytics Vidhya · Oct 19, 2019

This post answers questions like: What is multicollinearity? What problems arise out of multicollinearity? When do I have to fix multicollinearity? And how do I fix it?

One of the important aspects we have to take care of in regression is multicollinearity. Check this post for an explanation of multiple linear regression and dependent/independent variables.

Just a refresher,

  • The dependent variable is the one we want to predict.
  • An independent variable is one that is used to predict the dependent variable.
  • Our goal in regression is to find out which of the independent variables can be used to predict the dependent variable.

What is Multicollinearity?

Let's see what multicollinearity is and why we should be worried about it.

One of the conditions for a variable to be an independent variable is that it has to be independent of the other variables, i.e., we shouldn't be able to derive its values from the other independent variables.

As we have seen in my previous posts, the equation of the dependent variable with respect to the independent variables can be written as

y = (m1 * X1) + (m2 * X2) + … + (mn * Xn) + c

Let’s take an example of Loan Data.

If X1 = Total Loan Amount, X2 = Principal Amount and X3 = Interest Amount,

then we can derive the value of X1 as X2 + X3. This indicates that there is strong multicollinearity among X1, X2 and X3.

Our Independent Variable (X1) is not exactly independent.
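As a quick illustration (a minimal sketch with made-up numbers, not the loan data used later in this post), the relationship shows up immediately in the correlation matrix:

import pandas as pd

# Hypothetical toy data: the total is exactly principal + interest in every row
toy = pd.DataFrame({
    'principal': [1000, 2000, 1500, 3000],
    'interest':  [100,  250,  180,  360],
})
toy['total'] = toy['principal'] + toy['interest']

# 'total' is fully determined by the other two columns, so its
# correlation with each of them is close to 1
print(toy.corr())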

If you look at the equation, you can see that X1 is accompanied by m1, the coefficient of X1.

In linear regression, the coefficient (m1) represents the mean change in the dependent variable (y) for each 1-unit change in the independent variable (X1), holding all of the other independent variables constant.

For example, in the previous post, we saw the equation for predicted medical expense to be
predicted_expense = (age x 255.3) + (bmi x 318.62) + (children x 509.21) + (smoker x 23240) - (region_southeast x 777.08) - (region_southwest x 765.40)

In the case of smoker, the coefficient is 23,240, which means the predicted expense is 23,240 higher for a smoker than for a non-smoker (provided all other variables are held constant).
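As a quick sanity check of that interpretation, here is a small sketch (with hypothetical values for age, bmi and children) that plugs two otherwise-identical people into the equation above; their predicted expenses differ by exactly the smoker coefficient:

# Coefficients taken from the predicted_expense equation above
# (any intercept term would cancel out in the difference anyway)
def predicted_expense(age, bmi, children, smoker, southeast, southwest):
    return (age * 255.3 + bmi * 318.62 + children * 509.21
            + smoker * 23240 - southeast * 777.08 - southwest * 765.40)

# Two hypothetical people who differ only in the smoker flag
smoker_expense = predicted_expense(40, 25, 2, smoker=1, southeast=0, southwest=0)
non_smoker_expense = predicted_expense(40, 25, 2, smoker=0, southeast=0, southwest=0)

print(smoker_expense - non_smoker_expense)  # 23240.0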

What are the problems that arise out of multicollinearity?

In our loan example, we saw that X1 is the sum of X2 and X3. Because of this relationship, we cannot expect the values of X2 or X3 to be constant when there is a change in X1.
So, in this case, we cannot exactly trust the coefficient value (m1): we don't know the exact effect X1 has on the dependent variable.

In the post Feature Elimination Using p-values, we discussed p-values and how we use them to see whether a feature/independent variable is statistically significant or not.
Since multicollinearity reduces the accuracy of the coefficients, we might not be able to trust the p-values to identify which independent variables are statistically significant.
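To see both problems at once, here is a small simulated sketch (toy data, not the loan dataset): the response depends only on the principal and interest columns, but because the total column is almost exactly their sum, the standard errors of all three coefficients blow up and the individual estimates and p-values become untrustworthy:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
principal = rng.normal(10_000, 2_000, n)
interest = rng.normal(1_000, 200, n)
# 'total' is (almost) exactly principal + interest -> near-perfect collinearity
total = principal + interest + rng.normal(0, 1, n)

# The true relationship only involves principal and interest
y_sim = 0.5 * principal + 2.0 * interest + rng.normal(0, 500, n)

X_sim = sm.add_constant(pd.DataFrame(
    {'total': total, 'principal': principal, 'interest': interest}))
fit = sm.OLS(y_sim, X_sim).fit()

# Inflated standard errors -> unreliable coefficients and p-values
print(fit.params)
print(fit.bse)
print(fit.pvalues)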

When do I have to fix Multicollinearity?

The good news is that multicollinearity only affects the coefficients and p-values; it does not influence the model's ability to predict the dependent variable.

This means that if you only care about the predicted values, you don't really have to worry about multicollinearity. But in some business cases we actually have to focus on an individual independent variable's effect on the dependent variable, and in that case we have to reduce multicollinearity in the data.

We saw what multicollinearity is and what problems it causes. Now we will see how to fix it.

How to fix Multicollinearity?

Once you have decided that multicollinearity is a problem for you and you need to fix it, you need to focus on Variance Inflation Factor (VIF). VIF values help us in identifying the correlation between independent variables.

Before you start, you have to know the range of VIF values and what levels of multicollinearity they signify: a VIF of 1 means the variable is not correlated with the other independent variables, a VIF between 1 and 5 indicates moderate correlation, and a VIF above 5 indicates severe (extreme) multicollinearity.

We usually try to keep multicollinearity at moderate levels, so we have to make sure that the independent variables have VIF values < 5.
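For reference, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all of the other independent variables; this is exactly what the hand-rolled function later in this post computes. As a sketch, statsmodels also ships a built-in helper that does the same calculation (assuming X is a DataFrame of independent variables, like the one built in the next section):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is assumed to be a pandas DataFrame of independent variables
# (including the constant column added by sm.add_constant)
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.sort_values(ascending=False))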

Let’s take the loan data example.

Loan data has the following columns:
loan_amnt: Loan Amount sanctioned
total_pymnt: Total Amount Paid till now
total_rec_prncp: Total Principal Amount Paid till now
total_rec_int: Total Interest Amount Paid till now
term: Term of the loan
int_rate: Interest Rate
loan_status: Status of the loan (Paid or Charged Off)

# Imports (pandas for data handling, statsmodels for OLS/VIF, seaborn for the heatmap)
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

# Load csv into DataFrame
data = pd.read_csv("loan.csv")

# Data Cleanup
data['term'] = data['term'].apply(lambda x: int(x.replace('months', '')))
data['int_rate'] = data['int_rate'].apply(lambda x: float(x.replace('%', '')))
data = data[data['loan_status'] != 'Current']
data['loan_status'] = data['loan_status'].map({'Charged Off': 1, 'Fully Paid': 0})

# Separating independent variables into X and dependent variable into y
X = data[['loan_amnt', 'total_pymnt', 'total_rec_prncp', 'total_rec_int', 'term', 'int_rate']]
y = data['loan_status']
X = sm.add_constant(X)
X.head()

Just to get a peek at the correlations between variables, we can use seaborn's heatmap():

sns.heatmap(X.corr(), annot=True)

The higher the absolute value, the stronger the correlation.

But, this isn’t feasible when the number of columns is high.

Let’s calculate VIF values for each independent column .

# Function to calculate the VIF of every column in the given DataFrame
def calculate_vif(data):
    vif_df = pd.DataFrame(columns=['Var', 'Vif'])
    x_var_names = data.columns
    for i in range(0, x_var_names.shape[0]):
        # Regress the i-th column on all of the other columns
        y = data[x_var_names[i]]
        x = data[x_var_names.drop([x_var_names[i]])]
        r_squared = sm.OLS(y, x).fit().rsquared
        # VIF = 1 / (1 - R^2)
        vif = round(1 / (1 - r_squared), 2)
        vif_df.loc[i] = [x_var_names[i], vif]
    return vif_df.sort_values(by='Vif', axis=0, ascending=False, inplace=False)

Let’s fit a Linear Regression model and check the coefficients. Also , calculate VIF values

lm = sm.OLS(y, X).fit() 
print("Coeffients: \n{0}".format(lm.params))
calculate_vif(X)

Please ignore the 'const' entry for now. The coefficients are really small, probably because these variables have very little influence on the dependent variable, but we are not here to discuss that. Let's focus on the VIF values.

As we can see, total_pymnt, total_rec_prncp and total_rec_int have VIF > 5 (extreme multicollinearity), which is not surprising since total_pymnt = total_rec_prncp + total_rec_int.
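We can sanity-check that relationship directly on the data (a small sketch; the difference may not be exactly zero if total_pymnt also includes items such as fees or recoveries):

# If total_pymnt = total_rec_prncp + total_rec_int holds, this difference
# should be zero (or very small) for every row
residual = (X['total_pymnt'] - X['total_rec_prncp'] - X['total_rec_int']).abs()
print(residual.max())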

To reduce multicollinearity, let’s remove the column with the highest VIF and check the results.

# Dropping total_pymnt as its VIF was highest
X.drop(['total_pymnt'], axis=1, inplace=True)
lm = sm.OLS(y, X).fit()
print("Coefficients: \n{0}".format(lm.params))
calculate_vif(X)

If you notice, removing 'total_pymnt' changed the VIF values of only the variables it was correlated with (total_rec_prncp and total_rec_int).

Comparing the coefficients of the independent variables before and after reducing multicollinearity, there is a significant change between them:
total_rec_prncp: -0.000089 -> -0.000069
total_rec_int: -0.000007 -> 0.000015

So, finally, we were successful in bringing multicollinearity down to moderate levels, and now our independent variables have VIF < 5.

That's it for this post!

Here’s my GitHub for Jupyter Notebooks on Linear Regression. Please feel free to check it out and suggest more ways to reduce multicollinearity here in responses.

Thank you for reading!

Please check out my other posts at Medium and follow me.

Originally published at https://machinelearningmind.com on October 19, 2019.
