“Unpacking Multicollinearity: Understanding its Impact on Regression Analysis.”

ajaymehta
8 min read · May 24, 2023


Multicollinearity is a statistical term used to describe a situation in which two or more independent variables in a regression model are highly correlated with each other. When multicollinearity is present, it can cause problems with interpreting the results of the regression model.

To explain with an example, suppose we want to predict a person’s salary based on their CGPA, IQ, and package (salary offered in their previous job). We can create a regression model with these three variables as predictors:

Salary = β0 + β1(CGPA) + β2(IQ) + β3(Package)

However, suppose we find that CGPA and IQ are highly correlated with each other. This means that as one variable increases, the other tends to increase as well. This is a problem because it makes it difficult to determine the independent effect of each variable on the outcome variable (salary). In other words, if we find that people with high CGPA tend to have high salaries, we cannot be sure if this is because of their CGPA or because of their high IQ.
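
To make this concrete, here is a minimal sketch (simulated data, variable names borrowed from the example above, statsmodels assumed as the fitting library) showing what happens when CGPA and IQ are strongly correlated: the overall fit is fine, but the individual CGPA and IQ coefficients are estimated imprecisely.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Simulate CGPA and IQ so that they are highly correlated.
cgpa = rng.normal(7.5, 1.0, n)
iq = 100 + 10 * (cgpa - 7.5) + rng.normal(0, 3, n)
package = rng.normal(10, 2, n)  # previous salary, roughly independent of the others

# True data-generating process: CGPA, IQ and Package all contribute to salary.
salary = 5 + 2.0 * cgpa + 0.5 * iq + 1.5 * package + rng.normal(0, 5, n)

X = sm.add_constant(pd.DataFrame({"CGPA": cgpa, "IQ": iq, "Package": package}))
model = sm.OLS(salary, X).fit()

print(X[["CGPA", "IQ"]].corr())  # correlation close to 1
print(model.summary())           # CGPA and IQ are estimated imprecisely despite both truly mattering
```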

To avoid this problem, we can use statistical methods such as the variance inflation factor (VIF) or a correlation matrix to identify variables with high correlation and remove one of them to reduce multicollinearity.

In summary, multicollinearity is a situation in which independent variables in a regression model are highly correlated with each other, making it difficult to determine the independent effect of each variable on the outcome variable. In the example of predicting salary using CGPA, IQ, and package, multicollinearity can arise if CGPA and IQ are highly correlated with each other and can be addressed by identifying and removing one of the correlated variables.

When is Multicollinearity bad?

Whether multicollinearity is actually a problem depends on what the model is being used for: inference or prediction.

1. Inference:

  • Inference focuses on understanding the relationships between the variables in a model.
  • It aims to draw conclusions about the underlying population or process that generated the data.
  • Inference often involves hypothesis testing, confidence intervals, and determining the significance of predictor variables.
  • The primary goal is to provide insights about the structure of the data and the relationships between variables.
  • Interpretability is a key concern when performing inference, as the objective is to understand the underlying mechanisms driving the data.
  • Examples of inferential techniques include linear regression, logistic regression, and ANOVA.

2. Prediction:

  • Prediction focuses on using a model to make accurate forecasts or estimates for new, unseen data.
  • It aims to generalize the model to new instances, based on the patterns observed in the training data.
  • Prediction often involves minimizing an error metric, such as mean squared error or cross-entropy loss, to assess the accuracy of the model.
  • The primary goal is to create an accurate and reliable model for predicting outcomes, rather than understanding the relationships between variables.
  • Interpretability may be less important in predictive modelling, as the main objective is to create accurate forecasts rather than understanding the underlying structure of the data.
  • Examples of predictive techniques include decision trees, support vector machines, neural networks, and ensemble methods like random forests and gradient boosting machines.

In summary, inference focuses on understanding the relationships between variables and interpreting the underlying structure of the data, while prediction focuses on creating accurate forecasts for new, unseen data based on the patterns observed in the training data. Multicollinearity is mostly a problem for inference, because it blurs the individual contribution of each correlated predictor; for pure prediction it is usually less of a concern, provided the correlation structure among the predictors stays the same in the new data.

What exactly happens in Multicollinearity (Mathematically)?

When multicollinearity is present in a model, it can lead to several issues, including:

  1. Difficulty in identifying the most important predictors: Due to the high correlation between independent variables, it becomes challenging to determine which variable has the most significant impact on the dependent variable.
  2. Inflated standard errors: Multicollinearity can lead to larger standard errors for the regression coefficients, which decreases the statistical power and can make it challenging to determine the true relationship between the independent and dependent variables.
  3. Unstable and unreliable estimates: The regression coefficients become sensitive to small changes in the data, making it difficult to interpret the results accurately (illustrated in the sketch after this list).
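
The second and third issues are easy to see by refitting the same model on slightly different versions of the data. Below is a minimal sketch (simulated data, statsmodels assumed): across two bootstrap resamples, the coefficients of the two nearly identical predictors move around far more than the coefficient of the well-separated one.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100

x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)  # nearly a copy of x1
x3 = rng.normal(0, 1, n)
y = 3 * x1 + 2 * x2 + x3 + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# Refit on two bootstrap resamples of the same data set and compare the
# estimated coefficients [const, x1, x2, x3].
for seed in (10, 11):
    idx = np.random.default_rng(seed).integers(0, n, n)
    print(np.round(sm.OLS(y[idx], X[idx]).fit().params, 2))
```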

Perfect multicollinearity

Perfect multicollinearity occurs when one independent variable in a multiple regression model is an exact linear combination of one or more other independent variables. In other words, there is an exact linear relationship between the independent variables, making it impossible to uniquely estimate the individual effects of each variable on the dependent variable.

In linear regression, the Ordinary Least Squares (OLS) method is used to estimate the coefficients of the independent variables that best fit the observed values of the dependent variable. When there is multicollinearity, the OLS method can encounter problems.

Mathematically, in a multiple regression model, the OLS method tries to minimize the sum of the squared errors between the predicted values and the actual values. The formula for OLS estimates of the regression coefficients (β) is:

β = (XᵀX)⁻¹Xᵀy

where X is the design matrix containing the independent variables, y is the vector of observed values of the dependent variable, and (XᵀX)⁻¹ is the inverse of XᵀX, the cross-product matrix of the independent variables.

When there is multicollinearity, the columns of the design matrix X become nearly or exactly linearly dependent. As a result, the matrix XᵀX becomes nearly or exactly singular, so its inverse either does not exist or is numerically unstable to compute. This can make the OLS estimates of the regression coefficients unstable or, under perfect multicollinearity, impossible to determine uniquely.
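
A small numpy sketch (illustrative data, not taken from the linked notebook) shows this directly: when one column of X is an exact linear combination of two others, XᵀX loses rank, and the textbook formula either fails outright or returns numerically meaningless values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

a = rng.normal(size=n)
b = rng.normal(size=n)
c = 2 * a + 3 * b  # exact linear combination: perfect multicollinearity

X = np.column_stack([np.ones(n), a, b, c])  # design matrix with an intercept column
y = 1 + a + b + rng.normal(size=n)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 3 rather than 4: XᵀX is singular

try:
    beta = np.linalg.inv(XtX) @ X.T @ y  # the textbook formula β = (XᵀX)⁻¹Xᵀy
    print("inv() returned values, but they are not meaningful:", beta)
except np.linalg.LinAlgError as err:
    print("inv(XᵀX) failed:", err)

# A pseudo-inverse still returns *a* solution, but under perfect
# multicollinearity it is only one of infinitely many equally good fits.
print(np.linalg.pinv(X) @ y)
```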

Types of Multicollinearity

There are two main types of multicollinearity:

  1. Structural multicollinearity: This type of multicollinearity arises from the way variables are defined or the construction of the model. It occurs when one independent variable can be expressed as a linear combination of other independent variables. For example, if variable A is equal to 2 times variable B plus 3 times variable C, then there is structural multicollinearity.
  2. Data-driven multicollinearity: This type of multicollinearity occurs due to the specific data being analyzed. It arises when the independent variables in the dataset are highly correlated with each other, regardless of how they are defined or the model construction. It is a result of the observed patterns in the data.

Both types of multicollinearity can pose challenges in regression analysis, as they can lead to unstable or unreliable estimates of the regression coefficients and affect the interpretation of the model. Detecting and addressing multicollinearity is important to ensure the validity and accuracy of the regression results.
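
A short sketch of the two types (hypothetical variables, plain numpy): in the structural case the dependence is exact because one variable is built from the others, while in the data-driven case two separately measured variables simply happen to be strongly correlated in the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Structural multicollinearity: A is *defined* as 2*B + 3*C, so the
# dependence is exact regardless of the data.
B = rng.normal(size=n)
C = rng.normal(size=n)
A = 2 * B + 3 * C
print(np.linalg.matrix_rank(np.column_stack([A, B, C])))  # 2, not 3

# Data-driven multicollinearity: two separately measured variables that
# happen to move together in this particular dataset.
years_experience = rng.uniform(0, 20, n)
age = 22 + years_experience + rng.normal(scale=2, size=n)
print(round(np.corrcoef(years_experience, age)[0, 1], 2))  # high, but not exact
```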

How to Detect Multicollinearity

multicollinearity.ipynb — Colaboratory (google.com)

Correlation

Correlation is a measure of the linear relationship between two variables, and it is commonly used to identify multicollinearity in multiple linear regression models. Multicollinearity occurs when two or more predictor variables in the model are highly correlated, making it difficult to determine their individual contributions to the output variable.

To detect multicollinearity using correlation, you can calculate the correlation matrix of the predictor variables. The correlation matrix is a square matrix that shows the pairwise correlations between each pair of predictor variables. The diagonal elements of the matrix are always equal to 1, as they represent the correlation of a variable with itself. The off-diagonal elements represent the correlation between different pairs of variables.

In the context of multicollinearity, you should look for off-diagonal elements with high absolute values (e.g., greater than 0.8 or 0.9, depending on the specific application and the level of concern about multicollinearity). High correlation values indicate that the corresponding predictor variables are highly correlated and may be causing multicollinearity issues in the regression model.

It’s important to note that while correlation can be a useful tool for detecting multicollinearity, it doesn’t provide a complete picture of the severity of the issue or its impact on the regression model. Other diagnostic measures, such as Variance Inflation Factor (VIF) and condition number, can also be used to assess the presence and severity of multicollinearity in a regression model.
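
A minimal pandas sketch of this check (simulated data, hypothetical column names): compute the correlation matrix of the predictors and flag any pair whose absolute correlation exceeds a chosen threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors; cgpa and iq are constructed to be highly correlated.
df = pd.DataFrame({"cgpa": rng.normal(7.5, 1.0, n)})
df["iq"] = 100 + 10 * (df["cgpa"] - 7.5) + rng.normal(0, 3, n)
df["package"] = rng.normal(10, 2, n)

corr = df.corr()
print(corr.round(2))

# Flag predictor pairs with |correlation| above a threshold (here 0.8).
threshold = 0.8
pairs = [(i, j) for i in corr.columns for j in corr.columns
         if i < j and abs(corr.loc[i, j]) > threshold]
print("highly correlated pairs:", pairs)
```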


Variance Inflation Factor (VIF)

Variance Inflation Factor (VIF) is a metric used to quantify the severity of multicollinearity in a multiple linear regression model. It measures the extent to which the variance of an estimated regression coefficient is increased due to multicollinearity.

For each predictor variable in the regression model, VIF is calculated by performing a separate linear regression using that predictor as the response variable and the remaining predictor variables as the independent variables. The VIF for the predictor is then 1 / (1 − R²), the reciprocal of the proportion of its variance that is not explained by the other predictors. Here, R² is the coefficient of determination of this auxiliary regression, i.e., the regression with the predictor variable as the response.

The VIF calculation can be summarized in the following steps:

  1. For each predictor variable Xᵢ in the regression model, perform a linear regression using Xᵢ as the response variable and the remaining predictor variables as the independent variables.
  2. Calculate the R² value for each of these linear regressions.
  3. Compute the VIF for each predictor variable Xᵢ as VIFᵢ = 1 / (1 − R²ᵢ).
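
These steps translate directly into code. The sketch below implements them with plain numpy on simulated data (statsmodels also provides a variance_inflation_factor helper that performs the same calculation).

```python
import numpy as np

def vif(X):
    """VIF for each column of X via the auxiliary regressions described above.

    X: 2-D array of predictor values, shape (n, p), without an intercept column.
    """
    n, p = X.shape
    vifs = []
    for i in range(p):
        xi = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xi, rcond=None)  # step 1: regress Xi on the rest
        resid = xi - others @ coef
        r2 = 1 - resid.var() / xi.var()                     # step 2: R² of the auxiliary fit
        vifs.append(1.0 / (1.0 - r2))                       # step 3: VIFᵢ = 1 / (1 − R²ᵢ)
    return np.array(vifs)

# Example: two highly correlated predictors and one independent one.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.2, size=300)
x3 = rng.normal(size=300)
print(vif(np.column_stack([x1, x2, x3])).round(1))  # x1 and x2 get large VIFs, x3 stays near 1
```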

A VIF value close to 1 indicates that there is very little multicollinearity for the predictor variable, whereas a high VIF value (e.g., greater than 5 or 10, depending on the context) suggests that multicollinearity may be a problem for the predictor variable, and its estimated coefficient might be less reliable.

Keep in mind that VIF only provides an indication of the presence and severity of multicollinearity and does not directly address the issue. Depending on the VIF values and the goals of the analysis, you might consider using techniques like variable selection, regularization, or dimensionality reduction methods to address multicollinearity.

Condition Number


In the context of multicollinearity, the condition number is a diagnostic measure used to assess the stability and potential numerical issues in a multiple linear regression model. It provides an indication of the severity of multicollinearity by examining the sensitivity of the linear regression to small changes in the input data.

The condition number is calculated as the ratio of the largest to the smallest singular value of the design matrix X (each row representing an observation and each column representing a predictor variable), which is the square root of the ratio of the largest to the smallest eigenvalue of XᵀX. A high condition number suggests that the matrix XᵀX is ill-conditioned and can lead to numerical instability when solving the normal equations for the regression coefficients.

In the presence of multicollinearity, the design matrix X has highly correlated columns, which can cause the eigenvalues of XᵀX to be very different in magnitude (one or more very large eigenvalues and one or more very small eigenvalues). As a result, the condition number becomes large, indicating that the regression model may be sensitive to small changes in the input data, leading to unstable coefficient estimates.
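
As a quick sketch (simulated data, plain numpy): np.linalg.cond computes the singular-value ratio described above, and adding a nearly duplicated column makes it jump by orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a duplicate of x1
x3 = rng.normal(size=n)

X_ok = np.column_stack([np.ones(n), x1, x3])       # well-behaved design matrix
X_bad = np.column_stack([np.ones(n), x1, x2, x3])  # collinear design matrix

for name, X in [("without x2", X_ok), ("with x2", X_bad)]:
    # Ratio of the largest to the smallest singular value of X
    # (the square root of the eigenvalue ratio of XᵀX).
    print(name, "-> condition number ≈", round(np.linalg.cond(X), 1))
```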

Typically, a condition number larger than 30 (or sometimes even larger than 10 or 20) is considered a warning sign of potential multicollinearity issues. However, the threshold for the condition number depends on the specific application and the level of concern about multicollinearity.

It’s important to note that a high condition number alone is not definitive proof of multicollinearity. It is an indication that multicollinearity might be a problem, and further investigation (e.g., using VIF, correlation matrix, or tolerance values) may be required to confirm the presence and severity of multicollinearity.

How to remove multicollinearity

  1. Collect more data: In some cases, multicollinearity might be a result of a limited sample size. Collecting more data, if possible, can help reduce multicollinearity and improve the stability of the model.
  2. Remove one of the highly correlated variables: If two or more independent variables are highly correlated, consider removing one of them from the model. This step can help eliminate redundancy in the model and reduce multicollinearity. Choose the variable to remove based on domain knowledge, variable importance, or the one with the highest VIF.
  3. Combine correlated variables: If correlated independent variables represent similar information, consider combining them into a single variable. This combination can be done by averaging, summing, or using other mathematical operations, depending on the context and the nature of the variables.
  4. Use partial least squares regression (PLS): PLS is a technique that combines features of both principal component analysis and multiple regression. It identifies linear combinations of the predictor variables (called latent variables) that have the highest covariance with the response variable, reducing multicollinearity while retaining most of the predictive power (a minimal sketch follows below).
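
As an illustration of the fourth option, here is a minimal scikit-learn sketch (simulated data; the number of latent components is an illustrative choice): PLSRegression builds latent components from the correlated predictors and regresses the response on those instead of on the raw variables.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Three predictors, two of them highly correlated.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x2 + x3 + rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two latent components: roughly one for the shared x1/x2 direction, one for x3.
pls = PLSRegression(n_components=2).fit(X_train, y_train)
y_pred = pls.predict(X_test).ravel()
print("test R²:", round(r2_score(y_test, y_pred), 3))
```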
