# Covariance & Correlation

**Covariance**

**Introduction:**

Covariance is a measure of how changes in one variable are associated with changes in a second variable. Specifically, covariance measures the degree to which two variables are linearly associated. However, it is also often used informally as a general measure of how monotonically related two variables are.

In probability theory and statistics, covariance is a measure of how much two random variables change together. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behavior, the covariance is positive. For example, as a balloon is blown up it gets larger in all dimensions. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. If a sealed balloon is squashed in one dimension then it will expand in the other two. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.

Variables whose covariance is zero are called uncorrelated variables.

Covariance can be calculated as

Cov(X, Y) = E[(X − E(X)) (Y − E(Y))]

Where E(X) = mean of variable X

E(Y) = mean of variable Y

Various tools offer functions to compute the covariance between variables. In Excel, the COVAR() function (COVARIANCE.P() and COVARIANCE.S() in newer versions) returns the covariance between two variables, and in SAS the COV option of PROC CORR reports covariances.
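Outside spreadsheet tools, the same quantity can be computed directly from the definition. A minimal sketch in Python with numpy, using invented toy data, where the population covariance is the mean of the products of the deviations from each mean:

```python
import numpy as np

# Toy data: y rises with x, so the covariance should be positive.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Population covariance: E[(X - E(X)) * (Y - E(Y))]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)  # 4.0
```

The built-in `np.cov(x, y, bias=True)[0, 1]` gives the same value; without `bias=True` numpy divides by n − 1 (sample covariance) instead of n.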

**Applications:**

- By using covariance, a portfolio manager can identify if the portfolio is adequately diversified

**Correlation**

**Introduction:**

Correlation refers to the extent to which two variables have a linear relationship with each other. It is a statistical technique that can show whether, and how strongly, variables are related. It is a scaled version of covariance, and its values range from -1 to +1.

It can be calculated as

Corr(X, Y) = Cov(X, Y) / (σX σY)

where σX and σY are the standard deviations of X and Y.
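Because correlation is just covariance rescaled by the two standard deviations, the scaling can be verified numerically. A minimal Python sketch with invented toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Population covariance, then scale by the two standard deviations.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov_xy / (x.std() * y.std())
print(corr)  # ~1.0 -- y is an exact linear function of x
```

`np.corrcoef(x, y)[0, 1]` performs the same scaling internally and returns the same value.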

**Difference between Covariance and Correlation:**

In probability theory and statistics, the mathematical concepts of covariance and correlation are very similar. Both describe the degree to which two random variables or sets of random variables tend to deviate from their expected values in similar ways.

If X and Y are two random variables, with means μX and μY, and standard deviations σX and σY, respectively, then their covariance and correlation are as follows:

cov(X, Y) = E[(X − μX)(Y − μY)]

corr(X, Y) = cov(X, Y) / (σX σY)

Where E is the expected value operator. Notably, correlation is dimensionless while covariance is in units obtained by multiplying the units of the two variables. The correlation of a variable with itself is always 1 (except in the degenerate case where the two variances are zero, in which case the correlation does not exist).

**Ways to detect Correlation between variables:**

1. Graphical method: When doing bivariate analysis between two continuous variables, we should look at a scatter plot. It is a handy way to spot the relationship between two variables: the pattern of the scatter plot indicates the relationship, which can be linear or non-linear.

A scatter plot shows the relationship between two variables but does not indicate its strength. To measure the strength of the relationship, we use statistical techniques.

2. Non-graphical method: Build the correlation matrix to understand the strength of the relationships between variables. Correlation varies between -1 and +1:

a. -1: Perfect negative linear correlation

b. +1: Perfect positive linear correlation

c. 0: No correlation
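The non-graphical method above can be sketched in Python with pandas, whose `DataFrame.corr()` returns the pairwise Pearson correlation matrix. The data here is invented: `x2` is built to be nearly collinear with `x1`, while `x3` is independent noise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = df["x1"] * 2 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
df["x3"] = rng.normal(size=200)                            # independent noise

corr = df.corr()  # Pearson correlation matrix, entries in [-1, +1]
print(corr.round(2))
```

The diagonal is exactly 1 (each variable with itself), the x1-x2 entry is close to +1, and the x1-x3 entry is close to 0.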

**Ideal assumptions:**

1. High correlation between the dependent and independent variables.

2. Low correlation among the independent variables.

Generally, if the correlation between two independent variables is high (>= 0.8), we drop one of them; otherwise it may lead to a multicollinearity problem. Various tools offer functions to compute the correlation between variables. In Excel, the CORREL() function returns the correlation between two variables, and SAS uses the procedure PROC CORR. These functions return the Pearson correlation coefficient.
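The >= 0.8 screening rule above can be automated: scan the correlation matrix and report every pair of predictors that crosses the threshold. A small sketch with an invented helper `high_corr_pairs` and toy data:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return predictor pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(scale=0.05, size=100),  # near-duplicate of x1
                   "x3": rng.normal(size=100)})                  # unrelated

print(high_corr_pairs(df))  # [('x1', 'x2')] -- x3 survives the screen
```

For each flagged pair, one member would then be dropped before fitting the regression.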

# Multicollinearity

**Introduction:**

Multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables (Independent variables) in a regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them.

**Types of multicollinearity:**

There are two types of multicollinearity:

1. Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor *x*² from the predictor *x*.

2. Data-based multicollinearity, on the other hand, is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
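Structural multicollinearity is easy to reproduce: over a strictly positive range, a predictor and its square move together almost linearly. A short numeric check (the range is an arbitrary choice for illustration):

```python
import numpy as np

# Over a strictly positive range, x and x**2 increase together almost
# linearly, so the derived predictor is highly correlated with the original.
x = np.linspace(1.0, 10.0, 50)
x_sq = x ** 2

r = np.corrcoef(x, x_sq)[0, 1]
print(round(r, 3))  # close to 1 (~0.98) despite the non-linear link
```

Centering x before squaring is a common way to reduce this kind of artifact.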

**Detect Multicollinearity:**

Indicators that multicollinearity may be present in a model include the following:

1. Large changes in the estimated regression coefficients when a predictor variable is added or deleted

2. Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the joint hypothesis that those coefficients are all zero (using an F-test)

3. If a multivariable regression finds an insignificant coefficient of a particular explanator, yet a simple linear regression of the explained variable on this explanatory variable shows its coefficient to be significantly different from zero, this situation indicates multicollinearity in the multivariable regression.

4. VIF (Variance inflation factor) can be used to detect multicollinearity in the regression model. For the j-th predictor, tolerance = 1 − Rⱼ² and VIF = 1 / tolerance, where Rⱼ² is the R² obtained by regressing the j-th predictor on all the other predictors. A large VIF (commonly above 5 or 10) flags multicollinearity.
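The VIF indicator above can be computed directly from its definition: regress each predictor on the others, take the R², and invert the tolerance. A minimal numpy sketch with an invented helper `vif` and toy data:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the remaining columns
    (plus an intercept), then return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)                  # independent
X = np.column_stack([x1, x2, x3])

print(vif(X, 0))  # large: collinearity with x2 inflates the variance
print(vif(X, 2))  # close to 1: x3 is not explained by the others
```

Libraries such as statsmodels provide an equivalent ready-made function, but the computation is only a least-squares fit per column.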

**Why is this a problem?**

- Collinearity tends to inflate the variance of at least one estimated regression coefficient.

- This can cause at least some regression coefficients to have the wrong sign.

**Ways of dealing with collinearity**

- Ignore it. If prediction of the y values is the object of your study, then collinearity is not a problem.

- Get rid of the redundant variables by using a variable selection technique.

There are multiple techniques to select variables that are less correlated yet highly important:

1. Correlation method

2. PCA (Principal Component Analysis)

3. SVD (Singular Value Decomposition)

4. Machine learning algorithms (Random Forest, Decision trees)
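Of the techniques listed, PCA and SVD are closely related: principal components can be obtained from the SVD of the centred data matrix, and they are uncorrelated by construction, which removes the collinearity. A minimal numpy-only sketch with invented toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
X = np.column_stack([x1,
                     x1 + rng.normal(scale=0.1, size=300),  # redundant copy of x1
                     rng.normal(size=300)])                 # independent column

Xc = X - X.mean(axis=0)             # centre each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                  # principal-component scores

# Principal components are mutually uncorrelated by construction.
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 3))            # identity matrix: off-diagonals ~0
```

Regressing on the leading components instead of the raw predictors (principal components regression) is one standard remedy when predictors are redundant.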

# Remedies for multicollinearity

1. Drop one of the variables. An explanatory variable may be dropped to produce a model with significant coefficients. However, you lose information (because you’ve dropped a variable). Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable.

2. Obtain more data, if possible. This is the preferred solution. More data can produce more precise parameter estimates (with lower standard errors), as seen from the formula in variance inflation factor for the variance of the estimate of a regression coefficient in terms of the sample size and the degree of multicollinearity.

3. Try seeing what happens if you use independent subsets of your data for estimation and apply those estimates to the whole data set. Theoretically you should obtain somewhat higher variance from the smaller datasets used for estimation, but the expectation of the coefficient values should be the same. Naturally, the observed coefficient values will vary, but look at how much they vary.

4. Standardize your independent variables. This may help reduce a false flagging of a condition index above 30.
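Standardizing, as suggested in step 4, means rescaling each predictor to zero mean and unit standard deviation (a z-score). A one-liner in numpy, with arbitrary toy values:

```python
import numpy as np

# Two columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column
print(Xs.mean(axis=0))  # ~[0, 0]
print(Xs.std(axis=0))   # [1, 1]
```

Standardizing changes the scale of the coefficients but not the fit itself; its benefit here is numerical, taming artificially large condition indices driven by mismatched units.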

5. It has also been suggested that using the Shapley value, a game theory tool, the model could account for the effects of multicollinearity. The Shapley value assigns a value for each predictor and assesses all possible combinations of importance.

6. If the correlated explanators are different lagged values of the same underlying explanator, then a distributed lag technique can be used, imposing a general structure on the relative values of the coefficients to be estimated.