Isaac Adegbayibi
Published in dsnaiplusui · 6 min read · Jul 4, 2020


Lesson 04: Correlation between Variables.

Image from investopedia.com

In a study, Exploratory Data Analysis is performed and an interest arises in studying the relationships between some of the variables involved! Then a very important question needs an answer: Can one statistic measure both the strength and the direction of a linear relationship between two variables? Sure! In Statistics, we use the coefficient of correlation to measure the strength and direction of the linear relationship between two numerical variables, say, X and Y. The correlation coefficient for a sample of data is denoted by r.

In this lesson, the concept of correlation will be looked at, starting from its fundamental theory to its application in machine learning.

Nexus between Correlation and Machine Learning.

It is no longer news that AI+Club, University of Ibadan is training members on Machine Learning using the Jupyter Notebook! Correlation in Statistics is important for completing our understanding of the models we build through Machine Learning.

Data correlation is how one set of data may correspond to another set. In Machine Learning, think of how your features correspond with your output.

For example, the image below visualizes a dataset of brain size versus body size. Notice that as the body size increases, so does the brain size. This is known as a linear correlation: put simply, the data follows a straight line.

Figure 1: Brain Weight Vs. Body Weight.
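
A plot like Figure 1 is easy to reproduce in a Jupyter Notebook. The sketch below uses synthetic body/brain weight values (made-up numbers, not the actual dataset behind the figure) just to show the plotting steps:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the brain-vs-body data (not the real dataset)
rng = np.random.default_rng(42)
body_weight = rng.uniform(1, 100, size=60)                     # arbitrary units
brain_weight = 0.9 * body_weight + rng.normal(0, 8, size=60)   # roughly linear + noise

plt.scatter(body_weight, brain_weight)
plt.xlabel("Body Weight")
plt.ylabel("Brain Weight")
plt.title("Brain Weight vs. Body Weight (synthetic example)")
plt.show()
```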

Theory.

Correlation describes the strength of the linear relationship between two variables. A numerical value, called the coefficient of correlation, which always takes a value between -1 and +1, tells us about the relationship between the two variables.

Correlation between two variables can fall into one of three instances. We can obtain a scatter plot from our EDA on the two variables to show us the general direction and nature of the relationship between them. The three instances that can be observed are listed below, followed by a short code sketch that generates an example of each:

  1. Positive Correlation: Both variables change in the same direction. This implies that for every increase in the independent variable, there is a corresponding increase in the value of the dependent variable.
  2. Neutral/Zero Correlation: There is no relationship in how the variables change; the two variables are independent of one another. It can also be said that the independent variable cannot explain or predict the dependent variable in the regression model.
  3. Negative Correlation: This instance tells us that there is an inversely proportional relationship between the variables. This implies that as the values in one variable (independent or dependent) increase, the values in the other variable decrease.
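
As a quick demonstration, here is a minimal sketch on synthetic data (the arrays and seed below are arbitrary choices for illustration, not part of the lesson's dataset) that generates one example of each instance and prints its r:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Positive correlation: y moves in the same direction as x
y_pos = 2 * x + rng.normal(scale=0.5, size=200)
# Negative correlation: y moves in the opposite direction to x
y_neg = -2 * x + rng.normal(scale=0.5, size=200)
# Zero/neutral correlation: y is generated independently of x
y_zero = rng.normal(size=200)

for label, y in [("positive", y_pos), ("negative", y_neg), ("zero", y_zero)]:
    r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
    print(f"{label}: r = {r:.2f}")
```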

The coefficient of correlation for a simple linear regression model (Y = a + bX) is usually denoted by r, and we read its values as follows (a short code sketch follows the list):

  1. The relationship is perfectly linear when r is either -1 or +1.
  2. If the relationship is strong and positive, r will be near +1 (close to +1).
  3. If the relationship is strong and negative, r will be close to -1.
  4. If there is no apparent linear relationship, r will be close to 0.
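
As a quick check of these readings, SciPy's linregress fits the simple linear regression and reports r alongside the slope. The data below is synthetic and only for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=100)  # Y = a + bX + noise

result = stats.linregress(x, y)
print(f"intercept a = {result.intercept:.2f}, slope b = {result.slope:.2f}")
print(f"r = {result.rvalue:.2f}")  # near +1 here: a strong, positive linear relationship
```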

Illustration.

Figure 2: Correlation instances between two variables, say, Y and X.

For simplicity, let’s consider the image above. Two variables are under study in each of the six sub-images above and all six sub-images are obtained from the EDA of the variables.

  • Starting from the top left, it is observed that there is a strong, positive correlation between the two variables, as r = 0.7. The bottom left is the inverse of that scenario, as there exists a strong, negative correlation between the variables, with r = -0.7.
  • For the centre images, the top centre shows us a weak, positive correlation, close to zero, with r = 0.3. The bottom centre is the mirror image of that, showing a weak, negative correlation, r = -0.3.
  • The sub-images to the right are obviously the same! Both show us when the two variables aren't related. This is Zero/Neutral Correlation with r = 0.

That’s it for the illustration!

Computation.

An easy, widely used approach to computing the coefficient of correlation is Pearson's product-moment coefficient, defined below:

The population correlation coefficient $\rho_{X,Y}$ between two random variables $X$ and $Y$, with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$, is defined as

$$\rho_{X,Y} = \operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},$$

where $E$ is the expected value operator, $\operatorname{cov}$ means covariance, and $\operatorname{corr}(X, Y)$ is a widely used alternative notation for the coefficient of correlation. The Pearson correlation is defined only if both standard deviations are finite and positive.

We do not need to memorize the formulas here, as a few lines of code in our console can whip up the values in a very short runtime.
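
For example, in a Jupyter Notebook the sample correlation can be computed straight from the definition or with one-liners from NumPy, SciPy, and pandas. The sketch below uses synthetic arrays purely for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.6, size=50)

# Straight from the definition: r = cov(x, y) / (std(x) * std(y))
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Library one-liners
r_numpy = np.corrcoef(x, y)[0, 1]
r_scipy, p_value = stats.pearsonr(x, y)
r_pandas = pd.Series(x).corr(pd.Series(y))  # Pearson is the default method

print(r_manual, r_numpy, r_scipy, r_pandas)  # all agree up to floating point
```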

Multicollinearity.

Most real-life scenarios would have multiple independent variables trying to explain or predict a dependent variable.

The technique used for estimation in that case, derived from Simple Linear Regression, is Multiple Regression.

But in Multiple Regression, there is often a problem of the independent variables being correlated with each other, sometimes even showing a stronger relationship among themselves than they do with the dependent variable. This problem is known as Multicollinearity.

A proper definition: Multicollinearity is a situation that violates a major assumption of the classical linear regression model, which holds that there is no intercorrelation between two or more independent variables. When multicollinearity is present, it has negative consequences for the estimated regression coefficients.
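
In practice, a common first check is to look at the correlation matrix of the independent variables or to compute variance inflation factors (VIFs). The sketch below does both on synthetic predictors (the variables, seed, and cut-off are illustrative assumptions, not a prescription):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1 -> collinear pair
x3 = rng.normal(size=100)                  # unrelated predictor
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pairwise correlations among the independent variables
print(X.corr())

# Variance inflation factors (a common rule of thumb: VIF > 10 signals trouble)
X_const = sm.add_constant(X)
for i in range(1, X_const.shape[1]):  # skip the constant column
    print(X_const.columns[i], variance_inflation_factor(X_const.values, i))
```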

The following are major sources of multicollinearity:

  • An overdetermined model: Multicollinearity exists here because the model has many more independent variables than the number of observations in the sample data set used for estimation or prediction.
  • Analyzing time series models: In time series analysis, the problem of autocorrelation is more pronounced than multicollinearity, and time-series data is unique because one of the variables involved is time. If time is included as an independent variable in your model, you would have to use time series methods for your analysis, or you would run into severe multicollinearity.
  • Data collection method: sampling over a limited range of values for the independent variables in the population can cause multicollinearity as some parts of the population are not represented in the sample.
  • Other sources are constraints on the model or in the population being sampled, model specification error, etc.

Consequences of Multicollinearity are:

  • The Ordinary Least Squares estimators of the parameters of the model, which are the best linear unbiased estimators, have large variances and covariances, which makes precise estimation difficult; a short simulation of this point follows the list.
  • The covariance of estimates of parameters of collinear variables will be significantly larger.
  • OLS estimators and their standard errors can be sensitive to very small changes in the sample's data, etc.
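
A small simulation makes the first point concrete: adding a near-duplicate of an existing predictor inflates the coefficient standard errors. The sketch below is illustrative only, with made-up data and statsmodels OLS:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # almost identical to x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

# Model without the collinear variable
se_without = sm.OLS(y, sm.add_constant(x1)).fit().bse

# Model with the collinear variable added
se_with = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().bse

print("standard errors without x2:", se_without)
print("standard errors with x2:   ", se_with)  # the coefficient standard errors blow up
```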

Ways to remove or reduce multicollinearity are listed below, with a simple feature-dropping sketch after the list:

  • Increase the sample size.
  • Combine cross-sectional data with time-series data.
  • Transform the variables into a much simpler form for estimation, etc.
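
Another simple remedy often used in practice, related to the points above, is to drop one predictor from every pair that is very highly correlated. The helper below (drop_highly_correlated is a hypothetical name, and the 0.8 threshold is an arbitrary choice) sketches that idea with pandas:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop one column from every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Example with a synthetic, collinear dataset
rng = np.random.default_rng(9)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=100),  # nearly duplicates x1
    "x3": rng.normal(size=100),
})
print(drop_highly_correlated(df).columns.tolist())  # 'x2' is dropped
```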

Thank you for reading! Happy learning from myself and DSN AIPlus UI.
