Relationship aka Correlation between Data

Charles Patel · Published in DataDrivenInvestor · Aug 30, 2020

In machine learning and data science, data exploration (or data analysis) is the primary step in building an effective model. While doing data analysis, we first need to understand the data and its purpose, then the nature and distribution of the data (whether it is Gaussian, i.e. normally, distributed or distribution-free), and then think about which statistical methods can be used to extract information useful for building the model. This helps us make a better model. Additionally, by doing so, we can filter out undesired features and keep only those that provide useful information to the model, in turn reducing model complexity.

Once we figure out the basic properties mentioned above for a feature, we can apply statistical methods accordingly. It is comparatively easy to extract and understand the statistics of a single feature. But what if we want to understand the statistical relationship between two or more features?

To find a relationship between features, we need to calculate a correlation coefficient and perform some tests to extract valuable information. Before applying such metrics and tests, we should determine whether the method is compatible with the type of data at hand. I’ll cover all these details in this post.

The features in our dataset can be dependent or independent of each other for various reasons. In order to find out to what extent they are related to each other, we often use covariance and correlation metrics or some statistical methods to quantify the relationship among them.

Covariance

The relationship between two features can be found by calculating the covariance. For two features X and Y with n samples and means x̄ and ȳ, the sample covariance is cov(X, Y) = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / (n - 1).

Covariance alone is hard to interpret because its magnitude depends on the scale of the features. So, to make the measure comparable across features, we generally normalize it using the different correlation methods described below.
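As a quick illustration (with made-up numbers), the sample covariance can be computed by hand or with NumPy, and both agree:

```python
import numpy as np

# Two hypothetical features
x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.0, 12.0, 14.0, 16.0])

# Sample covariance: sum((x - mean_x) * (y - mean_y)) / (n - 1)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the full covariance matrix; entry [0, 1] is cov(x, y)
cov_np = np.cov(x, y)[0, 1]

print(cov_manual, cov_np)  # both 3.45
```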

Correlation

The relationship, or correlation, between two features is the covariance normalized by the product of their standard deviations: corr(X, Y) = cov(X, Y) / (σ_X · σ_Y).

In this blog, we will cover all popular correlation methods and try to understand when to use these methods on different types of data.

There are two primary types of correlation methods:

  • Parametric correlation
  • Non-parametric correlation

While finding a correlation between features, if the features have a Gaussian distribution or a linear relationship, we apply parametric correlation methods, whereas if a feature is distribution-free or its distribution is unknown, we apply non-parametric correlation methods. A heatmap helps us visualize the correlation between all features at once.

Heatmap of pairwise feature correlations
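A minimal sketch of such a heatmap using pandas and Matplotlib on synthetic data (all column names and the correlated pair are made up; seaborn's `heatmap` is a popular alternative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df["b"] = df["a"] * 0.8 + rng.normal(scale=0.3, size=100)  # make one correlated pair

corr = df.corr()  # pairwise Pearson correlation matrix by default

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), corr.columns)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
fig.savefig("heatmap.png")
```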

As a general rule of thumb, when the features are nominal (categorical) or ordinal, then a non-parametric test should be selected and when the features are continuous, then a parametric test should be selected.

Parametric correlation: Parametric tests assume that the features you are working with are normally distributed, i.e. follow a bell curve. The most common parametric test to examine the strength of association between two features is

  • Pearson correlation

Non-parametric correlation: Non-parametric tests are referred to as distribution-free tests because they make no assumptions about the distribution of the data. Some of them are listed below:

  • Spearman’s Rank Correlation
  • Kendall’s rank correlation coefficient
  • Goodman and Kruskal’s Rank Correlation
  • Somers’ Rank Correlation

Pearson correlation

A Pearson correlation is used when assessing the relationship between two continuous variables. Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of their standard deviations.

Pearson’s correlation: r = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / √(Σᵢ (xᵢ - x̄)² · Σᵢ (yᵢ - ȳ)²)

The coefficient returns a value between -1 and 1, ranging from negative correlation to positive correlation; a value of 0 means no correlation. Values below -0.5 or above 0.5 are generally considered to indicate a strong correlation between features. A perfect positive correlation is +1 and a perfect negative correlation is -1.

The correlation between two features that have a normal (Gaussian) distribution can be calculated as the Pearson correlation. But for features that do not follow a normal distribution (for example, ordinal features), we generally use rank correlation. Let’s understand rank correlation.
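A quick sketch with `scipy.stats.pearsonr` on made-up, roughly linear data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical continuous features, roughly y = 2x with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.0, 8.2, 10.1])

# pearsonr returns the coefficient and a two-sided p-value
r, p_value = pearsonr(x, y)
print(f"Pearson r = {r:.4f}, p = {p_value:.2e}")
```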

What is the Rank correlation?

Rank correlation measures the ordinal relationship between features or the relationship between the rankings of different ordinal variables.

Ranks are assigned by giving rank 1 to the biggest value in a column, 2 to the second biggest, and so on; equal values receive the mean (average) of the ranks they span. Rank correlation coefficients can then be calculated to represent the association between the two ranked variables.

Rank correlation methods are referred to as non-parametric correlation or distribution-free correlation. To know more about how to calculate ranks, take a look here.
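The ranking scheme described above can be reproduced with `scipy.stats.rankdata` (sample scores are made up); note that `rankdata` gives rank 1 to the smallest value, so we negate the values to rank the biggest first:

```python
from scipy.stats import rankdata

scores = [55, 78, 78, 91, 62]

# rankdata assigns rank 1 to the smallest value; ties get the average rank
ranks_low = rankdata(scores)  # [1. 3.5 3.5 5. 2.]

# To give rank 1 to the biggest value (as described above), rank the negated values
ranks_high = rankdata([-s for s in scores])
print(ranks_high)  # [5. 2.5 2.5 1. 4.]
```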

Spearman’s Correlation

It measures the strength and direction of the association between two ranked variables. It is used when the assumptions of the Pearson correlation are violated. The equation for Spearman’s correlation is shown below:

Spearman’s correlation: ρ = 1 - (6 Σᵢ dᵢ²) / (n(n² - 1))

where dᵢ is the difference between the ranks of the i-th pair and n is the number of pairs.

It calculates Pearson’s correlation, but on the ranks of the features rather than their raw values. Here’s an example of calculating Spearman’s correlation coefficient.
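A small sketch with `scipy.stats.spearmanr` on made-up monotonic data, also verifying that it equals Pearson’s correlation computed on the ranks:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical data: monotonic but non-linear relationship
x = np.array([10, 20, 30, 40, 50])
y = np.array([1, 2, 9, 16, 25])

rho, p = spearmanr(x, y)

# Equivalent by hand: Pearson correlation of the ranks
rho_manual = np.corrcoef(rankdata(x), rankdata(y))[0, 1]

print(rho, rho_manual)  # both 1.0 for a perfectly monotonic pair
```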

Kendall’s Rank Correlation

It calculates a normalized score for the number of matching or concordant rankings between the two features.

Kendall’s Tau = (C - D) / (C + D)

where C is the number of concordant pairs and D is the number of discordant pairs.

A pair of observations is concordant when the observation with the larger rank on one feature also has the larger rank on the other, and discordant when the ordering is reversed.

Intuitively, the Kendall correlation between two features will be high when observations have a similar rank between the two features and low when observations have a dissimilar rank between the two features. Here’s the example to calculate Kendall’s rank correlation coefficient.
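Kendall’s tau is available in SciPy; the sketch below (with made-up rankings) also counts concordant and discordant pairs by hand to match the formula:

```python
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

tau, p = kendalltau(x, y)

# Count concordant/discordant pairs directly: tau = (C - D) / (C + D)
C = D = 0
n = len(x)
for i in range(n):
    for j in range(i + 1, n):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            C += 1  # concordant: both features order the pair the same way
        elif s < 0:
            D += 1  # discordant: the orderings disagree

tau_manual = (C - D) / (C + D)
print(tau, tau_manual)  # both 0.6
```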

Goodman and Kruskal’s Rank Correlation

It measures the strength and direction of association between two ordinal features. It is based on two assumptions: first, that both features are measured on an ordinal scale, and second, that the features have a monotonic relationship.

It is useful when there are many ties between features. A tie occurs when two observations have the same value on a feature.

I have avoided the formula and its explanation for simplicity. You can check out the example to see how to calculate Goodman and Kruskal’s Rank Correlation coefficient.
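SciPy has no built-in Goodman and Kruskal’s gamma, so here is a minimal sketch (the function name `goodman_kruskal_gamma` and the sample ratings are my own) that applies the concordant/discordant definition, skipping tied pairs:

```python
def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), where tied pairs are excluded entirely."""
    C = D = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                C += 1
            elif s < 0:
                D += 1
            # tied pairs (s == 0) are simply skipped
    return (C - D) / (C + D)

# Hypothetical ordinal data with ties (e.g. survey ratings)
x = [1, 2, 2, 3, 4, 5]
y = [1, 1, 2, 3, 5, 5]
print(goodman_kruskal_gamma(x, y))  # → 1.0
```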

Somers’ Rank Correlation

It is a measure of association between an ordinal dependent variable and an ordinal independent variable. Somers’ rank correlation is based on the assumption that one feature is the ordinal dependent variable and the other the ordinal independent variable.

Somers’ rank correlation is appropriate when you want to distinguish between a dependent and an independent variable; Goodman and Kruskal’s gamma makes no such distinction between the two ordinal variables.
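SciPy (1.7+) provides `scipy.stats.somersd`; a minimal sketch on hypothetical ordinal data (the specific values are made up):

```python
from scipy.stats import somersd

# Hypothetical ordinal data: x independent, y dependent
x = [1, 2, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4, 5]

res = somersd(x, y)  # asymmetric: measures Somers' D of y given x
print(res.statistic)
```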

Wrapping up, we covered different correlation coefficients for parametric and non-parametric correlation between two features. In subsequent blog posts, I will cover testing methods such as the t-test, ANOVA, the chi-square test, etc., for more than two features. In the meantime, if you are interested in learning more about machine learning, you can check out my other blogs. Thanks :)
