# Understanding The Value Of Correlations In Data Science Projects

## Explore The Heart Of Data Science. It’s Crucial To Understand The Significance Of Calculating Correlations

Every successful data science project revolves around finding accurate correlations between the input and target variables. However, more often than not, we overlook how crucial correlation analysis is. It is recommended to perform correlation analysis both before and after the data gathering and transformation phases of a data science project.

This article focuses on the important role correlations play in data science projects and concentrates on real-world FinTech examples.

Lastly it explains how we can model the correlations the right way.

We Are Going To Explore The Heart Of Data Science. Understanding how important correlations are is crucial for every data scientist.

# Article Aim

I will explain the following three key areas:

- What is correlation?
- Why we need to understand correlations with real life examples
- How to calculate correlations in Python

# What Is Correlation?

Correlation is a statistical measure.

Correlation explains how two or more variables are related to each other. These variables can be the input data features that are used to forecast our target variable.

Two features (variables) can be positively correlated with each other. It means that when the value of one variable increases then the value of the other variable(s) also increases.

## Example Of Strong Positive Correlation

The trend line has a positive gradient.

Two features (variables) can be negatively correlated with each other. This occurs when the value of one variable increases and the value of other variable(s) decreases.

## Example Of Strong Negative Correlation

Two features might not have any relationship with each other. This happens when a change in the value of one variable has no impact on the value of the other variable.

## Example Of No Correlation
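The three cases described in this section can be reproduced with synthetic data. The sketch below is only an illustration: the variable names, the slope of 2 and the noise level are assumptions chosen to make the patterns easy to see, and NumPy's `corrcoef` is used to put a number on each relationship.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is repeatable
n = 5000

x = rng.normal(size=n)
noise = rng.normal(scale=0.5, size=n)

y_positive = 2.0 * x + noise       # tends to rise when x rises
y_negative = -2.0 * x + noise      # tends to fall when x rises
y_unrelated = rng.normal(size=n)   # generated independently of x


def pearson(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_positive = pearson(x, y_positive)
r_negative = pearson(x, y_negative)
r_none = pearson(x, y_unrelated)

print(f"positive case:  {r_positive:+.2f}")
print(f"negative case:  {r_negative:+.2f}")
print(f"unrelated case: {r_none:+.2f}")
```

The first two coefficients land close to +1 and -1, while the third stays close to 0.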

Correlation Is An Under-Rated Statistical Measure

# Let’s Understand How Important Correlations Are In The Real World

I am going to present ten real-world use cases that illustrate how important it is to understand, model and measure correlations accurately and on a timely basis.

It Is Extremely Important To Perform Correlation Analysis

# Real World Use Case 1

Let’s imagine you lend a large sum of money to a company named ABC for a year. ABC promises to give you your money back with interest in a year’s time. You are worried that ABC might default, and to protect yourself from that risk, you decide to buy insurance from an insurance company named XYZ.

Now let’s also assume that everyone who has lent money to ABC has also bought the insurance from the insurance company XYZ.

Can you see what will happen if ABC defaults?

If ABC defaults then everyone will reach out to XYZ and expect to get their money back from them. As a consequence, XYZ might default and you would lose your money.

This is because there is a **strong positive correlation** between the companies ABC and XYZ.

If we knew the correlation up-front, we would have bought insurance from another company and would have saved ourselves from losing the money!

**What I have just explained above is the idea behind a financial trade known as a credit default swap (CDS), and the risk is known as wrong-way risk (WWR), a form of correlation risk.**

# Real World Use Case 2

Sometimes we are trying to forecast a variable y, e.g. a stock price, and we spend a huge amount of time gathering data for the features x1 (company sales) and x2 (company revenue) that can help us forecast y. However, the two features might be strongly positively correlated with each other.

Thus it can suffice to gather data for only one of the features and feed it into our data science model. Not only can this save us effort in gathering and cleaning the data, it can also reduce the time it takes to train the model.

**Therefore it is crucial to model the correlation between the features as it can save us from wasting valuable time.**
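One way to act on this is to scan the correlation matrix of the features and drop one feature from every strongly correlated pair. The sketch below is a minimal illustration on made-up data: the column names, the helper `drop_correlated` and the 0.95 threshold are all assumptions, not a recommended default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 500
sales = rng.normal(loc=100.0, scale=10.0, size=n)

# Hypothetical feature table: revenue is almost a rescaled copy of sales.
features = pd.DataFrame({
    "sales": sales,
    "revenue": 1.2 * sales + rng.normal(scale=1.0, size=n),
    "headcount": rng.normal(loc=50.0, scale=5.0, size=n),
})


def drop_correlated(df, threshold=0.95):
    """Drop the later column of each pair whose absolute Pearson r exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


reduced, dropped = drop_correlated(features)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

Here `revenue` is removed because it carries almost the same information as `sales`. Which feature of a pair to keep is a modelling decision that the threshold alone cannot make.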

# Real World Use Case 3

## Do you know what happened during the 2007 financial crisis? Correlations played one of the biggest roles in it.

During the crisis, correlations across the global markets became strongly positive. As a result, assets across the world fell together.

During a recession, the correlations between assets can change completely.

The correlations for equity and senior tranches increased significantly. This meant that the losses in one tranche caused losses in the other tranche. It was not expected at all.

It’s important to model the correlation and calculate it on a continuous basis.
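Measuring correlation continuously can be sketched with a rolling window. The example below uses simulated returns, not market data: two hypothetical assets are independent in the first half of the sample and driven by a common shock in the second half. The window length of 60 observations is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n, half, window = 500, 250, 60

asset_a = rng.normal(size=n)
asset_b = rng.normal(size=n)
# Second half: asset_b follows asset_a closely, mimicking a stressed market.
asset_b[half:] = asset_a[half:] + 0.3 * rng.normal(size=half)

returns = pd.DataFrame({"asset_a": asset_a, "asset_b": asset_b})

# Pearson correlation recomputed over the most recent `window` observations.
rolling_corr = returns["asset_a"].rolling(window).corr(returns["asset_b"])

# Average only over windows that lie entirely inside one regime.
calm_mean = rolling_corr.iloc[window - 1:half].mean()
stress_mean = rolling_corr.iloc[half + window - 1:].mean()

print(f"average rolling correlation, calm regime:     {calm_mean:.2f}")
print(f"average rolling correlation, stressed regime: {stress_mean:.2f}")
```

A single correlation computed over the whole sample would blur the two regimes together, which is why the rolling estimate is the more useful monitoring tool here.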

# Real World Use Case 4

As the euro was devalued in 2012, US exporters experienced losses.

When US GDP was low, Asian and European exporters suffered losses because of the strong correlation between the markets.

It is apparent that knowing about macro-level correlations can help us make better investment decisions.

# Real World Use Case 5

Oil prices were very high during the Middle East uprising. As a consequence, airline travel decreased, which hit the tourism industry in the region badly.

When correlation is modelled accurately and measured frequently, it can help us plan better for unforeseen scenarios.

# Real World Use Case 6

The prices of commodities such as precious metals are negatively correlated with interest rates. When interest rates increase, commodity prices tend to decrease.

The measurement of correlation can help us cut the costs and increase the profits.

# Real World Use Case 7

The famous investment theory of Harry Markowitz revolves around calculating correlations to model the co-movements of assets. Traders have devised a number of correlation trading strategies, such as the quanto strategy. Successful investors and analysts always attempt to analyse the correlations.

A large number of financial institutions rely on the concept of correlations. We do not want to put all of our eggs in one basket, implying that we do not want to invest in all of those assets that co-move together in the same direction.
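The eggs-in-one-basket point can be made concrete with the standard two-asset portfolio variance formula, in which the correlation appears explicitly. The weights and volatilities below are made-up numbers, and `portfolio_volatility` is a helper written for this sketch.

```python
import math


def portfolio_volatility(w1, w2, vol1, vol2, rho):
    """Volatility of a two-asset portfolio with weights w1, w2 and correlation rho."""
    variance = (
        (w1 * vol1) ** 2
        + (w2 * vol2) ** 2
        + 2.0 * w1 * w2 * rho * vol1 * vol2
    )
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))


# Equal weights, asset volatilities of 20% and 30% (assumed for illustration).
correlations = (1.0, 0.5, 0.0, -0.5, -1.0)
vols = {rho: portfolio_volatility(0.5, 0.5, 0.20, 0.30, rho) for rho in correlations}

for rho, vol in vols.items():
    print(f"rho = {rho:+.1f} -> portfolio volatility = {vol:.4f}")
```

With perfectly correlated assets the portfolio volatility is simply the weighted average of the two volatilities (0.25 here), and it falls steadily as the correlation decreases, reaching 0.05 at a correlation of -1.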

# Real World Use Case 8

Risk management relies on the exercise of finding the covariance between the assets to model how the assets move with each other.

A large number of hedging strategies are dependent on finding the correlations between the trade and the hedged position.

Special trades have been designed that model the correlation risk, such as correlation swaps and correlation options.

# Real World Use Case 9

Value at Risk (VaR) is one of the key risk management tools. It helps us estimate the maximum loss over a holding period at a given confidence level. VaR can be calculated using the delta-normal approach, which is also known as the variance-covariance approach because it relies on finding the variances and covariances of the assets. Usually a covariance or correlation matrix is fed into the calculation.
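A minimal sketch of the variance-covariance calculation is shown below. It assumes normally distributed returns with zero mean over the holding period. The position sizes, daily volatilities and the correlation of 0.3 are invented for illustration, and `delta_normal_var` is a helper written for this example rather than a library function.

```python
import numpy as np
from scipy.stats import norm


def delta_normal_var(positions, vols, corr, confidence=0.99):
    """Delta-normal VaR for positions given in currency units.

    vols are return volatilities over the holding period and
    corr is the correlation matrix of the asset returns.
    """
    positions = np.asarray(positions, dtype=float)
    vols = np.asarray(vols, dtype=float)
    corr = np.asarray(corr, dtype=float)

    # Covariance matrix built from volatilities and correlations.
    cov = np.outer(vols, vols) * corr
    portfolio_std = np.sqrt(positions @ cov @ positions)

    # Scale by the normal quantile for the chosen confidence level.
    return float(norm.ppf(confidence) * portfolio_std)


positions = [1_000_000.0, 500_000.0]   # hypothetical holdings
daily_vols = [0.01, 0.02]              # hypothetical daily volatilities

var_low_corr = delta_normal_var(positions, daily_vols, [[1.0, 0.3], [0.3, 1.0]])
var_full_corr = delta_normal_var(positions, daily_vols, [[1.0, 1.0], [1.0, 1.0]])

print(f"1-day 99% VaR, correlation 0.3: {var_low_corr:,.0f}")
print(f"1-day 99% VaR, correlation 1.0: {var_full_corr:,.0f}")
```

The only input that differs between the two calls is the correlation, so the gap between the two figures is entirely the effect of the correlation assumption.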

The core of capturing risk in markets is dependent on finding accurate correlations.

# Real World Use Case 10

Lastly, I am going to touch upon an important use case. Bonds, interest rates, credit spreads, stock prices and their returns are often assumed to eventually revert to their mean value. Such variables are known as mean-reverting variables. Sometimes the variables are correlated with their own past values. Here, the correlation (auto-correlation) measures how strongly the current and past values are related to each other. Models such as ARCH and GARCH build on this idea by capturing the autocorrelation in the volatility of a series, and they have been used extensively in the data science world.
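Autocorrelation itself is straightforward to measure. The sketch below does not fit ARCH or GARCH. It simulates a simple mean-reverting AR(1) series, with a persistence parameter of 0.8 chosen purely for illustration, and compares its lag-1 autocorrelation with that of pure noise using the pandas `Series.autocorr` method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)
n, phi = 5000, 0.8   # phi is the assumed persistence of the simulated series

shocks = rng.normal(size=n)
values = np.zeros(n)
for t in range(1, n):
    # Each value is pulled back towards the mean of zero, plus a new shock.
    values[t] = phi * values[t - 1] + shocks[t]

series = pd.Series(values)
noise = pd.Series(shocks)

lag1_series = series.autocorr(lag=1)
lag1_noise = noise.autocorr(lag=1)

print(f"lag-1 autocorrelation, AR(1) series: {lag1_series:.2f}")
print(f"lag-1 autocorrelation, pure noise:   {lag1_noise:.2f}")
```

The simulated series shows a lag-1 autocorrelation close to the persistence parameter, while the noise series shows almost none.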

If A Successful Data Science Project Is Required To Be Implemented Then One Simply Cannot Ignore Correlations

# Now that we understand how important it is to measure correlation, let’s have a look at different techniques which can help us calculate correlation coefficients.

I am going to focus on the three popular correlation measures:

1. Pearson correlation measure

2. Spearman rank correlation measure

3. Kendall correlation measure

I will be explaining how to calculate each of them and what their limitations are.

# 1. Pearson Correlation

Pearson correlation measures the **linear** relationship between the variables. It is usually applied under the assumption that the variables are normally distributed.

The Pearson correlation is calculated by dividing the covariance of the two variables by the product of their standard deviations. Covariance measures how the two variables move with each other over time. As we divide the covariance by the standard deviations, we make the Pearson correlation unit-less and hence it is always between the values -1 and 1.

- The biggest limitation of Pearson correlation is that it assumes that the variables have a linear relationship between them. Many variables do not have linear relationships. For instance, financial assets often have a non-linear relationship between them.
- When the value of Pearson correlation is 0, it means that there is no linear relationship between the two variables. However, there could be a non-linear relationship between the variables. Hence the value 0 does not imply that the two variables are independent of each other.
- The variance of the variables is expected to be finite. This is not always the case, for instance when the distribution is a Student-t with two or fewer degrees of freedom.
- The Pearson correlation changes when we apply a non-linear transformation to the data. Often, in data science projects, we take the log of a variable to make its relationship with other variables more linear. A side effect is that the Pearson correlation also changes.
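Two of these limitations are easy to demonstrate on constructed data. In the sketch below, the first pair of variables is perfectly dependent but has a Pearson correlation of essentially zero, and the second pair shows how taking the log changes the coefficient. The data are synthetic and chosen only to make the effects visible.

```python
import numpy as np
from scipy.stats import pearsonr

# Case 1: y is fully determined by x, yet the relationship is not linear.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2
r_parabola, _ = pearsonr(x, y)

# Case 2: exponential growth, measured before and after a log transform.
t = np.linspace(1.0, 10.0, 100)
growth = np.exp(t)
r_raw, _ = pearsonr(t, growth)
r_log, _ = pearsonr(t, np.log(growth))

print(f"parabola:           r = {r_parabola:.3f}")
print(f"exponential, raw:   r = {r_raw:.3f}")
print(f"exponential, log:   r = {r_log:.3f}")
```

The parabola yields a coefficient of about zero even though `y` can be computed exactly from `x`, and the log transform moves the coefficient of the growth series up to 1.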

To compute Pearson correlation in Python:

`scipy.stats.pearsonr(variable1, variable2)`

*variable1 and variable2 can be arrays.*
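As a quick check of the definition given earlier, the sketch below computes the coefficient twice: once with `scipy.stats.pearsonr`, and once by hand as the covariance divided by the product of the standard deviations. The two small arrays are made-up sample values.

```python
import numpy as np
from scipy.stats import pearsonr

variable1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
variable2 = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# pearsonr returns the coefficient together with a p-value.
r_scipy, p_value = pearsonr(variable1, variable2)

# Manual version: sample covariance over the product of sample standard deviations.
covariance = np.cov(variable1, variable2, ddof=1)[0, 1]
r_manual = covariance / (np.std(variable1, ddof=1) * np.std(variable2, ddof=1))

print(f"scipy:  {r_scipy:.6f}")
print(f"manual: {r_manual:.6f}")
```

Both routes give the same number, which confirms that the library call is nothing more than the normalised covariance.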

# 2. Spearman Ranking Correlation

Sometimes the elements in our data sets have an order. This is particularly common in time series data. In those instances, we can calculate the Spearman rank correlation to find the relationship between the ranked variables.

There are three steps to calculate the Spearman rank correlation:

If there are two variables X and Y

1. Order the set pairs of variables X and Y with respect to the set X.

2. Determine the ranks for each time period i.

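3. Compute the difference of the ranks for each pair and square the difference. When there are no tied ranks, the coefficient is one minus six times the sum of the squared differences, divided by n(n² − 1), where n is the number of pairs.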

The correlation will be 1 for perfectly positively correlated variables, -1 implies that the variables have perfect negative correlation and 0 means that there is no correlation between the variables.

The variables are not required to have a normal distribution.

We can compute the Spearman ranking correlation in Python:

`scipy.stats.spearmanr(variable1, variable2)`

variable1 and variable2 can be arrays.
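The ranking steps can also be carried out by hand and compared with SciPy. The sketch below assumes there are no tied values, in which case the coefficient equals 1 − 6·Σd² / (n(n² − 1)), where d is the difference between the two ranks of a pair. The sample values are invented.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 1.0, 4.5, 3.0, 9.0, 7.5, 20.0, 50.0])

# Rank each variable separately (1 = smallest value).
rank_x = rankdata(x)
rank_y = rankdata(y)

# Squared rank differences, then the closed-form expression for untied data.
d_squared = (rank_x - rank_y) ** 2
n = len(x)
rho_manual = 1.0 - 6.0 * d_squared.sum() / (n * (n ** 2 - 1))

rho_scipy, p_value = spearmanr(x, y)

print(f"manual: {rho_manual:.6f}")
print(f"scipy:  {rho_scipy:.6f}")
```

Because only the ranks enter the calculation, the large values at the end of `y` have no more influence than any other observation.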

# 3. Kendall Correlation Measure

The last important correlation measure is the Kendall correlation, also known as Kendall Tau. It is a nonparametric measure that does not require any assumptions regarding the joint probability distribution of the variables.

Kendall Tau measures the correspondence between the two rankings. We can implement Kendall Tau in Python:

`scipy.stats.kendalltau(variable1, variable2)`
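To see what the correspondence between rankings means, the sketch below counts concordant and discordant pairs directly and compares the result with `scipy.stats.kendalltau`. It assumes there are no ties, and the two short lists are made-up values.

```python
from itertools import combinations

from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

concordant = 0
discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    direction = (x2 - x1) * (y2 - y1)
    if direction > 0:
        concordant += 1   # both variables move in the same direction
    elif direction < 0:
        discordant += 1   # the variables move in opposite directions

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

tau_scipy, p_value = kendalltau(x, y)

print(f"concordant = {concordant}, discordant = {discordant}")
print(f"manual tau = {tau_manual:.3f}, scipy tau = {tau_scipy:.3f}")
```

Of the 15 possible pairs, 12 are concordant and 3 are discordant, which gives a tau of 0.6.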

# Pandas Is Great

If you load your data into a Pandas dataframe then you can call a ready-made function in Pandas that calculates the correlation between every pair of variables for you.

`df = pd.DataFrame(..)`

`df.corr(method)`

The parameter method could be *{‘pearson’, ‘kendall’, ‘spearman’}*
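The sketch below builds a small dataframe of synthetic columns and computes the full correlation matrix with each of the three methods. The column names and the data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
n = 200
x = rng.normal(size=n)

df = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + rng.normal(scale=0.5, size=n),  # noisy linear link to x
    "cubic": x ** 3,                                    # monotonic but non-linear
    "noise": rng.normal(size=n),                        # unrelated to x
})

# One correlation matrix per method, keyed by the method name.
matrices = {
    method: df.corr(method=method)
    for method in ("pearson", "kendall", "spearman")
}

for method, matrix in matrices.items():
    print(f"\n{method}")
    print(matrix.round(2))
```

Comparing the `x` and `cubic` entries across the three matrices shows the difference between the measures: the rank-based methods report a perfect monotonic relationship, while Pearson reports a weaker one because the relationship is not linear.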


# Summary

This article explained what correlations are, how important they are and the significant role they play.

Finally it explained how we can compute them in Python.

Although correlation analysis is under-rated, we can see how important it is to measure correlation and use it wisely in our data science projects.

Hope it helps.