Introduction to Correlation in Data Analysis

Piyush Kumar
Published in Analytics Vidhya
Apr 15, 2020

“Let’s perform some analysis and visualization on data.”

What is correlation?

Correlation is a statistical technique for measuring whether two variables are related and, if so, to what extent. In other words, it describes how one variable tends to change when the other variable changes.

Let’s take an example and understand its graph.

The dataset used for this analysis is the COVID-19 dataset published by JHU CSSE.

We can calculate the correlation between columns of type int64 or float64 using the corr() method, and display the correlation of all variables as a heatmap:

df.corr().style.background_gradient(cmap='Reds')
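To make the output of corr() concrete, here is a minimal sketch on a small made-up DataFrame. The name toy and its values are hypothetical and are not taken from the JHU CSSE data. It selects the numeric columns explicitly first, since a text column cannot take part in the calculation.

```python
import pandas as pd

# Hypothetical toy data (NOT the JHU CSSE dataset), used only to show the call.
toy = pd.DataFrame({
    "Confirmed": [10, 20, 30, 40, 50],
    "Deaths": [1, 2, 3, 4, 5],
    "Country": ["A", "B", "C", "D", "E"],  # non-numeric column
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr_matrix = toy.select_dtypes(include="number").corr()
print(corr_matrix)
```

Every value on the diagonal is 1 because each column is perfectly correlated with itself. Here Deaths is an exact multiple of Confirmed, so the off-diagonal value is also 1 (up to floating-point rounding).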

Now, let’s find the correlation between the following columns and plot it as a heatmap:
Confirmed, Deaths, Recovered, Active.

df[["Active", "Confirmed", "Recovered", "Deaths"]].corr().style.background_gradient(cmap='Reds')

Now we’ll visualize two of these variables, Active and Confirmed, using a scatter plot with a fitted straight line, called the regression line, which indicates the relationship between the two variables.

import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x="Active", y="Confirmed", data=df_covid)
plt.ylim(0,)

As you can see, the straight line through the data points slopes upward, which shows that there’s a positive linear relationship between the two variables. How closely the points cluster around the line, rather than how steep it is, indicates how strong that relationship is.
If the line sloped downward, there would be a negative linear relationship between the two variables.
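As a rough illustration of how the direction of the line relates to the sign of the relationship, the sketch below fits a least-squares line by hand on made-up numbers. The helper name fit_slope and the data are hypothetical and only serve the example.

```python
def fit_slope(xs, ys):
    """Slope of the least-squares straight line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # y grows as x grows
y_down = [10, 8, 6, 4, 2]  # y shrinks as x grows

slope_up = fit_slope(x, y_up)
slope_down = fit_slope(x, y_down)
print(slope_up)    # 2.0  -> line goes up, positive relationship
print(slope_down)  # -2.0 -> line goes down, negative relationship
```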

Correlation-Statistics

One way to measure the strength of the correlation between continuous numerical variables is by using a method called Pearson correlation.

Pearson Correlation: It measures the strength of the linear relationship between two continuous numerical variables.
It gives two values:
1. Correlation coefficient
2. P-value

Let’s talk about them and draw some conclusions:

Correlation-Coefficient: It gives the direction and strength of the linear relationship between the variables.
If the value is:

close to 1 → Large positive relationship
close to -1 → Large negative relationship
close to 0 → No linear relationship
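The three cases above can be reproduced with a small sketch that computes the Pearson coefficient directly from its definition (the covariance of the two variables divided by the product of their spreads). The function name pearson_r and the sample lists are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # y rises with x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # y falls as x rises
r_none = pearson_r(x, [3, 1, 2, 1, 3])   # no linear trend

print(r_pos)   # 1.0
print(r_neg)   # -1.0
print(r_none)  # 0.0
```

Real data rarely lands exactly on 1, -1, or 0, so in practice you look at how close the coefficient is to each of these.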

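P-Value: It tells us how certain we can be that the correlation we calculated is statistically significant, that is, not just the result of chance. The smaller the value, the more certain we are.
Reading from top to bottom, if the value is:

P-Value < 0.001 → Strong certainty in the result.
P-Value < 0.05 → Moderate certainty in the result.
P-Value < 0.1 → Weak certainty in the result.
P-Value > 0.1 → No certainty in the result.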
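One way to apply these thresholds in code is a small helper that checks them from the strictest to the loosest. This is only a sketch: the function name certainty_label and the returned labels are made up here, and the cut-offs are the rule-of-thumb values listed above.

```python
def certainty_label(p_value):
    """Map a p-value to the rule-of-thumb certainty levels listed above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"

print(certainty_label(0.0004))  # strong
print(certainty_label(0.03))    # moderate
print(certainty_label(0.07))    # weak
print(certainty_label(0.4))     # none
```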

Strong Correlation:

→ Correlation coefficient close to 1 or -1
→ P-value less than 0.001

The following plot shows data with different correlation values.
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['Active'], df['Confirmed'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
# Keep only the two columns of interest
df_2 = df_covid[["Active", "Confirmed"]]
f, ax = plt.subplots(figsize=(7,7))
plt.xlabel("Active")
plt.ylabel("Confirmed")
plt.title("Heatmap b/w Active and Confirmed")
# pcolor colors each cell of df_2 by its raw value
plt.pcolor(df_2, cmap = 'RdBu')
plt.show()

Conclusion

In this article, we covered a basic introduction to correlation in data analysis. I hope you have learned something from it.
Stay tuned for more updates. If you have any questions, the comment section is all yours, and I’ll try my best to answer them. If you liked this article and learned something from it, do leave a clap.

Thank you.
