Beginners Guide to Pairwise Correlation — Pearson Correlation Coefficient — Part 2

Lewis Fantom
6 min read · Jun 4, 2023


In the previous article we tackled the basics of pairwise correlation and the Pearson correlation coefficient, and you can read that article HERE. In this article we will do the fun stuff and create some pairwise correlation heatmaps using Python.

Brief reminder of Pairwise Correlation

Pairwise correlation is the statistical assessment of the relationship between two variables. It quantifies that relationship using a measure such as Pearson's, Spearman's, or Kendall's coefficient, which tells you how changes in one variable relate to changes in another. This helps data analysts spot patterns and dependencies across the pairs of variables in a dataset, and it can flag relationships worth investigating further, although correlation on its own does not establish cause and effect.

For this analysis we will be using Pearson's coefficient. An explanation of it can be found in the previous article.
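
As a quick refresher, here is a minimal sketch of what Pearson's coefficient computes: the co-variation of two variables divided by the product of their individual spreads. The toy values for x and y are invented for illustration and are not taken from the article's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's r: how much x and y vary together, scaled by how much each varies alone.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (Pearson is its default method).
r_pandas = pd.Series(x).corr(pd.Series(y))

print(round(r_manual, 4), round(r_pandas, 4))
```

Both numbers should match, which is a handy sanity check that corr() is doing what you expect.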

The Data Set

Having your data in the correct format is key to a pairwise analysis. You typically need a dataset that contains multiple variables or features, and each variable should have numerical values that can be used to calculate correlations.

The dataset should be structured as a table or a matrix, where each row represents an observation or data point, and each column represents a different variable or feature.
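
To make that layout concrete, here is a small sketch of a table with one row per observation and one column per variable. The column names and values are hypothetical. It also shows one way to keep only the numeric columns, on the assumption that your file might contain text columns too.

```python
import pandas as pd

# Hypothetical observations; column names and values are invented for illustration.
df_example = pd.DataFrame({
    "page_views": [3, 7, 2, 9, 5],
    "clicks": [1, 4, 0, 6, 2],
    "device": ["mobile", "desktop", "mobile", "tablet", "desktop"],  # not numeric
})

# Keep only the numeric columns, since text columns cannot be correlated directly.
numeric_df = df_example.select_dtypes(include="number")

print(df_example.dtypes)
print(numeric_df.columns.tolist())
```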

The Code

For this analysis I will be using Google Colab.

Here is a Google Colab file that will help you along your way!

Load the Data

from google.colab import files
uploaded = files.upload()

Pull the data into Google Colab

Replace table_name with the name of your file.

import pandas as pd
df = pd.read_csv('table_name.csv', header=0)
print(df.head())
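
If you would rather not type the file name by hand, one option is to read straight from the object that files.upload() returns, which in Colab is a dictionary of file names and file contents (as bytes). The sketch below uses a hypothetical stand-in for that dictionary so it also runs outside Colab, and df_uploaded is just an illustrative name.

```python
import io
import pandas as pd

# In Colab, files.upload() returns a dict of {file name: file contents as bytes}.
# The dict below is a hypothetical stand-in so the sketch runs outside Colab.
uploaded = {"table_name.csv": b"feature_a,feature_b\n1,2\n3,4\n5,7\n"}

# Take the first (here: only) uploaded file and parse its bytes as CSV.
file_name = next(iter(uploaded))
df_uploaded = pd.read_csv(io.BytesIO(uploaded[file_name]))

print(file_name)
print(df_uploaded.head())
```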

Pairwise Correlation Heatmap

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,5))
c = df.corr()
sns.heatmap(c, cmap="coolwarm", annot=True)

Code explanation for beginners to Python:

plt.figure(figsize=(10,5))

This line creates a new figure object using matplotlib.pyplot. The figsize parameter is used to specify the width and height of the figure in inches. In this case, the figure size is set to 10 inches in width and 5 inches in height.

c = df.corr()

This line calculates the correlation matrix of the DataFrame df using the corr() method provided by pandas. The correlation matrix is a square matrix that contains the pairwise correlations between the columns of the DataFrame.
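
To see what that matrix looks like before it is drawn, here is a small sketch on made-up data (df_demo and its values are hypothetical). It also shows the method argument of corr(), which lets you switch from the default Pearson coefficient to Spearman.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 1, 4, 3, 6, 5],
    "c": [6, 5, 4, 3, 2, 1],
})

pearson = df_demo.corr()                     # default: method="pearson"
spearman = df_demo.corr(method="spearman")   # rank-based alternative

print(pearson.round(2))
print(pearson.shape)   # one row and one column per variable
```

Here column c falls as column a rises, so their cell in the matrix is -1.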

sns.heatmap(c, cmap="coolwarm", annot=True)

This line creates a heatmap using the seaborn library.

  • The heatmap() function generates a color-coded representation of the correlation matrix c.
  • The first argument, c, is the correlation matrix to plot.
  • The cmap parameter sets the colormap to "coolwarm", which shades low values blue and high values red.
  • annot=True adds numerical annotations to each cell of the heatmap, displaying the correlation values.

And that's it really. Simple, right?

Real World Example

Challenge: Provide valuable insights and help in understanding the relationships between different user behaviours across the website.

Why would you want to do this analysis? (I will focus on two reasons, but there are a few more.)

  1. Identifying User Behavior Patterns: Correlation analysis can help identify patterns and associations between different user actions on a website. By examining correlations, you can understand how certain actions or behaviors tend to occur together.
  2. Informing Experience Optimization Strategies: Understanding the correlations between user actions and conversion metrics can be valuable for experience optimization. By analyzing correlations, you can identify the key actions or behaviors that are strongly associated with higher conversion rates. This insight can guide the design of user flows, call-to-action placements, or content optimization strategies to drive conversions.

Data Set

The following data set is a table of User Actions on a website. (The numbers are theoretical)

For this example I will only be using 20 observations (rows), but it is always preferable to use as many observations as is sensibly possible. More observations increase statistical power, improve precision, and make it easier to detect small effects.
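
To get a feel for why the number of rows matters, here is a rough simulation that is not part of the original analysis. It repeatedly generates two unrelated variables and measures how much Pearson's r bounces around purely by chance. The function name, seed, and trial count are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def spread_of_r(n_rows, n_trials=500):
    """Standard deviation of Pearson's r across trials of two unrelated variables."""
    rs = []
    for _ in range(n_trials):
        a = rng.normal(size=n_rows)
        b = rng.normal(size=n_rows)  # generated independently of a
        rs.append(np.corrcoef(a, b)[0, 1])
    return float(np.std(rs))

spread_small = spread_of_r(20)
spread_large = spread_of_r(2000)
print(round(spread_small, 3), round(spread_large, 3))
```

With only 20 rows, the coefficient for unrelated variables wanders much further from zero than it does with 2,000 rows, which is why small data sets need cautious interpretation.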

Step 1 — Choose .csv file to pull into Google Colab.

from google.colab import files
uploaded = files.upload()

Step 2 — Read the table and view the first 5 observations (the default for head()) to confirm upload

import pandas as pd
df = pd.read_csv('website_user_behaviour_features.csv', header=0)
print(df.head())

Step 3 — Perform the Correlation analysis

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,5))
c = df.corr()
sns.heatmap(c, cmap="coolwarm", annot=True)

Voila! One pairwise correlation heatmap using Pearson's coefficient.

A few observations

Two things you notice when you look at the heatmap are:

  1. The red diagonal blocks with the r=1 value.
  2. The symmetry of the heatmap.

The red diagonal blocks show each feature evaluated against itself, which always gives r=1. The symmetry comes from the nature of pairwise correlation. Because the result is a matrix, every pair of features appears twice (once as row against column and once as column against row), giving you the same results on each side of the red diagonal in the heatmap.

We can remove one half of the heatmap by adding an extra line of code.

The mask parameter of sns.heatmap() is optional and can be used to hide certain cells in the heatmap: any cell where the mask is True is not drawn.

Note that the colour distribution will alter slightly, as the red blocks have now been removed.

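import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 5))
c = df.corr()
# Assumed mask definition (a common seaborn idiom): True hides a cell,
# so this hides the diagonal and everything above it.
mask = np.triu(np.ones_like(c, dtype=bool))
sns.heatmap(c, mask=mask, cmap="coolwarm", annot=True)
plt.show()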
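
If you want to check what a mask like this hides before plotting, the sketch below builds one on made-up data (df_demo is hypothetical) and prints the cells that would remain visible. It assumes the mask should cover the diagonal and the upper triangle, which matches the red blocks disappearing.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 1, 4, 3, 6, 5],
    "c": [6, 5, 4, 3, 2, 1],
})
c = df_demo.corr()

# True marks a cell to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(c, dtype=bool))

# The matrix is symmetric, so the hidden half repeats the visible half.
is_symmetric = np.allclose(c.values, c.values.T)

# Blank out the masked cells to preview what the heatmap will display.
visible = c.where(~mask)
print(mask)
print(visible.round(2))
```

Because the matrix is symmetric, no information is lost by hiding one half.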

Interpreting the Results

Focus on navigation_click in the first column of the heatmap. This feature has a weak positive correlation with call_me_back and contact_form_submission, a weak negative correlation with home_page_cta, and a correlation with product_click that verges on moderate negative.
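
Reading values off a heatmap gets tedious with many features, so here is a sketch of one way to list each pair once, strongest first, with a plain-English label. The cut-offs of 0.3 and 0.7 are an assumption (conventions vary, so use whichever thresholds you prefer), and the matrix values and the describe_strength helper are invented for illustration.

```python
import numpy as np
import pandas as pd

def describe_strength(r):
    """Label a correlation using assumed cut-offs (0.3 and 0.7); conventions vary."""
    size = abs(r)
    if size < 0.3:
        label = "Weak"
    elif size < 0.7:
        label = "Moderate"
    else:
        label = "Strong"
    direction = "Positive" if r >= 0 else "Negative"
    return f"{label} {direction}"

# Hypothetical correlation matrix; the values are invented for illustration.
names = ["x", "y", "z"]
c = pd.DataFrame([[1.0, 0.2, -0.5], [0.2, 1.0, 0.8], [-0.5, 0.8, 1.0]],
                 index=names, columns=names)

# Keep each pair once (cells below the diagonal), then sort by absolute strength.
lower = np.tril(np.ones_like(c, dtype=bool), k=-1)
pairs = c.where(lower).stack().dropna()
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

for (row, col), r in pairs.items():
    print(row, col, r, describe_strength(r))
```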

Things to know when advancing further on your Correlation Analysis Journey

  1. Correlation does not imply causation: A high correlation between two variables does not necessarily mean that one variable causes the other. Correlation simply measures the strength and direction of the relationship between variables but does not provide information about the underlying cause.
  2. Statistical significance: This is missing from what we have done so far. It is essential to consider whether an observed correlation is statistically significant or could plausibly have occurred by chance. In statistics we use something called a p-value to judge this when reading correlation results.
  3. Outliers: Outliers can have a significant impact on correlation results. It is important to examine the data for any extreme values that might be influencing the correlation. Removing outliers can be useful in such cases.
  4. Nonlinear relationships: Pearson's correlation measures the linear relationship between variables. If the relationship is nonlinear, the coefficient may not fully capture the nature of the association. Exploring alternative correlation measures (such as Spearman's) or considering nonlinear regression models may be necessary in such situations.
  5. Context and domain knowledge: Understanding the context and having domain knowledge is crucial when interpreting correlation results. Consider the variables involved, their relevance to the problem, and any underlying factors that may influence the relationship.
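
As a starting point for points 2 and 4, here is a minimal sketch using scipy.stats, which was not used earlier in this article. The data is randomly generated with a fixed seed, so the numbers mean nothing beyond illustration. The first part gets a p-value alongside Pearson's r, and the second compares Pearson with Spearman on a curved relationship.

```python
import numpy as np
from scipy import stats

# Hypothetical data, generated with a fixed seed purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 0.5 * x + rng.normal(size=20)

# pearsonr returns the coefficient and a two-sided p-value.
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r = {r:.2f}, p-value = {p_value:.3f}")

# A monotonic but nonlinear relationship: Spearman works on ranks, so it can
# pick up what Pearson understates.
x_curve = np.arange(1, 21)
y_curve = np.exp(x_curve / 2)
r_pearson, _ = stats.pearsonr(x_curve, y_curve)
r_spearman, _ = stats.spearmanr(x_curve, y_curve)
print(f"Pearson = {r_pearson:.2f}, Spearman = {r_spearman:.2f}")
```

A small p-value suggests the correlation is unlikely to be a fluke of the sample, but it still says nothing about cause and effect.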

Disclaimer

I do not consider myself to be a Data Scientist; I am simply a Data Analyst on a journey to learn. The best way to learn is to teach others. That is my mission: "Self learning by teaching others along the way". If anything in this article is incorrect or you think there is more to learn and improve, your feedback is greatly appreciated.
