How to Evaluate Relatedness Between Categorical Variables Using the Seaborn Library

Published in Omics Diary

5 min read · Nov 2, 2021

Correlations are simple to evaluate between numeric variables using scatterplots, but how about categorical variables?

Scatterplots are great visualisation tools to assess relationships and associations between numeric or continuous variables. However, using data points to evaluate categorical variables may not be as straightforward.

Consider a common scenario where a researcher wants to find out in a microarray (containing ~20,000 transcripts) whether experimental condition A elicits the same gene expression profile as condition B. Plotting strip plots or box plots to visualise all gene expression differences and trends will be challenging as there is a large number of data points.

As discussed previously, clustergrams or heatmaps could be another alternative to visualise gene expression differences. However, these charts do not provide statistics to measure if the trends in gene expression differences are similar or different.

To get around these limitations, correlation matrices and pair plots can be used, both of which can be plotted with the Seaborn library. If working from raw values, you will need to normalise your data by calculating log2–transformed fold-change (log2FC) values with respect to control/placebo for static data. If data is temporal, then the log2FC can be calculated with respect to time = 0 (baseline).
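As a minimal sketch of the normalisation step described above, the snippet below computes log2FC values relative to a reference column. The table, the column names and the helper `log2_fold_change` are all made up for illustration and are not part of the dataset used later in this entry.

```python
import numpy as np
import pandas as pd

# Hypothetical raw expression values (linear scale) for three genes.
# Column names and numbers are invented for illustration only.
raw = pd.DataFrame(
    {
        "control": [100.0, 50.0, 200.0],
        "condition_A": [200.0, 50.0, 50.0],
        "condition_B": [400.0, 25.0, 100.0],
    },
    index=["gene_1", "gene_2", "gene_3"],
)


def log2_fold_change(values: pd.DataFrame, reference: str) -> pd.DataFrame:
    """Return log2(condition / reference) for every non-reference column."""
    # Divide each condition column by the reference column, row by row
    ratios = values.drop(columns=reference).div(values[reference], axis=0)
    return np.log2(ratios).add_prefix("log2FC_")


log2fc = log2_fold_change(raw, reference="control")
print(log2fc)
```

For temporal data, the same helper could be pointed at the time = 0 column instead of the control column.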

To plot correlation matrices and pair plots using Python, we first load the required packages. In this blog entry, we will be using the Seaborn and Matplotlib libraries:

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Similar to my previous blog entries, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Merck Ad5/HIV vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed with Partek Genomics Suite, can be found on GitHub. The fold change, ratio, p-value and adjusted p-value (q-value) are calculated with respect to baseline (timepoint = 0).

In this specific blog entry, we will analyse the correlation (or relatedness) between the different time points after Merck Ad5/HIV vaccination.

We will load and inspect the processed dataframe from GitHub. It is important to label the gene column as the index column for reference. The commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

As log2FC values tend to be approximately normally distributed, they are the most suitable values for evaluating correlation between the categorical variables (here, the time points). We will thus calculate the log2FC values from the ratios and create a new dataframe (df_log2FC) that contains only the log2FC values for the respective time points.

df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])
# Keep only the log2FC columns for the correlation analysis
df_log2FC = df.filter(items=['log2FC_6h', 'log2FC_1d', 'log2FC_3d', 'log2FC_7d'])
df_log2FC

Output file is thus as follows:

To tabulate the correlation coefficient between the different time-points, the code is as follows:

corr = df_log2FC.corr()
corr

The output showing the correlation coefficients is as follows:

The data suggest that the gene signature at day 1 is most similar to that at day 3. Interestingly, the signature at 6 hours post-vaccination is also similar to that at day 7.
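When there are more than a handful of columns, reading the most similar pairs off a correlation table by eye gets tedious. The sketch below ranks each pair once, from highest to lowest coefficient. It runs on randomly generated numbers (`demo` is a synthetic stand-in for df_log2FC, not the study data), so the ranking it prints says nothing about the vaccine dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_log2FC; values are random, not from the study.
rng = np.random.default_rng(0)
shared = rng.normal(size=500)  # signal shared by two of the columns
demo = pd.DataFrame(
    {
        "log2FC_6h": rng.normal(size=500),
        "log2FC_1d": shared + rng.normal(scale=0.5, size=500),
        "log2FC_3d": shared + rng.normal(scale=0.5, size=500),
        "log2FC_7d": rng.normal(size=500),
    }
)

corr_demo = demo.corr()  # Pearson correlation by default

# Keep only cells strictly below the diagonal so each pair appears once
lower = np.tril(np.ones(corr_demo.shape, dtype=bool), k=-1)
pairs = (
    corr_demo.where(lower)  # cells outside the lower triangle become NaN
    .stack()
    .dropna()
    .sort_values(ascending=False)
)
print(pairs)
```

The same three lines after `corr_demo` should work on the `corr` table computed above, since it is an ordinary pandas dataframe.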

Next, we can evaluate the p-value of each correlation to test its significance. We import pearsonr from SciPy and execute the commands as follows:

from scipy.stats import pearsonr
# pearsonr returns (coefficient, p-value); index [1] keeps the p-value for each pair of columns
pvals = pd.DataFrame([[pearsonr(df_log2FC[c], df_log2FC[y])[1] for y in df_log2FC.columns] for c in df_log2FC.columns],
                     columns=df_log2FC.columns, index=df_log2FC.columns)
pvals

Output is as such:

Note that all the correlations are significant, probably because of the large number of data points considered in the statistical analysis. The p-value should therefore be read together with the size of the correlation coefficient.
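To illustrate why a large number of data points tends to make correlations significant, here is a small sketch on simulated data. The numbers are random and the weak relationship (a slope of 0.05) is an arbitrary choice for the demonstration, not something taken from the study.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 20_000  # roughly the number of transcripts on a microarray
x = rng.normal(size=n)
y = 0.05 * x + rng.normal(size=n)  # deliberately weak linear relationship

r, p = pearsonr(x, y)
# r**2 is the fraction of variance in y explained by x
print(f"r = {r:.3f}, r^2 = {r**2:.4f}, p = {p:.2e}")
```

With this many points, even a coefficient close to zero can come with a very small p-value, which is why the coefficient itself is usually the more informative number here.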

To visualise these correlation coefficients in a correlation matrix, we can use the following commands:

# Boolean mask that hides the upper triangle and the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))
with sns.axes_style('white'):
    f, ax = plt.subplots(figsize=(10, 7))
    ax = sns.heatmap(corr, cmap='vlag', mask=mask, center=0, square=True, linewidths=2, annot=True, cbar_kws={'shrink': .5})

I will briefly describe the commands above. The commands mask the top half of the correlation matrix and the diagonal (the correlation of each variable with itself), so that users can concentrate on the comparisons in the lower half of the plot. I have also defined the figure size and the colour map (ranging from blue to red, where blue is a negative correlation and red is a positive correlation), and centred the correlation values at 0 (white). The annotations (annot) print the correlation coefficient values in the squares, and the lines (linewidths) separate the squares more clearly.
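To make the masking step more concrete, the short sketch below builds the same kind of mask for a 4 x 4 matrix and prints it. The variable names `n_vars` and `mask_demo` are only for this illustration. Cells marked True are the ones the heatmap hides.

```python
import numpy as np

n_vars = 4  # four time points: 6 hours, day 1, day 3, day 7

# True on and above the diagonal, False below it
mask_demo = np.triu(np.ones((n_vars, n_vars), dtype=bool))
print(mask_demo)

hidden = int(mask_demo.sum())    # diagonal plus upper triangle
shown = mask_demo.size - hidden  # unique pairwise comparisons left visible
print(hidden, shown)
```

For four variables this leaves six visible squares, one for each unique pair of time points.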

The output file for the correlation matrix is displayed below:

To allow us to see the points that make up the correlation matrix, we can use the commands as follows to plot a pair plot:

g = sns.pairplot(df_log2FC)
g.map_lower(sns.regplot)

Note that the lower half of the pair plots will contain the regression plot for us to visualise the trend and slopes more clearly. This is particularly important in this case as there are a large number of data points. Output file is as follows:

For easy referencing, the full code is as follows:

And there you have it. Thanks for reading.

Kuan Rong Chan

Written by Kuan Rong Chan, Ph.D.