How to use the Seaborn library to compare categorical variables

Kuan Rong Chan, Ph.D.
Omics Diary
Published in
6 min readOct 29, 2021

--

Categorical variables are often analysed with bar charts, box plots, violin plots or strip plots. While these data visualisation tools provide precise measurements with standard deviation, showing data trends for a large number of genes and proteins can be potentially challenging. In such circumstances, a heatmap or a clustergram may be more appropriate as the expression patterns across multiple genes or proteins can be more easily visualised. Unsupervised hierachical clustering can also allow direct visualisation of clusters without the need for dimensionality reduction.

In heatmaps or clustergrams, higher expression or abundance of genes can be depicted by using the colour red, and a greater increase in expression can be denoted with darker intensity. On the other hand, lower expression or abundance can be depicted by using the colour blue (better choice than green, as blue is more colour-blind friendly), with a stronger intensity indicating lower abundance.

Although plotting raw expression values in a heatmap is acceptable if the relative abundance between genes and proteins are similar, data normalisation is often recommended as the absolute abundance can vary across genes and proteins. Let me provide you with a case study to illustrate this point:

For instance, consider you want to plot expression levels of gene A and gene B in a single heatmap. Gene A has a baseline expression of 5 copies per cell whereas gene B has a much higher baseline expression of 50 copies per cell. Let’s assume the condition where a treatment condition reduces gene A expression to 2.5 copies/cell but increases gene B expression by 2 fold to 100 copies/cell. In this case, due to the high baseline abundance of gene B, plotting the heatmap based on absolute gene abundance will mask the ability to visualise differential expression of gene A after viral infection. To display the relative expression of both genes A and B in a single heatmap or clustergram, data normalisation is thus recommended.

There are two common strategies to for data normalisation. One way is to plot the heatmap or clustergram based on log2 fold-change instead of absolute abundance. The comparisons can be performed against time-point = 0 (baseline) for temporal studies, or against a control/placebo experiment (for static studies). However, the disadvantage of using this method is that the distribution of baseline or controls cannot be easily visualised.

Another alternative is to perform a Z-score transformation and plot the Z-scores in the heatmap. The Z-score transformation converts each gene or protein across all conditions with mean = 0 and standard deviation = 1, enabling users to easily compare expression values across multiple genes and protein. Values that are above 0 means the gene expression level is above mean expression, whereas values below 0 means that the gene expression level is below mean expression. However, the limitation is that Z-scores can only be applied for experiments that are obtained within a single experiment. Hence, the choice of either method for data normalisation is highly dependent on the research question and objectives.

Before jumping straight into plotting heatmap or clustergrams for your data, it is generally a good idea to first filter the dataset, especially when you have a large number of genes. The reason is because in most scenarios, the large majority of the genes or proteins remain unchanged, which will consequently impact the ability for the unsupervised hierarchical clustering to separate gene expression profiles based on your experimental conditions. To circumvent this limitation, you can choose to filter the dataset based on differentially expressed genes or enriched biological pathways before plot the heatmaps or clustergrams.

To plot heatmaps and clustergrams using Python, we first load the required packages. In this blog entry, we will be using the Seaborn library to plot the heatmaps and clustergrams:

import numpy as np
import pandas as pd
import seaborn as sns

Similar to my previous blog entries, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).

We will load and inspect the processed dataframe from GitHub. It is important to label the gene column as the index column so that gene names can be referred to in the clustergram or heatmap. The commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

As described above, it is important to normalise the dataset to ensure that the relative expression is comparable between different genes. Here, we will use log2 fold-change for normalisation and create log2 fold-change (log2FC) columns in the dataframe:

df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])

There are a total of 17,623 genes that are measured. To visualise the comparisons between time-points better, we will filter the dataset. Since we have previously ascertained that day 1 has the most differentially expressed genes (DEGs), we could filter the dataset based on upregulated DEGs (with fold-change > 1.5, adjusted p-value < 0.05). The filtered dataframe is saved under DEGs_up_1d. Since we are interested in plotting the log2-fold change values, we will select the log2FC columns and remove all other columns. The code is as follows:

DEGs_up_1d = df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]
DEGs_up_1d = DEGs_up_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d'])

Output is as follows:

There are a total of 826 upregulated DEGs, of which their expression will be visualised using a clustergram. To plot the clustergram, the codes are as follows:

from mpl_toolkits.axes_grid1.axes_divider import make_axes_locatable
from mpl_toolkits.axes_grid1.colorbar import colorbar
g = sns.clustermap(DEGs_up_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)

There are a few Seaborn settings that are displayed in the code, and the documentation can be found here. The colour map to be ‘vlag’ was chosen as it would allow us to have a heatmap where increased expression is red and reduced expression is blue. Note that I have also assigned maximum and minimum value of 2 and -2 respectively, as I wanted to ensure that log2FC = 0 is white (to signify no change). yticklabels = False was chosen because it is near impossible to see all 826 gene names in the clustergram. The output is as shown:

As expected, the day 1 signatures are the most distinct compared to other time-points. The 6 hour and day 7 signatures are clustered together, showing that these time-points have little or no changes in gene expression profile. Interestingly, some of the DEGs have prolonged expression up to day 3, while others resolve very quickly. Do we see the same trends for the downregulated DEGs? Let’s test it out with the following command:

DEGs_down_1d = df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]
DEGs_down_1d = DEGs_down_1d.filter(items=['log2FC_6h','log2FC_1d', 'log2FC_3d', 'log2FC_7d'])
g = sns.clustermap(DEGs_down_1d, cmap='vlag', method='average', vmin=-2, vmax=2, yticklabels=False)

Similar patterns can be seen. However, unlike upregulated DEGs where some DEGs persisted to day 3, most of the downregulated DEGs returned to baseline levels at day 3.

For easy referencing the full set of codes are as follows:

And there you have it. Thanks for reading.

More content at plainenglish.io

--

--

Kuan Rong Chan, Ph.D.
Omics Diary

Kuan Rong Chan, PhD, Senior Principal Research Scientist in Duke-NUS Medical School. Virologist | Data Scientist | Loves mahjong | Website: kuanrongchan.com