Data Visualisation Tips to Make Your Work Stand Out

Ankur Salunke · Published in The Startup · Sep 17, 2020

Techniques to help you differentiate your data visualisations from the crowd

During the exploratory data analysis phase of a data science project, we create various visualisations to study the data and to present it pictorially so that our audience can understand it quickly. In Python we have various libraries, such as Matplotlib, Seaborn, and Plotly, that help us do this.

We will not go into the basics of these libraries; instead, we will discuss some techniques that build on the standard plots we use for exploratory data analysis.

Truncated Correlation Matrix Heatmap

Isn't it annoying to have the mirrored duplicate half in a correlation matrix? And to make things worse, there is the diagonal, where every correlation is 1.

Normally, this is how we create a correlation matrix heatmap using Seaborn.

sns.heatmap(dataset.corr(), cmap="coolwarm", vmin=-1, annot=True)
Normal correlation matrix heatmap

There is a way to get rid of the repetitive half of the heatmap, including the diagonal. The numpy triu function returns a copy of a matrix with all the elements below the diagonal zeroed when we don't pass any additional parameter. Seaborn hides the cells where the mask is non-zero, so the upper triangle and the diagonal disappear.

corr_matrix = np.triu(dataset.corr())
sns.heatmap(dataset.corr(), cmap="coolwarm", mask=corr_matrix, vmin=-1, annot=True)
Heatmap after using numpy triu

If we use the numpy tril function instead, the lower half of the heatmap is masked and the upper triangle is shown.
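
As a minimal variation of the snippet above, this is what the tril version would look like:

# Mask the lower triangle (and the diagonal) instead, leaving the upper half visible.
corr_matrix = np.tril(dataset.corr())
sns.heatmap(dataset.corr(), cmap="coolwarm", mask=corr_matrix, vmin=-1, annot=True)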

Checking for Normal distribution

For machine learning problems, we look for normality in the distribution of a continuous feature. When we plot the distribution of a feature using Seaborn, we get a plot like the one below.

sns.distplot(data["chol"])
Distribution Plot

We don't have to visually estimate how close this is to a normal distribution. There is a way to plot a normal probability density function estimate on the same plot. We pass the norm object from the scipy.stats module to the fit parameter of distplot.

from scipy.stats import norm
sns.distplot(data["chol"],fit=norm)
Distribution plot with normal pdf

The black line is the normal probability density function fitted to this data. The fit is not perfect, since the distribution is right-skewed; let us see how it improves with a log transformation of the data.

sns.distplot(np.log1p(data["chol"]),fit=norm)
Log-transformed distribution plot with normal pdf

We can see that the two curves almost coincide, which shows that the log transform has brought the data close to a normal distribution.
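
If we want a number to back up the visual check, one option (a small addition beyond the plots above) is to compare the skewness of the feature before and after the transform; values near zero indicate a roughly symmetric, normal-like shape.

from scipy.stats import skew
# Skewness close to 0 suggests a symmetric, normal-like distribution.
print("skew before:", skew(data["chol"]))
print("skew after :", skew(np.log1p(data["chol"])))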

Display percentages in grouped bar charts

In grouped bar charts, we split a feature, or an aggregation of a feature, across another feature. Let us look at an example where we count heart patients who survived or died, split by whether they had high blood pressure.
Normally, we would like the plot itself to tell us what percentage of patients with high blood pressure survived and what percentage of patients without high blood pressure survived.

ax = sns.countplot(x="high_blood_pressure", hue="DEATH_EVENT", data=dat)

We use the bar plot's patches (each bar is a matplotlib Rectangle) to calculate the percentages and to get the coordinates at which to place the percentage text.

ax = sns.countplot(x="high_blood_pressure", hue="DEATH_EVENT", data=dat)
patches = ax.patches
# The bars are ordered hue level by hue level, so patches[i] and
# patches[i + half] are the two bars of the same x category.
half = int(len(patches) / 2)
for i in range(half):
    pat_1 = patches[i]
    pat_2 = patches[i + half]
    height_1 = pat_1.get_height()
    height_2 = pat_2.get_height()
    total = height_1 + height_2
    # The horizontal centre of each bar.
    x_1 = pat_1.get_x() + pat_1.get_width() / 2
    x_2 = pat_2.get_x() + pat_2.get_width() / 2
    # Place each bar's share of the category total just above the bar.
    ax.text(x_1, height_1 + 1, "{:.0%}".format(height_1 / total))
    ax.text(x_2, height_2 + 1, "{:.0%}".format(height_2 / total))

Now we can say that, of the patients with normal blood pressure, 71% survived, whereas of the patients with high blood pressure, only 63% survived.
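
If we want to double-check the annotated percentages off the plot, a row-normalised crosstab gives the same shares; a minimal sketch, using the same dat dataframe as above.

import pandas as pd
# Each row sums to 1: the survived/died split within each blood pressure group.
print(pd.crosstab(dat["high_blood_pressure"], dat["DEATH_EVENT"], normalize="index"))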

Bar plots to show percentage of null values in the data

The first thing we do before starting exploratory data analysis on a dataset is check for null values. We use the simple isnull function. To get the feature-wise null value count:

dataset.isnull().sum()
Text output while checking for null values
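
Before plotting, it can also help to see the same information as percentages; a short sketch computing the missing_perc series that the plot below reuses.

# Percentage of missing values per feature, highest first.
missing_perc = dataset.isnull().sum() / len(dataset) * 100
print(missing_perc.sort_values(ascending=False))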

We may want to represent this pictorially to get a sense of the sparsity of the dataset at a single glance. We can use two overlaid horizontal bar plots to achieve this, as shown below.

f, ax = plt.subplots(figsize=(12, 10))
missing_perc = dataset.isnull().sum() / len(dataset) * 100
# Full-width background bars, overlaid with bars covering the non-missing share.
ax1 = plt.barh(missing_perc.index, [100] * len(missing_perc), edgecolor="black", color="white")
ax2 = plt.barh(missing_perc.index, 100 - missing_perc.values, color="b")
ax.set_title("Missing Data")
# Annotate each row with the percentage of data that is present.
for p1, p2 in zip(ax1.patches, ax2.patches):
    plt.text(p1.get_width(), p1.get_y(), "{:.1%}".format(p2.get_width() / p1.get_width()))
Pictorial representation of missing count

Now we can quickly conclude that three features have missing values, and the percentages are available at a glance. This bar plot gives an aesthetically pleasing view of the missing value counts.

Geographical Data

We often have geographical data that we end up representing through bar plots, line charts, pie charts, or some other plot, where we lose the geographical significance of the data.

We can use the choropleth object in the Plotly / Plotly Express library to plot our data on maps. The locations on the map can be referenced using GeoJSON, and referencing countries becomes even easier with the built-in geometries, which recognise country names.
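
For locations that the built-in geometries do not cover, the GeoJSON route looks roughly like the sketch below; the file name regions.geojson, the properties.name key, and the column names are hypothetical placeholders, not from the original dataset.

import json
import plotly.express as px
# Hypothetical GeoJSON file and column names, for illustration only.
with open("regions.geojson") as f:
    regions = json.load(f)
fig = px.choropleth(dataset, geojson=regions, locations="region_name",
                    featureidkey="properties.name", color="some_metric")
# Zoom the map to the supplied shapes and hide the base geography.
fig.update_geos(fitbounds="locations", visible=False)
fig.show()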

With the built-in country geometries, we need to specify the dataset, the locations (countries), the color as the continuous feature we are analysing, the mode in which the locations parameter is read, and finally the data to be displayed on hovering.

import plotly.express as px
fig = px.choropleth(dataset, locations="Country", color="Cummulative_Launches",
                    locationmode="country names", hover_name="Cummulative_Launches")
fig.show()
Geographical location

I hope these techniques add value to your exploratory data analysis. As we encounter different kinds of data, we have to improvise in order to present the data visually in a way that yields insights and supports further analysis of the dataset.
