Python’s best visualization library — Seaborn
Python has many ways to create beautiful visuals, but the simplest and most effective library to explore data with beautiful graphs has to be Seaborn.
If you want to get going quickly, here’s a Seaborn cheat sheet from DataCamp:
As we can see in the cheat sheet, there are “categories” of graphs that we can create, depending on our data types and what we are trying to analyze:
- Regression & linear: regplot, lineplot, lmplot
- Distribution: displot, histplot
- Categorical: barplot, boxplot, violinplot, stripplot, countplot
- Matrix: heatmap, clustermap
Let’s select a few graphs from each section and see if we find any interesting analyses from our 2 datasets:
- Titanic: passenger information from the Titanic, focusing on survival
- Tips: customer information from a restaurant, focusing on the tip amount
Let’s install Seaborn and set up our Jupyter notebook with the required libraries to get started
Setup
1. Install Seaborn via the Windows command line or Anaconda shell (pip install seaborn or conda install seaborn)
2. Load our Jupyter notebook: import the pandas and seaborn libraries and our two example datasets
import seaborn as sns
import pandas as pd

titanic = sns.load_dataset("titanic")
tips = sns.load_dataset("tips")
Check the first 5 rows of our data
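A quick way to do this check is DataFrame.head(), which returns the first five rows by default. The small frame below is a hand-typed stand-in that borrows the tips column names so the snippet runs without downloading anything; its values are illustrative, not the real data. In the notebook you would simply call tips.head() and titanic.head().

```python
import pandas as pd

# Hand-typed stand-in using the tips column names (illustrative values only).
tips_demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "smoker": ["No", "No", "No", "No", "No", "No"],
    "day": ["Sun"] * 6,
    "time": ["Dinner"] * 6,
    "size": [2, 3, 3, 2, 4, 4],
})

preview = tips_demo.head()   # first 5 rows by default
print(preview)
print(tips_demo.dtypes)      # column types hint at which plot families fit
```

Numeric columns such as total_bill suit regression and distribution plots, while text columns such as smoker work well as categories or hues.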
Seaborn plots
With most of our plots, the parameters required are generally in the format:
sns.plottype(data=dataframe, x="column_name", y="column_name")
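To make that pattern concrete, here is a minimal sketch that sends the same data/x/y arguments to three different plot functions. The demo frame is made up for illustration, and the ax argument (used here to place the plots side by side) is an extra that axes-level functions accept.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows, only to show the shared call shape.
demo = pd.DataFrame({
    "total_bill": [10.0, 15.5, 20.0, 24.0, 30.5, 42.0],
    "tip": [1.5, 2.0, 3.0, 3.5, 4.0, 6.0],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Same three keywords every time: data, x, y
sns.scatterplot(data=demo, x="total_bill", y="tip", ax=axes[0])
sns.regplot(data=demo, x="total_bill", y="tip", ax=axes[1])
sns.boxplot(data=demo, x="smoker", y="tip", ax=axes[2])

fig.tight_layout()
```

Figure-level functions such as lmplot and displot build their own figure, so they take data, x and y in the same way but do not accept ax.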
Regression and line plots
regplot
We start with a regression plot (regplot) and analyze the tips data to see whether there is any correlation between the tip amount and the bill amount
sns.regplot(data=tips, x="total_bill", y="tip")
The regplot shows that there is indeed a positive correlation here: the greater the total bill, the greater the tip. However, regplot can’t give us any further breakdown if we want to include a third variable.
For this let’s try lmplot() and add “smoker” as a hue, which plots two regression lines on one set of axes
sns.lmplot(data=tips, x="total_bill", y="tip", hue="smoker")
Fantastic! We can see that the regression line for non-smokers is steeper than the one for smokers, so non-smokers’ tips grow faster as the bill increases
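If you want numbers to go with the picture, a rough cross-check is to compute the correlation and a least-squares slope per hue level yourself. This is only a sketch on made-up rows (not the real tips data), using np.polyfit as a stand-in for the straight-line fit that regplot and lmplot draw by default.

```python
import numpy as np
import pandas as pd

# Made-up rows: the "No" group has the steeper bill-to-tip relationship.
demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0],
    "tip":        [1.5, 3.0, 4.5, 6.0, 2.0, 2.5, 3.0, 3.5],
    "smoker":     ["No"] * 4 + ["Yes"] * 4,
})

# Pearson correlation between bill and tip across all rows
r = demo["total_bill"].corr(demo["tip"])

# Least-squares slope (extra tip per extra unit of bill) for each hue level
slopes = {
    level: np.polyfit(group["total_bill"], group["tip"], deg=1)[0]
    for level, group in demo.groupby("smoker")
}

print(f"correlation: {r:.2f}")
print(slopes)
```

Running the same two steps on the real tips frame gives a number to put next to each regression line.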
Distribution
For distribution or histogram plots we only need an x value, which is bucketed (or binned) into set ranges and counted
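The binning step can be sketched with np.histogram, which does the bucketing and counting without drawing anything. The ages and the choice of four bins over 0 to 80 are made up for illustration.

```python
import numpy as np

# Made-up ages, just to show bucketing and counting.
ages = np.array([2, 19, 22, 24, 27, 29, 31, 33, 38, 45, 54, 71])

# Four equal-width bins over 0..80; returns the count per bin and the bin edges
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {n}")
```

Seaborn picks the bins automatically, and the bins and binwidth arguments of histplot and displot let you override that choice.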
displot
A good distribution plot is displot, which we can use to analyze the “age” distribution on the Titanic. We can then add “sex” as a hue to see the difference between male and female ages
sns.displot(data=titanic, x="age", hue="sex", hue_order=["female", "male"])
The displot shows that most of the people on board were aged around 18–35. With the hue we can also see that there were more males than females on board, and more elderly males than elderly females
We also set hue_order here so that female comes first and male second (try it without hue_order and see the default order)
Categorical
For categorical plots we have quite a few options, so let’s look at some key ones that I use regularly
boxplot
sns.boxplot(data = titanic, x = "sex", y= "age")
The boxplot shows similar information to the displot, but we can also see some outliers among the male ages (around 68+)
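Where do those outlier points come from? With the default settings, the whiskers extend up to 1.5 times the interquartile range beyond the box, and anything further out is drawn as an individual point. Here is a minimal sketch of that rule on made-up ages (not the Titanic data).

```python
import pandas as pd

# Made-up ages with one value far above the rest.
ages = pd.Series([18, 21, 24, 26, 28, 30, 32, 35, 40, 80])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)  # edges of the box
iqr = q3 - q1                                      # height of the box

# Points beyond 1.5 * IQR from the box are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(f"box: {q1} to {q3}, fences: {lower} to {upper}")
print(outliers.tolist())
```

The whis argument of sns.boxplot changes the 1.5 multiplier if you want the whiskers to reach further.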
violinplot
Let’s see if we can enhance the boxplot with one of my favorite plots, the violin plot
sns.violinplot(data = titanic, x = "sex", y="age")
Alright! Now we can see the density of ages, but what if we add another column, “survived”, as the hue?
sns.violinplot(data=titanic, x="sex", y="age", hue="survived", split=True)
OK, this is quite a lot in one graph, so let’s break it down: we added “survived” as the hue and split the plot down the middle (split=True) so the two groups are easier to compare.
There is a lot to analyze, but a quick glance shows that younger males look to have a higher chance of surviving
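Each half of a split violin is a density for one hue level, so a simple way to sanity-check what the eye picks out is to tabulate the group sizes and median ages. The rows below are made up and only borrow the Titanic column names; on the real data you would run the same groupby on titanic.

```python
import pandas as pd

# Made-up rows using the sex / survived / age column names.
demo = pd.DataFrame({
    "sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "survived": [0, 0, 1, 1, 0, 0, 1, 1],
    "age": [30.0, 40.0, 4.0, 26.0, 20.0, 30.0, 24.0, 36.0],
})

# One row per violin half: how many passengers, and their median age
summary = demo.groupby(["sex", "survived"])["age"].agg(["count", "median"])
print(summary)
```

A lower median age among surviving males than among non-surviving males would back up the reading of the violin plot.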
Matrix
heatmap
My most useful plot in the matrix list is the heatmap, specifically when looking at correlation matrices.
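Note: We can quickly create a correlation matrix using corr(). On pandas 2.0 and later, pass numeric_only=True so that text columns such as “sex” are skipped instead of raising an error
titanic.corr(numeric_only=True).head()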
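Correlations are only defined for numeric data, so it helps to see which columns make it into the matrix. This sketch uses a made-up frame with Titanic-style column names and assumes pandas 1.5 or later, where corr() accepts numeric_only.

```python
import pandas as pd

# Made-up rows mixing numeric, boolean and text columns.
demo = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "age": [22.0, 38.0, 26.0, 35.0, 4.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 8.05, 16.70, 51.86],
    "alone": [False, False, True, True, False, True],
    "sex": ["male", "female", "female", "male", "female", "male"],
})

# numeric_only=True keeps int, float and bool columns and drops text columns
corr = demo.corr(numeric_only=True)
print(corr.round(2))
```

Boolean columns such as alone stay in the matrix, while the text column sex is left out.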
We can now create a heatmap of these correlations
sns.heatmap(data=titanic.corr(numeric_only=True))
This looks pretty good but we can make this better:
- Set the color scale to run from -1 to 1, the full range of a correlation, so we can better see the min and max (vmin/vmax)
- Annotate (annot) the values to be inside the blocks
- Change the color palette to see the color changes better (cmap). You can find the different color palettes on the Seaborn website
sns.heatmap(data=titanic.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap="Spectral")
From this heatmap we see a few small positive and negative correlations, but some key ones are the positive correlation between alone and adult_male and the negative correlation between adult_male and survived
Conclusion
We only highlighted a few of the plots available in Seaborn, but hopefully the examples in this article give you a good base to get started.
Once you feel more comfortable with Seaborn, have a look at its gallery and tutorial section to better customize your graphs
All code and data are available on our GitHub