Getting Started with Seaborn
Distribution Plots
- distplot
- jointplot
- pairplot
- rugplot
- kdeplot
import seaborn as sns
# To show the graphs within the notebook
%matplotlib inline
tips = sns.load_dataset('tips')
tips.head()
DistPlot
The distplot shows the distribution of a univariate set of observations.
sns.distplot(tips['total_bill'])
# To remove the KDE curve, set kde=False; use bins to set the number of histogram bins
sns.distplot(tips['total_bill'],kde=False,bins=20)
Jointplot
jointplot() pairs a bivariate plot of two variables with a univariate distribution plot of each variable on the margins.
# Available values for the kind parameter: scatter, reg, resid, kde, hex
sns.jointplot(x='total_bill',y='tip',data=tips, kind='scatter')
sns.jointplot(x='total_bill',y='tip',data=tips, kind='kde')
sns.jointplot(x='total_bill',y='tip',data=tips, kind='hex')
sns.jointplot(x='total_bill',y='tip',data=tips, kind='reg')
pairplot
pairplot will plot pairwise relationships across an entire dataframe (for all the numerical columns) and supports a color hue argument (for categorical columns)
sns.pairplot(tips)
sns.pairplot(tips, hue='sex',palette='coolwarm')
rugplot
rugplots just draw a dash mark for every point on a univariate distribution.
sns.rugplot(tips['total_bill'])
kdeplot
kdeplots are Kernel Density Estimation plots. These KDE plots replace every single observation with a Gaussian (Normal) distribution centered around that value
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Create dataset
dataset = np.random.randn(25)
# Create another rugplot
sns.rugplot(dataset);
# Set up the x-axis for the plot
x_min = dataset.min() - 2
x_max = dataset.max() + 2
# 100 equally spaced points from x_min to x_max
x_axis = np.linspace(x_min,x_max,100)
# Set up the bandwidth, for info on this:
url = 'http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth'
bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2
# Create an empty kernel list
kernel_list = []
# Plot each basis function
for data_point in dataset:
    # Create a kernel for each point and append to list
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis)
    kernel_list.append(kernel)
    # Scale for plotting
    kernel = kernel / kernel.max()
    kernel = kernel * .4
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5)
plt.ylim(0,1)
# To get the kde plot we can sum these basis functions.
# Plot the sum of the basis function
sum_of_kde = np.sum(kernel_list,axis=0)
# Plot figure
fig = plt.plot(x_axis,sum_of_kde,color='indianred')
# Add the initial rugplot
sns.rugplot(dataset,color = 'indianred')
# Get rid of y-tick marks
plt.yticks([])
# Set title
plt.suptitle("Sum of the Basis Functions")
sns.kdeplot(tips['total_bill'])
sns.rugplot(tips['total_bill'])
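The summed curve in the walkthrough above is not yet a probability density: each Gaussian kernel integrates to 1, so the sum integrates to the number of observations. A minimal sketch of the normalization step, dividing the sum by the number of points, is shown here. The function name manual_kde and the random seed are hypothetical choices for illustration, and the bandwidth is the same rule of thumb used above.

```python
import numpy as np
from scipy import stats

def manual_kde(data, x_axis, bandwidth):
    """Average of Gaussian kernels centered on each observation."""
    kernels = [stats.norm(point, bandwidth).pdf(x_axis) for point in data]
    return np.sum(kernels, axis=0) / len(data)

rng = np.random.default_rng(42)
data = rng.standard_normal(25)

# Same rule-of-thumb bandwidth as in the walkthrough above
bandwidth = ((4 * data.std() ** 5) / (3 * len(data))) ** 0.2

# Wide margins so the tails of the outermost kernels are included
x_axis = np.linspace(data.min() - 5, data.max() + 5, 1000)
density = manual_kde(data, x_axis, bandwidth)

# Riemann-sum approximation of the area under the curve (should be close to 1)
area = float(np.sum(density) * (x_axis[1] - x_axis[0]))
```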
Categorical Data Plots
- factorplot
- boxplot
- violinplot
- stripplot
- swarmplot
- barplot
- countplot
barplot and countplot
These plots let you aggregate data by a categorical feature. barplot is a general plot that aggregates a numeric variable within each category using some function, by default the mean. countplot instead counts the number of observations in each category.
sns.barplot(x='sex',y='total_bill',data=tips)
# Use estimator to override the default aggregation function
import numpy as np
sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.sum)
sns.countplot(x='sex',data=tips)
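To make the aggregation concrete, the bar heights in these plots can be reproduced with a pandas groupby. The sketch below uses a small invented DataFrame (a hypothetical miniature of tips, not the real values) and computes the numbers that barplot and countplot would draw.

```python
import pandas as pd

# Hypothetical miniature of the tips data (values invented for illustration)
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Bar heights for barplot(x='sex', y='total_bill') with the default estimator (mean)
mean_bill = df.groupby("sex")["total_bill"].mean()

# Bar heights when estimator=np.sum is passed
sum_bill = df.groupby("sex")["total_bill"].sum()

# Bar heights for countplot(x='sex')
counts = df["sex"].value_counts()
```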
boxplot and violinplot
boxplots and violinplots are used to show the distribution of quantitative data across the levels of a categorical variable.
A box plot shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.
A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.
sns.boxplot(x='day',y='total_bill',data=tips, palette='rainbow')
# To draw a boxplot for each numeric column of the entire dataframe
sns.boxplot(data=tips, palette='rainbow',orient='h')
# To add another categorical variable, use hue
sns.boxplot(x='day',y='total_bill',data=tips, palette='rainbow',hue='sex')
sns.violinplot(x='day',y='total_bill',data=tips, palette='rainbow')
sns.violinplot(x='day',y='total_bill',data=tips, palette='rainbow',hue='sex')
# Use split=True to merge the two hue levels into one violin per category
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',split=True,palette='Set1')
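The outlier rule mentioned above, "a function of the inter-quartile range", is commonly the 1.5 × IQR convention, which is also the default whis=1.5 in seaborn's boxplot. A minimal sketch of that calculation with NumPy, on invented values, assuming linear-interpolation percentiles:

```python
import numpy as np

# Invented data with one obvious extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers extend to the most extreme observations still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]
```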
stripplot and swarmplot
The stripplot will draw a scatterplot where one variable is categorical. A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.
The swarmplot is similar to stripplot(), but the points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values, although it does not scale as well to large numbers of observations (both in terms of the ability to show all the points and in terms of the computation needed to arrange them).
sns.stripplot(x="day", y="total_bill", data=tips)
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True)
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1')
# dodge=True separates the hue levels (older seaborn releases called this parameter split)
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1',dodge=True)
sns.swarmplot(x="day", y="total_bill", data=tips)
sns.swarmplot(x="day", y="total_bill",hue='sex',data=tips, palette="Set1", dodge=True)
#Combining Categorical Plots
sns.violinplot(x="tip", y="day", data=tips,palette='rainbow')
sns.swarmplot(x="tip", y="day", data=tips,color='black',size=3)
factorplot
factorplot is the most general form of a categorical plot. It can take in a kind parameter to adjust the plot type:
sns.factorplot(x='sex',y='total_bill',data=tips,kind='bar')
sns.factorplot(x='sex',y='total_bill',data=tips,kind='box')
sns.factorplot(x='sex',y='total_bill',data=tips,kind='violin')
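factorplot was renamed to catplot in seaborn 0.9, and on recent releases the old name may no longer be available. The sketch below shows the same kind-based call with catplot, assuming seaborn 0.9 or later. The DataFrame is an invented, hypothetical miniature of tips so the example runs offline.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical miniature of the tips data (values invented for illustration)
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# kind accepts the same plot types as before, for example 'bar', 'box', 'violin'
g = sns.catplot(x="sex", y="total_bill", data=df, kind="bar")
```

Like factorplot, catplot is figure-level and returns a FacetGrid, so it also accepts col and row for faceting.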
Matrix Plots
Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data
flights = sns.load_dataset('flights')
flights.head()
Heatmap
# For a heatmap to work, the data must be in matrix form, obtained with corr() or by pivoting the data
# Matrix form for correlation data
# (on pandas 2.0 and later, use tips.corr(numeric_only=True) because tips has categorical columns)
tp=tips.corr()
sns.heatmap(tp)
# Use annot=True to show the value in each cell
sns.heatmap(tips.corr(),cmap='coolwarm',annot=True)
flights.pivot_table(values='passengers',index='month',columns='year')
pvflights = flights.pivot_table(values='passengers',index='month',columns='year')
sns.heatmap(pvflights)
# Use linecolor and linewidths to draw cell borders and improve the look and feel
sns.heatmap(pvflights,cmap='magma',linecolor='white',linewidths=1)
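The pivot step is what turns long-format rows into the matrix that heatmap expects. A minimal sketch on a few rows shaped like the flights dataset (the values here are illustrative and should be treated as hypothetical):

```python
import pandas as pd

# Long format: one row per (year, month) observation
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Matrix format: months as rows, years as columns, passengers as cell values
matrix = long_df.pivot_table(values="passengers", index="month", columns="year")
n_rows, n_cols = matrix.shape
```

Each cell of the resulting matrix becomes one colored square in the heatmap.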
clustermap
The clustermap uses hierarchical clustering to produce a clustered version of the heatmap.
sns.clustermap(pvflights)
# standard_scale=1 rescales each column before clustering
sns.clustermap(pvflights,cmap='coolwarm',standard_scale=1)
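As described in the seaborn documentation, standard_scale subtracts the minimum and then divides by the maximum along the chosen dimension, with 1 meaning columns. The sketch below reproduces that rescaling with pandas on a small invented matrix. It is an illustration of the idea under that reading of the documentation, not seaborn's own code.

```python
import pandas as pd

# Hypothetical small matrix in the same layout as pvflights (months x years)
matrix = pd.DataFrame(
    {1949: [112.0, 118.0, 132.0], 1950: [115.0, 126.0, 141.0]},
    index=["Jan", "Feb", "Mar"],
)

# Column-wise rescaling: subtract each column's minimum, then divide by its maximum
shifted = matrix - matrix.min()
scaled = shifted / shifted.max()
```

After this step every column runs from 0 to 1, so years with more passengers overall do not dominate the color scale.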
Regression Plots
lmplot displays linear model fits and also lets you split those plots by features of the data, either by coloring with hue or by faceting into columns and rows.
sns.lmplot(x='total_bill',y='tip',data=tips)
# Adding another category
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex')
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex',palette='coolwarm')
# Specify markers to distinguish the hue levels, and scatter_kws to set the marker size
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex',palette='coolwarm',markers=['o','v'],scatter_kws={'s':100})
## Using a Grid
sns.lmplot(x='total_bill',y='tip',data=tips,col='sex')
# Provide row and column to lmplot
sns.lmplot(x="total_bill", y="tip", row="sex", col="time",data=tips)
# Plot for different days
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm')
# Adding aspect and height (height was called size before seaborn 0.9)
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm',
           aspect=0.6,height=8)
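With hue or col, lmplot fits a separate regression line for each subset of the data. By default this is an ordinary least-squares straight line, which can be approximated numerically with np.polyfit per group. The sketch below uses an invented DataFrame (hypothetical values, chosen so the fitted lines are easy to check by eye).

```python
import numpy as np
import pandas as pd

# Hypothetical data: tips are exactly linear in the bill within each group
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 10.0, 20.0, 30.0],
    "tip": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
    "sex": ["Male"] * 3 + ["Female"] * 3,
})

# One straight-line fit per hue level, as lmplot(..., hue='sex') would draw
fits = {}
for group, rows in df.groupby("sex"):
    slope, intercept = np.polyfit(rows["total_bill"], rows["tip"], deg=1)
    fits[group] = (slope, intercept)
```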
Grids
Grids are general types of plots that allow you to map plot types to rows and columns of a grid. This helps you create similar plots separated by features.
iris = sns.load_dataset('iris')
iris.head()
PairGrid
PairGrid is a subplot grid for plotting pairwise relationships in a dataset.
# Just the Grid
sns.PairGrid(iris)
# Then you map to the grid
g = sns.PairGrid(iris)
g.map(plt.scatter)
# Map to upper,lower, and diagonal
g = sns.PairGrid(iris)
g.map_diag(plt.hist)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
pairplot
pairplot is a simpler, higher-level interface to PairGrid.
sns.pairplot(iris)
sns.pairplot(iris,hue='species',palette='rainbow')
Facet Grid
FacetGrid is the general way to create grids of plots based off of a feature:
# Just the Grid
g = sns.FacetGrid(tips, col="time", row="smoker")
g = sns.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
g = sns.FacetGrid(tips, col="time", row="smoker",hue='sex')
# Notice how the column names come after plt.scatter in the map call
g = g.map(plt.scatter, "total_bill", "tip").add_legend()
JointGrid
JointGrid is the general version for jointplot() type grids, for a quick example:
g = sns.JointGrid(x="total_bill", y="tip", data=tips)
g = sns.JointGrid(x="total_bill", y="tip", data=tips)
g = g.plot(sns.regplot, sns.distplot)
#style and color
sns.set_style('white')
sns.countplot(x='sex',data=tips,palette='deep')
sns.despine()
sns.despine(left=True)