Box Plot with Seaborn
What are we going to learn today?
In this article, we will learn how to create a box plot using Seaborn. A box plot is a type of chart often used in exploratory data analysis.
It is a standard way of visualizing a data distribution through a five-number summary: the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. The whisker ends are not always the true extremes: by default, Seaborn draws each whisker to the farthest point within 1.5 times the interquartile range of the box and marks anything beyond as an outlier.
A box plot also reveals skewness and outliers in the data.
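As a rough illustration of the five-number summary described above, the sketch below computes it with NumPy. The values array is hypothetical and chosen only for illustration, and the quartiles use NumPy's default linear interpolation, which is one of several common conventions.

```python
import numpy as np

# Hypothetical sample of bill amounts, chosen only for illustration.
values = np.array([3.0, 7.0, 8.0, 10.0, 12.0, 13.0, 14.0, 18.0, 21.0])

# Quartiles via NumPy's default (linear interpolation) percentile method.
q1, median, q3 = np.percentile(values, [25, 50, 75])

five_number_summary = {
    "min": float(values.min()),
    "q1": float(q1),
    "median": float(median),
    "q3": float(q3),
    "max": float(values.max()),
}
print(five_number_summary)
```

The box in a box plot spans q1 to q3, and the line inside it marks the median.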
Why is it important to learn?
A few reasons to learn box plots:
- Box plot can show several statistical measures in a compact form.
- It can help detect outliers in data.
- It can help determine the symmetry and skewness of the data.
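The outlier and skewness points in the list above can be sketched numerically. The example below assumes the common 1.5 × IQR rule (which matches Seaborn's default whisker setting) and uses a hypothetical array; the skew check is only a rough hint, not a formal skewness statistic.

```python
import numpy as np

# Hypothetical data with one unusually large value.
values = np.array([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 20.0, 48.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles; points outside are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Rough skew hint: is the upper half of the box longer than the lower half?
right_skewed = (q3 - median) > (median - q1)

print("outliers:", outliers)
print("right-skewed box:", right_skewed)
```

In a box plot, the flagged points appear as individual markers beyond the whiskers, and a median line sitting closer to Q1 than to Q3 suggests a right skew.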
How can we achieve today’s goal?
The plan for today is:
- Create a box plot using Python Seaborn
- Fix the values on the x-axis according to the data
- Box plot with a categorical variable
- Using the hue parameter
- Box plot of each numerical column in the data set
- Adding jitter
- Conclusion
Let’s import the required libraries.
Input
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
We will use the tips and iris datasets in this article. Both can be loaded directly from Seaborn.
Let’s load them with Seaborn's load_dataset function, which downloads the data from Seaborn's online example repository, so an internet connection is needed the first time.
Input
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
Create a box plot using Python Seaborn
To draw a box plot, we will use the Seaborn boxplot function and pass it two parameters: x (the column to plot) and data (the DataFrame).
Input
sns.boxplot(x='total_bill', data=tips);
Output
The plot above is a box plot of total_bill. It shows the five-number summary of the column, with any points beyond the whiskers drawn as individual outliers.
Fix the values on the x-axis according to the data
By default, the ticks on the x-axis here are 10 units apart. We can change this with the matplotlib xticks function.
Input
sns.boxplot(x='total_bill', data=tips);
plt.xticks(np.arange(1, 55, 3));  # np.arange(start, stop, step)
Output
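To see which tick positions the np.arange call passed to xticks produces, it can help to inspect the array on its own. This is a small sketch; the variable name ticks is only illustrative.

```python
import numpy as np

# start=1, stop=55 (excluded), step=3
ticks = np.arange(1, 55, 3)

print(ticks)
print("first:", ticks[0], "last:", ticks[-1], "count:", len(ticks))
```

Because the stop value is excluded, the last tick is 52 rather than 55, so the stop should be chosen slightly above the largest value you want labelled.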
Box plot with a categorical variable
We can also draw a separate box for each level of a categorical variable.
Let’s pass the day column to the x parameter and the total_bill column to y. This creates one box for each day in the data set.
Input
sns.boxplot(x='day', y='total_bill', data=tips);
Output
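Each box in a grouped box plot summarizes the quartiles of one category. As a sketch of what is being summarized, the pandas snippet below computes per-group quartiles on a tiny hypothetical frame that only mimics the column names of tips; the values are made up.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data (not the real values).
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri", "Fri"],
    "total_bill": [10.0, 20.0, 30.0, 12.0, 16.0, 20.0],
})

# One row per day, one column per quartile (0.25, 0.5, 0.75).
quartiles = (
    df.groupby("day")["total_bill"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Running the same groupby on the real tips DataFrame should give the numbers behind each day's box.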
Using the hue parameter
To dig deeper into the data set, you can use the hue parameter, which we discussed for bar charts.
If we set hue to the sex column in the plot above, each day is split into separate boxes for males and females.
Input
sns.boxplot(x='day', y='total_bill', hue='sex', data=tips);
Output
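The following is a minimal, self-contained sketch of the hue split that runs without downloading tips. The DataFrame is a hypothetical stand-in with made-up values, and the Agg backend is selected only so the example works without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical miniature stand-in for the tips data (not the real values).
df = pd.DataFrame({
    "day": ["Thur"] * 4 + ["Fri"] * 4,
    "sex": ["Male", "Male", "Female", "Female"] * 2,
    "total_bill": [10.0, 14.0, 12.0, 18.0, 20.0, 26.0, 16.0, 22.0],
})

ax = sns.boxplot(x="day", y="total_bill", hue="sex", data=df)

# The legend lists the hue levels; one box is drawn per (day, sex) pair.
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
n_groups = df.groupby(["day", "sex"]).ngroups

print(legend_labels, n_groups)
```

With two days and two hue levels, the plot contains four boxes, grouped side by side within each day.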
Box plot of each numerical column in the data set
If we pass the whole data set to boxplot, it draws one box for each numerical column in the data set.
The iris data set has four numerical columns: sepal_length, sepal_width, petal_length, and petal_width.
Input
sns.boxplot(data=iris);
Output
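To check in advance which columns are numerical, one option is to list them with pandas. This sketch uses a small illustrative frame shaped like iris, and it only mirrors the idea of "numerical columns"; it is not a reproduction of Seaborn's internal column selection.

```python
import pandas as pd

# Small illustrative frame with the same column names as iris.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# Keep only columns with a numeric dtype; the text column is left out.
numeric_columns = df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

The species column is text, so it does not appear in the list and gets no box of its own.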
Adding Jitter
Looking at the box for Friday, it seems that Friday has values as high as or higher than Thursday, but that impression can be misleading, because a box plot shows only a summary of the data.
If you want to see how much data lies behind each box, adding jittered points to the plot can make it more insightful.
Input
sns.boxplot(x='day', y='total_bill', data=tips);
Output
We will use the stripplot function from Seaborn to overlay jittered data points on the box plot.
Input
sns.boxplot(x='day', y='total_bill', data=tips);
sns.stripplot(x='day', y='total_bill', data=tips, color='black', jitter=0.2);
Output
Now new patterns become visible. Before concluding that Friday has higher or lower values than the other days, note that Friday has a much smaller sample size than the others.
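The sample-size difference can also be checked numerically rather than read off the jittered points. The sketch below counts rows per day on a hypothetical stand-in frame with made-up values; the same value_counts call should work on the real tips DataFrame.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data (not the real values).
df = pd.DataFrame({
    "day": ["Thur"] * 5 + ["Fri"] * 2 + ["Sat"] * 6,
    "total_bill": [10.0, 12.0, 15.0, 18.0, 22.0,
                   14.0, 25.0,
                   11.0, 16.0, 19.0, 21.0, 27.0, 30.0],
})

# Number of observations behind each box.
counts = df["day"].value_counts()
smallest_group = counts.idxmin()

print(counts)
print("smallest group:", smallest_group)
```

A group with very few rows produces a box that is less reliable as a summary, which is the pattern the jittered points make visible.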
Conclusion
This article covered box plots using real-world datasets. Thank you for reading; I hope you found it helpful.