Day (2) — DS — How to use Seaborn for Distribution Plots

Keith Brooks
6 min read · Feb 27, 2018

Photo by Lukas Blazek on Unsplash

This article covers work from the Python for Data Science and Machine Learning Bootcamp course on Udemy by Jose Portilla, with helpful tips along the way. The course was very helpful in gaining a base understanding of the topic.

Have you ever received a dataset and said, “Well, this is nice, but how can I use it?” Or wondered what the data looks like? This is only my second exposure to Seaborn, but I am quickly finding reasons to like it. This post will review the following…

1) What is Seaborn
2) How do I install Seaborn
3) Where can I find the Official Docs
4) Using Distribution plots in Seaborn:
~ .distplot() method -> for displaying single variable data distribution (i.e. think histogram)
~ .jointplot() method -> for combining two distribution plots to display together
~ .pairplot() method -> for drawing joint plots for every numerical column combination in a dataset… (This seems to be a really useful tool to quickly assess your data and possible relations)
~ .rugplot() method -> for displaying single-variable distributions via dash marks. (i.e. this is helpful when building the logic for KDE (Kernel Density Estimation) plots)

This example is using Jupyter Notebooks with Python 3.6.

Step (1) Seaborn — First Things First

Seaborn (https://seaborn.pydata.org/) is a visualization library that provides an easy-to-use interface for generating statistical plots in Python. It is built on top of matplotlib and works well with pandas data frames (https://pandas.pydata.org/).

There are two options for installing the library. We can use either of the following commands in the terminal (I am using a Mac). For the second option, use either pip or pip3 depending on the version you are using.


$ conda install seaborn

or

$ pip install seaborn
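To confirm the install worked before opening a notebook, a small check like the one below can help. This is only a sketch: `is_installed` is a helper name I made up, and it relies on the standard library's `importlib.util.find_spec`, which reports whether Python can locate a package without actually importing it.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Seaborn depends on matplotlib and pandas, so check all three.
for name in ("seaborn", "matplotlib", "pandas"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If any of the three shows up as missing, rerun the install command in the same environment that the notebook uses.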

To see the various types of plots available, visit the gallery (https://seaborn.pydata.org/examples/index.html#) tab on the site, and consult the API (https://seaborn.pydata.org/api.html) reference to understand how to call each plot effectively.

“White intricate cobweb art in the dark shown at Power Station of Art” by Jingyi Wang on Unsplash

Step (2) Creating some Distribution plots

Standard Dataset*

To get started, feel free to load the tips dataset to practice using the various plots.


# Import libraries and set matplotlib to inline to display plots within
# the Jupyter notebook
import seaborn as sns
%matplotlib inline
# Use the .load_dataset() method to load the tips dataset
tips = sns.load_dataset('tips')

We now have a data frame with tips data. Quick tip… Always look at your data before you begin to clean and analyze. A few helpful methods are as follows.


# Review the first 10 rows of the data frame
# This provides a good view of the columns and the data
tips.head(10)
# Use the .describe() method to display count, mean, max, min and other info
# on the data frame… This helps ID missing data
tips.describe()
# Use the .info() method to determine the datatype of each column
# This helps you identify any issues with column datatypes
tips.info()
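As a complement to the methods above, here is a minimal sketch of two more checks I find handy: counting missing values per column and listing the numeric columns. It assumes pandas is installed and uses a tiny hand-made frame with made-up values (not the real tips data) so that a missing value is guaranteed to show up.

```python
import pandas as pd

# A tiny hand-made frame standing in for a real dataset (values are made up).
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, None, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "day": ["Sun", "Sun", "Sat", "Sat"],
})

# Count the missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# List only the numeric columns, which are the ones distribution plots use.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

The same two calls should work on the tips data frame by swapping `sample` for `tips`.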

Now we can plot our first distribution plot. Distribution plots are handy when trying to split records into buckets to observe patterns. We can use the below code as an example. Let us practice and look deeper into the total_bill column.

.distplot() method


# Pass in the bins parameter to adjust the bin sizes for the histogram
sns.distplot(tips['total_bill'], kde=False, bins=40)

The sns prefix gives us access to the .distplot() function, which we call on a column of the ‘tips’ dataset. We pass in tips['total_bill'] to select only the total bill column. The kde=False argument removes the KDE curve and displays only the histogram. If desired, please feel free to keep it as True. The bins argument lets us increase or decrease the number of buckets used to distribute the data. (Warning) The more bins we have, the noisier the plot looks, and we lose some insight. Note that Seaborn 0.11 and later deprecate .distplot(); sns.histplot() accepts the same bins argument.
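To see why more bins can mean a noisier plot, here is a rough sketch of equal-width binning in plain Python. This is not how Seaborn computes its histogram internally; `histogram_counts` is a hypothetical helper and the bill totals are made up for illustration.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width buckets."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical, so everything lands in the first bucket.
        counts[0] = len(values)
        return counts
    for value in values:
        # The maximum value belongs in the last bucket, not one past it.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Made-up bill totals, for illustration only (not the real tips data).
bills = [8.5, 10.3, 12.0, 12.4, 13.1, 15.8, 16.2, 18.0, 21.5, 24.1, 30.9, 48.3]

for bins in (4, 40):
    counts = histogram_counts(bills, bins)
    print(f"bins={bins}: tallest bar={max(counts)}, empty buckets={counts.count(0)}")
```

With 4 bins the shape is easy to read (most bills are small), while with 40 bins most buckets are empty and the overall pattern is harder to spot.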

Now let us look at the .jointplot() method. This is helpful when we want to compare two distributions. Let us practice with the following…

.jointplot()


# Use the .jointplot() method with the kind parameter to adjust plot
sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex')

This .jointplot() method takes three arguments, plus an optional kind argument. The ‘x’ is the first column while the ‘y’ is the second column to compare. The data argument refers to our dataset. The kind argument allows us to create some neat visualizations.

Optional kind arguments:
1) kind='hex': creates a hexagon plot that has darker colors for high-density areas.

2) default (kind='scatter'): creates a scatter plot with individual points

3) kind='reg': adds a regression line to the scatter plot to display the linear fit

4) kind='kde': displays density contours showing where the points cluster the most
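For the kind='reg' option, it may help to see what a linear fit actually computes. The sketch below is not Seaborn's own code; it is a minimal ordinary least-squares fit in plain Python, with a hypothetical helper `fit_line` and made-up bills and tips where every tip is exactly 15% of the bill.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: each tip is 15% of its bill.
bills = [10.0, 20.0, 30.0, 40.0]
tips_paid = [1.5, 3.0, 4.5, 6.0]

slope, intercept = fit_line(bills, tips_paid)
print(f"tip is roughly {slope:.2f} * total_bill + {intercept:.2f}")
```

On real data the points scatter around the line instead of sitting on it, which is exactly what the regression plot lets you judge by eye.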

Now let us look at the .pairplot() method. This is very helpful for identifying relationships across the dataset. It plots the numerical columns and also supports a color hue for categorical columns. Basically, this generates a pairwise plot for every numerical column combination in the dataset, with each column's own distribution on the diagonal. It is a nice way to quickly visualize data. We can practice with the following…

# Use the .pairplot() method to compare relationships across numerical series in the dataset,
# with the hue argument to add categorical elements to the plots
sns.pairplot(tips, hue='sex', palette='coolwarm')

The .pairplot() method takes the dataset as an argument, an optional ‘hue’ for categorical division and ‘palette’ for custom colors.
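To get a feel for how large the grid gets, the sketch below lists the panels a pair plot would contain for the three numeric columns of the tips dataset (total_bill, tip and size). It only enumerates the layout in plain Python and does not draw anything; the panel labels are my own wording, not Seaborn's.

```python
from itertools import product

# The numeric columns in the tips dataset.
numeric_columns = ["total_bill", "tip", "size"]

panels = []
for row_name, column_name in product(numeric_columns, repeat=2):
    if row_name == column_name:
        # Diagonal panels show a single column's own distribution.
        kind = "distribution"
    else:
        # Off-diagonal panels plot one column against another.
        kind = "scatter"
    panels.append((row_name, column_name, kind))

for row_name, column_name, kind in panels:
    print(f"row={row_name:<10} col={column_name:<10} -> {kind}")

print(f"{len(panels)} panels in total")
```

Three columns give a 3 x 3 grid, so the number of panels grows with the square of the number of numeric columns. That is worth keeping in mind before calling .pairplot() on a wide dataset.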

Finally, the .rugplot() method draws a small dash mark at each value of a single variable. This provides a more discrete visual than the .distplot() method. I am still trying to find more reasons to use this; however, it does form the basis for the KDE plot. Let us practice with the following…


# Use the .rugplot() method with one column passed into it
sns.rugplot(tips['total_bill'])

This .rugplot() method takes a single column of the dataset as the argument. I will need to practice with a few more datasets to find additional use cases for this one.
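Here is a rough sketch of how the dash marks of a rug plot relate to a KDE curve: place a small Gaussian bump on every mark and average the bumps. The helper name `gaussian_kde_curve`, the bandwidth of 3.0 and the bill totals are all assumptions for illustration. Seaborn picks a bandwidth for you, so its curve will not match this one exactly.

```python
import math


def gaussian_kde_curve(points, bandwidth):
    """Return a function that estimates density as an average of Gaussian bumps."""
    scale = 1.0 / (len(points) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # One bump per data point (one per dash mark on the rug).
        return scale * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


# Made-up bill totals, for illustration only (not the real tips data).
bills = [8.5, 10.3, 12.0, 12.4, 13.1, 15.8, 16.2, 18.0, 21.5, 24.1, 30.9, 48.3]
density = gaussian_kde_curve(bills, bandwidth=3.0)

# Sanity check: the area under a density curve should be close to 1.
step = 0.1
grid = [i * step for i in range(-200, 801)]  # covers -20 to 80
area = sum(density(x) for x in grid) * step
print(f"area under the curve is about {area:.3f}")
print(f"density near 12: {density(12.0):.4f}, near 40: {density(40.0):.4f}")
```

Where the dash marks crowd together the bumps pile up and the curve rises, so the dense cluster of bills around 12 ends up much higher than the sparse region around 40.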

Well, that is the end of this post. Hope there were some useful nuggets in there. Here is a quote to reflect on throughout your day…

‘Only the educated are free.’ ~ Epictetus


Keith Brooks

Hi…This is intended to document my journey on my path to data science. I am a process oriented individual who loves to try new things and continue to learn.