# Plotting a Quantitative Variable in Your Dataset

## Exploring box plots, violin plots and histograms with cartoons, code, and definitions

## Background — This Series

With so many visualization tools available to a data scientist, I believe it is important to take a step back and think about each type of visualization and what its best use case is. When I think about how best to use a tool, I sometimes like to anthropomorphize it, or even turn it into a cartoon character.

This is the first post in a series in which I draw graphs as cartoon characters in order to understand them better.

## Background — This Post

I pulled this data set about cereal from Kaggle to display various plots. If you would like to run the plot code below, download and save that data set, then run the following code:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('./dataset/cereal.csv')

# Add a column for calorie content per cup rather than per serving
df['cal_per_cup'] = df.calories / df.cups
```

Specifically, I was looking for a data set that had quantitative or continuous values (e.g., number of calories or grams of fiber; anything with a numeric, non-categorical variable) for each sample. Though my focus is on how to show these quantitative values, I also wanted a few categorical variables (e.g., is the cereal on the top, middle, or bottom shelf?) so that I could show how to break up samples by category and display the quantitative values for each category. This cereal data set has a great mix of both.

# The Distribution Family

## The Parents: Box Plots Borat and Wanda

This family is in the business of explaining what many samples of the same measurement mean. They can tell you what the median of those samples is and whether the samples are normal or skewed (the kids can even tell you if they’re multi-modal).

Meet Borat the Box Plot and Wanda the Box Plot (although she likes to be known for her whiskers). They met at a trampoline convention. While jumping on the trampoline, Wanda lost one of her outliers as it bonked Borat on the head. He went to return it and the rest is history.

Borat and Wanda are quite alike. Borat likes to keep his look simple and tell stories as succinctly as possible. Wanda, on the other hand, likes to dress up a bit more. When she is explaining things, she spends a bit more time adding details to her stories. She wants her audience to understand a bit more about the anomalies she sees.

## Box Plots in Practice

“Borat” shows a box plot, sometimes referred to as a box and whisker plot, in its simplest form. It gives you five pieces of information about a distribution: the maximum, the minimum, the median, the 75th percentile and the 25th percentile.

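Note that the default parameters for this graph do not set the whiskers to the max and min, so the plot code below has to ask for that explicitly. Older versions of seaborn and matplotlib accepted `whis='range'` for this; newer releases expect a pair of percentiles instead, so the code passes `whis=(0, 100)`.

```python
plt.figure(figsize=(4, 8))
# whis=(0, 100) stretches the whiskers to the min and max of the data
sns.boxplot(y=df.cal_per_cup, whis=(0, 100), color='#BFD0FE')
plt.ylabel('Calories Per Cup')
plt.title('Distribution of Calories in Cereals');
```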
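To make the five numbers concrete, here is a minimal sketch that computes them with NumPy. The `calories_per_cup` values and the `five_number_summary` helper are made up for illustration and are not taken from the cereal data set; with the real data you would pass `df.cal_per_cup` instead.

```python
import numpy as np

# Hypothetical calories-per-cup values, invented for this example
calories_per_cup = np.array([70, 90, 100, 110, 110, 120, 150, 160, 200, 440])


def five_number_summary(values):
    """Return (min, 25th percentile, median, 75th percentile, max)."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (float(np.min(values)), float(q1), float(median),
            float(q3), float(np.max(values)))


print(five_number_summary(calories_per_cup))
```

These are the values a simple box plot draws: the two whisker ends, the two box edges, and the line inside the box.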

“Wanda” shows how box plots are more frequently used (this form is sometimes referred to as a Tukey box plot) and how you can adjust the parameters to select what you are conveying. By default, the whiskers are based on the interquartile range (IQR), which is the difference between the 75th and 25th percentiles. The top whisker is placed at the highest data point that lies no more than 1.5 * IQR above the 75th percentile, and the bottom whisker is placed at the lowest data point that lies no more than 1.5 * IQR below the 25th percentile. Any points outside of the whiskers are plotted as outliers.

```python
plt.figure(figsize=(4, 8))
# Default whiskers: farthest points within 1.5 * IQR of the box
sns.boxplot(y=df.cal_per_cup, color='#F6A6A0')
plt.ylabel('Calories Per Cup')
plt.title('Distribution of Calories in Cereals');
```
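The 1.5 * IQR whisker rule can be checked by hand. The sketch below is my own illustration of that rule, not code from seaborn; the `tukey_whiskers` helper and the sample values are hypothetical.

```python
import numpy as np

# Hypothetical calories-per-cup values, invented for this example
calories_per_cup = np.array([70, 90, 100, 110, 110, 120, 150, 160, 200, 440])


def tukey_whiskers(values, whis=1.5):
    """Return (lower whisker, upper whisker, outliers) for the whis * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers sit on the most extreme data points that are still inside the fences
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return inside.min(), inside.max(), outliers


low, high, outliers = tukey_whiskers(calories_per_cup)
print("whiskers:", low, high)
print("outliers:", outliers)
```

Note that the whiskers land on actual data points, not on the fences themselves, which is why the two whiskers of a Tukey box plot are often different lengths.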

Box plots are great for conveying the spread and skew of a data set, and for comparing several distributions at once. For example, below is the distribution of calories per cup of cereal, grouped by the shelf the cereal sits on.

```python
plt.figure(figsize=(15, 8))
# One box per shelf value (1, 2, 3), drawn at positions 0, 1, 2
sns.boxplot(x=df.shelf, y=df.cal_per_cup, color='#F6A6A0')
plt.xticks([0, 1, 2], ['Bottom Shelf', 'Middle Shelf', 'Top Shelf'])
plt.ylabel('Calories Per Cup')
plt.title('Distribution of Calories in Cereals by Shelf Placement', fontsize=20);
```
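If you want the numbers behind side-by-side boxes, a grouped summary is one way to get them. This is a sketch on a tiny made-up table; the column names `shelf` and `cal_per_cup` match the ones used in this post, but the values are invented.

```python
import pandas as pd

# Small made-up stand-in for the cereal data, three cereals per shelf
toy = pd.DataFrame({
    "shelf": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "cal_per_cup": [100, 110, 120, 110, 130, 150, 90, 160, 200],
})

# Quartiles of calories per cup for each shelf: one row per shelf,
# one column per quantile (0.25, 0.5, 0.75)
summary = toy.groupby("shelf")["cal_per_cup"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Each row holds the bottom edge, middle line and top edge of one box in the grouped plot.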

## The Son: Histogram Howard

After Borat and Wanda got together, they had three kids: one boy (Howard the Histogram) and twin girls named Viola and Violet (the violin plot pair).

Howard can be a bit rough around the edges depending on what he’s showing and how many bins he chooses to show it in. He has a reputation for being lazy, as he’s usually the only one lying down. But every now and then he’ll stand up when it’s appropriate, like for the family photo above.

## Histograms in Practice

A histogram will tell you a lot more about the shape of a distribution: is it Gaussian? Is it uniform? Is it multi-modal? Is it skewed? Typically, a histogram does not make it immediately obvious where the median or quartiles are in a data set, although those can be added if desired.

When creating histograms, it’s important to think about how many bins you want, as that affects the clarity of your histogram and the right number varies between data sets (I often run a few graphs with different bin counts to understand my data). I set bins equal to 5, then 20, then 50 to see too few bins, a good number of bins and too many bins; the code below shows the version with 20.

Note: for this data set, 20 bins conveyed the data well. However, there is no hard and fast rule for how many bins you should use and 20 could be bad for your data set. You should instead look at various numbers of bins and select the one that most clearly conveys the spread of values.

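```python
plt.figure(figsize=(8, 4))
plt.hist(df.cal_per_cup, bins=20,
         color='#BDFCC8', edgecolor='#1F8F50')
# The values run along the x-axis; the bar heights are counts
plt.xlabel('Calories Per Cup')
plt.ylabel('Number of Cereals')
plt.title('Distribution of Calories in Cereals');
```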
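To get a feel for what changing the bin count does before plotting anything, you can look at the counts directly. The sketch below uses a made-up, right-skewed sample rather than the cereal data. NumPy's `"auto"` bin suggestion is shown only as a possible starting point, not as a rule.

```python
import numpy as np

# Made-up, right-skewed sample standing in for calories per cup
rng = np.random.default_rng(seed=0)
sample = rng.gamma(shape=4.0, scale=35.0, size=77)

# Bar heights for too few, a moderate number of, and too many bins
counts_by_bins = {n: np.histogram(sample, bins=n)[0] for n in (5, 20, 50)}

for n_bins, counts in counts_by_bins.items():
    empty = int((counts == 0).sum())
    print(f"{n_bins:>2} bins: tallest bar = {counts.max()}, empty bins = {empty}")

# NumPy can also suggest bin edges for a sample
suggested_edges = np.histogram_bin_edges(sample, bins="auto")
print("suggested number of bins:", len(suggested_edges) - 1)
```

With very few bins, most of the sample piles into one or two bars. With very many, bars become short and gaps start to appear. Looking at a few settings side by side is usually the quickest way to choose.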

Histograms can be good for comparing multiple datasets, but they don’t tend to be my favorite. You can either overlay histograms (best for a maximum of 3 datasets) or make a column of histograms to look at next to one another.
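Here is a rough sketch of the overlay approach using matplotlib. The two groups are randomly generated stand-ins, not the real shelf data, and the shared bin edges are my own choice so that the bars from both groups line up.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
# Made-up samples standing in for calories per cup on two shelves
groups = {
    "Bottom Shelf": rng.normal(120, 25, size=40),
    "Top Shelf": rng.normal(170, 40, size=40),
}

# Compute one set of bin edges from all values so both histograms share bins
all_values = np.concatenate(list(groups.values()))
edges = np.histogram_bin_edges(all_values, bins=15)

fig, ax = plt.subplots(figsize=(8, 4))
for label, values in groups.items():
    # alpha makes the bars see-through so overlapping regions stay visible
    ax.hist(values, bins=edges, alpha=0.5, label=label)
ax.set_xlabel("Calories Per Cup")
ax.set_ylabel("Number of Cereals")
ax.legend()
```

Transparency is what keeps an overlay readable, and it is also why this gets muddy beyond about three groups.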

You can also use histograms to show two quantitative variables by creating marginal histograms adjacent to scatterplots. I have done this below with the fiber content versus the calorie content in the cereal data set:

```python
sns.jointplot(x=df.cal_per_cup, y=df.fiber, kind='scatter', color='#1F8F50');
```

## The Twin Girls: Violin Plot Viola and Violet

Viola and Violet are always attached at the hip, and often (but not always!) mirroring one another. While Howard can be seen with steps, Viola and Violet like to wear Spanx over their rough edges. It’s rare to find them not smoothed out.

## Violin Plots in Practice

Violin plots share several features of box plots, but their shape is based on a kernel density estimate (KDE), which is effectively a smoothed-out version of the histogram for that data set (this is why I pointed out that Viola and Violet always wear Spanx to smooth out their edges).

```python
plt.figure(figsize=(4, 8))
sns.violinplot(y=df.cal_per_cup, color='#F0BFFF')
plt.ylabel('Calories Per Cup')
plt.title('Distribution of Calories in Cereals');
```
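To show what the smoothing means, here is a hand-rolled Gaussian KDE in NumPy. It is a sketch of the general idea only, not seaborn's implementation. The `gaussian_kde_curve` helper, the sample values and the bandwidth of 25 are all assumptions chosen for illustration, whereas seaborn picks its bandwidth automatically.

```python
import numpy as np


def gaussian_kde_curve(values, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated at each grid point."""
    values = np.asarray(values, dtype=float)
    # Distance of every grid point from every data point, in bandwidth units
    z = (grid[:, None] - values[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


# Hypothetical calories-per-cup values, invented for this example
calories_per_cup = np.array([70, 90, 100, 110, 110, 120, 150, 160, 200, 440])
grid = np.linspace(-100, 700, 801)
density = gaussian_kde_curve(calories_per_cup, grid, bandwidth=25.0)

# A density curve should enclose an area of roughly 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under the curve: {area:.3f}")
```

Mirroring this curve around a vertical axis gives the outline of a violin. A larger bandwidth gives a smoother outline, and a smaller one brings back the bumps.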

Like box plots, violin plots are great for showing several sets of data side by side for comparison. See the data broken up by shelf placement again below:

```python
plt.figure(figsize=(15, 8))
sns.violinplot(x=df.shelf, y=df.cal_per_cup, color='#F0BFFF')
plt.xticks([0, 1, 2], ['Bottom Shelf', 'Middle Shelf', 'Top Shelf'])
plt.ylabel('Calories Per Cup', fontsize=16)
plt.xlabel('Cereal Placement', fontsize=16)
plt.title('Distribution of Calories in Cereals by Shelf Placement', fontsize=20)
plt.tight_layout();
```

Another great feature of violin plots is that if you have a binary category that you would like to display your data by, you can split the violin plot to show the KDE (the smoothed out histogram) of one set of data on the left and the other set of data on the right (This is why Viola and Violet are not identical twins — they sometimes look very different). The middle section shows the median and other values for the combined data.

The plot below shows the calories per cup with the data divided into cereals made by General Mills and cereals made by Kellogg’s (the manufacturers of the majority of cereals in the data set).

```python
plt.figure(figsize=(15, 8))
# Keep only Kellogg's ('K') and General Mills ('G') cereals
kg = df[df.mfr.isin(['K', 'G'])]
# hue_order fixes which half is which, so the legend labels below match
plot = sns.violinplot(x=kg.shelf, y=kg.cal_per_cup,
                      hue=kg.mfr, hue_order=['K', 'G'], split=True,
                      color='#F0BFFF')
handles, labels = plot.get_legend_handles_labels()
plt.xticks([0, 1, 2], ['Bottom Shelf', 'Middle Shelf', 'Top Shelf'])
plt.ylabel('Calories Per Cup', fontsize=16)
plt.xlabel('Cereal Placement', fontsize=16)
plt.legend([handles[0], handles[1]], ['Kelloggs', 'General Mills'], title='Manufacturer')
plt.title('Distribution of Calories in Cereals \nby Shelf Placement and Manufacturer', fontsize=20)
plt.tight_layout();
```

## Conclusion

If you would like to see my Jupyter notebook for all of the plots posted, you can find it here. I look forward to bringing you my next set of graphs as cartoons soon. Please let me know if you have any requests!