Beautiful Beginner Box Plots in Python

Exploring Data Quantitatively With Boxes

Lewis Gallagher
Save The Data
8 min read · Oct 20, 2021

Photo by Sahand Babali on Unsplash

Why box plots?

In data science, a common task is to visualise how a continuous measurement (such as weight or test results) varies across the levels of a categorical variable (such as Country A vs Country B, or cats vs dogs). Box plots are well suited to this kind of data.

Seaborn or Matplotlib?

Plotting in Python from scratch can be a little daunting. Matplotlib is, in my experience, a complicated package and not a great place for beginner plotters to start. Seaborn is a wrapper around Matplotlib that makes the syntax and plotting parameters far more user-friendly and readable. We’re going to use it to create beautiful and reproducible box plots for comparing categorical variables with continuous measurements.

1. Install required packages

If you want to interact with tables of data in Python, the easiest way hands-down is with the Pandas package. It also happens to play very nicely with our plotting package, Seaborn. You will need to install these via pip or conda. For tutorials on how to install Python packages through these package managers, check out our other tutorials.

Once they’re installed, we need to tell Python we want to use them with the import statement.

Tip: the as pd and as sns syntax gives these libraries shorthand aliases, so we can access Pandas by typing pd instead of writing the full name every time.
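A minimal sketch of the import step, assuming both packages are already installed. Printing the versions is just an optional check that the imports worked.

```python
import pandas as pd    # tables of data, accessed as pd
import seaborn as sns  # plotting built on Matplotlib, accessed as sns

# Optional sanity check: show which versions were imported.
print("pandas", pd.__version__)
print("seaborn", sns.__version__)
```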

2. Create a data frame

You’ll first need a table of data. The easiest way to manage data tables in Python is with the Pandas library. It lets you visualise your tables as you code and integrates with plenty of plotting packages.

An example data set can be accessed at the URL below. The fmri.csv file contains fMRI signal data from multiple subjects, scanned regions and time points. The Pandas pd.read_csv() function reads a comma-delimited file into a data frame.
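As a rough illustration of the reading step, the sketch below feeds pd.read_csv() a few made-up rows instead of the real file. The column names and values are assumptions standing in for fmri.csv; in practice you would pass the file path or URL as the first argument.

```python
import io

import pandas as pd

# Hypothetical stand-in for fmri.csv: invented rows with assumed column names.
csv_text = """subject,timepoint,event,region,signal
s0,0,stim,parietal,-0.02
s0,0,stim,frontal,-0.01
s1,0,stim,parietal,0.05
s1,0,stim,frontal,0.03
"""

# read_csv accepts a file path, a URL or a file-like object such as StringIO.
mri = pd.read_csv(io.StringIO(csv_text))

# head() prints the first few rows so we can eyeball the table.
print(mri.head())
```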

3. Interrogate the data

It’s always smart to check that your data looks as expected before attempting any visualisation. Let’s look at the region column (stored under mri["region"]) and the signal column (stored under mri["signal"]).

First of all, what do these variables even look like? The pandas .describe() method returns some summary statistics of data frame columns.

Region is an object, which here means it holds text labels, so we treat it as a categorical variable.

Signal is a float, meaning it is a continuous variable.
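Here is a small sketch of that check on an invented four-row table (the values are made up, not taken from fmri.csv). Note that, depending on your Pandas version, a text column may be reported with a dedicated string type rather than object.

```python
import pandas as pd

# Tiny made-up table with the two columns we care about.
mri = pd.DataFrame({
    "region": ["parietal", "frontal", "parietal", "frontal"],
    "signal": [0.05, 0.03, -0.02, -0.01],
})

# The type of each column: text for region, float for signal.
print(mri.dtypes)

# For a text column, describe() reports the count, the number of unique
# values, the most common value and its frequency.
region_summary = mri["region"].describe()
print(region_summary)

# For a numeric column, describe() reports the count, mean, standard
# deviation, minimum, quartiles and maximum.
signal_summary = mri["signal"].describe()
print(signal_summary)
```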

4. Plotting with Seaborn

We now have our data nicely organised in a Pandas data frame called mri. We can pass this data over to Seaborn, a pretty and user-friendly method of plotting data.

Seaborn can produce a box plot with the boxplot() function. We will pass it three arguments:
1. data is our Pandas data frame: mri
2. x is our categorical variable: region
3. y is our continuous variable: signal

Don’t forget to run import seaborn as sns if you haven’t already!
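A minimal sketch of the call, using randomly generated numbers as a stand-in for the real fMRI table (the data, the seed and the variable name b are assumptions for illustration).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: 50 made-up signal values per region.
rng = np.random.default_rng(0)
mri = pd.DataFrame({
    "region": ["parietal"] * 50 + ["frontal"] * 50,
    "signal": np.concatenate([rng.normal(0.05, 0.1, 50),
                              rng.normal(0.0, 0.1, 50)]),
})

# One box per category in x, summarising the values in y.
# boxplot() returns the Matplotlib Axes it drew on; we keep it as b.
b = sns.boxplot(data=mri, x="region", y="signal")
```

In a notebook the plot appears automatically. In a script you would typically save it to a file or display it with Matplotlib.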

Nice!

A box plot has several elements, which the boxplot() function has computed on our behalf for each region in the region column. The line across the middle of each box indicates the median value of the data. The coloured box spans the lower and upper quartiles (the interquartile range, or IQR), and the whiskers extend to the minimum and maximum values that remain once outliers are excluded. A value counts as an outlier if it is:

greater than the 3rd quartile + 1.5*IQR
less than the 1st quartile - 1.5*IQR

Any data points that lie beyond these boundaries are known as ‘fliers’ and are represented by diamonds on this box plot.
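To make the rule concrete, here is a small worked example with NumPy on nine made-up numbers. It uses NumPy’s default percentile method, so the exact quartiles may differ slightly from other definitions on small samples.

```python
import numpy as np

# Made-up measurements with one obviously extreme value.
signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# Lower and upper quartiles, and the interquartile range between them.
q1, q3 = np.percentile(signal, [25, 75])
iqr = q3 - q1

# Anything outside these fences is drawn as a flier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
fliers = signal[(signal < lower_fence) | (signal > upper_fence)]

print(q1, q3, iqr)               # 3.0 7.0 4.0
print(lower_fence, upper_fence)  # -3.0 13.0
print(fliers)                    # [30.]
```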

The above box plot looks okay, but I think we can do better…

5. Add data points

We can add dots to represent the actual raw data values, using Seaborn’s stripplot().
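One way to sketch the overlay, again on synthetic stand-in data: draw the box plot first, then pass the same Axes to stripplot() so the dots land on top of the boxes. The colour choice here is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: 50 made-up signal values per region.
rng = np.random.default_rng(0)
mri = pd.DataFrame({
    "region": ["parietal"] * 50 + ["frontal"] * 50,
    "signal": np.concatenate([rng.normal(0.05, 0.1, 50),
                              rng.normal(0.0, 0.1, 50)]),
})

# Draw the boxes and keep the Axes so both layers share one plot.
b = sns.boxplot(data=mri, x="region", y="signal")

# Overlay one dot per raw observation on the same Axes.
sns.stripplot(data=mri, x="region", y="signal", color="black", ax=b)
```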

Nearly there! It still looks a bit clunky…

Let’s change the box and dot colours (any named Matplotlib colour is valid), make the boxes thinner and change the plot background. All of this is done by passing a few more arguments beyond data, x and y, each of which is described in the Seaborn documentation.

Don’t be scared! Take it one line at a time.
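The sketch below shows one possible set of styling choices on synthetic stand-in data. The specific colours, widths and sizes are my own assumptions, not necessarily the values used for the original figure.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: 50 made-up signal values per region.
rng = np.random.default_rng(0)
mri = pd.DataFrame({
    "region": ["parietal"] * 50 + ["frontal"] * 50,
    "signal": np.concatenate([rng.normal(0.05, 0.1, 50),
                              rng.normal(0.0, 0.1, 50)]),
})

# Change the plot background (this applies to all later plots too).
sns.set_style("whitegrid")

b = sns.boxplot(
    data=mri, x="region", y="signal",
    width=0.4,         # thinner boxes
    color="skyblue",   # box fill colour
    linewidth=2,       # thicker box outlines
    showfliers=False,  # hide fliers, since the dots show every point
)

sns.stripplot(
    data=mri, x="region", y="signal",
    color="crimson",   # dot colour
    size=4,            # dot size
    alpha=0.4,         # transparency, so overlapping dots stay readable
    ax=b,
)
```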

Now let’s add a plot title and format the axis labels.
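A short sketch of the labelling step on synthetic stand-in data. The title text, label text and font sizes are illustrative assumptions; the methods themselves belong to the Matplotlib Axes that boxplot() returns.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: 50 made-up signal values per region.
rng = np.random.default_rng(0)
mri = pd.DataFrame({
    "region": ["parietal"] * 50 + ["frontal"] * 50,
    "signal": np.concatenate([rng.normal(0.05, 0.1, 50),
                              rng.normal(0.0, 0.1, 50)]),
})

b = sns.boxplot(data=mri, x="region", y="signal", width=0.4, color="skyblue")

# Title and axis labels are set on the Axes object.
b.set_title("fMRI signal by scanned region", fontsize=14)
b.set_xlabel("Region", fontsize=12)
b.set_ylabel("Signal", fontsize=12)
```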

Neat, right? Let’s draw some conclusions from this data by asking whether the fMRI signal is significantly different between these two regions.

6. Add a statistical test

Now let’s perform a statistical test to see if there is a significant difference between these two measurements. Our null hypothesis is that the mean signal does not differ between the two scanned regions.

To test this, we can import the ttest_ind function from scipy.stats. SciPy is a third-party package rather than part of Python itself, but it is bundled with distributions such as Anaconda; if the import fails, install it via pip or conda. Then apply the ttest_ind function to our two groups.

We just want the p-value, so take the value stored under the [1] index.
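A minimal sketch of the test, assuming SciPy is installed and using synthetic stand-in data, so the p-value it prints says nothing about the real fMRI measurements.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in data: 50 made-up signal values per region.
rng = np.random.default_rng(0)
mri = pd.DataFrame({
    "region": ["parietal"] * 50 + ["frontal"] * 50,
    "signal": np.concatenate([rng.normal(0.05, 0.1, 50),
                              rng.normal(0.0, 0.1, 50)]),
})

# Split the signal column into one group per region.
parietal = mri[mri["region"] == "parietal"]["signal"]
frontal = mri[mri["region"] == "frontal"]["signal"]

# Independent two-sample t-test; the result holds (statistic, p-value).
result = ttest_ind(parietal, frontal)

# The p-value sits at index 1 (also available as result.pvalue).
pval = result[1]
print(pval)
```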

7. Bring it all together

Now we have the t-test result stored in the pval variable. We can use this variable when writing the plot’s title. Check out the line near the bottom where we define b.set_title.
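Putting the pieces together might look something like the sketch below. It runs on synthetic stand-in data, and the styling values and title wording are assumptions; the key line is b.set_title, where an f-string drops the p-value into the title.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import ttest_ind

# Synthetic stand-in data: 50 made-up signal values per region.
rng = np.random.default_rng(0)
mri = pd.DataFrame({
    "region": ["parietal"] * 50 + ["frontal"] * 50,
    "signal": np.concatenate([rng.normal(0.05, 0.1, 50),
                              rng.normal(0.0, 0.1, 50)]),
})

# Statistical test: compare the signal between the two regions.
parietal = mri[mri["region"] == "parietal"]["signal"]
frontal = mri[mri["region"] == "frontal"]["signal"]
pval = ttest_ind(parietal, frontal)[1]

# Plot: styled boxes with the raw data points on top.
sns.set_style("whitegrid")
b = sns.boxplot(data=mri, x="region", y="signal",
                width=0.4, color="skyblue", showfliers=False)
sns.stripplot(data=mri, x="region", y="signal",
              color="crimson", size=4, alpha=0.4, ax=b)

# Labels: the p-value is formatted to three significant figures.
b.set_title(f"fMRI signal by scanned region (t-test p = {pval:.3g})", fontsize=14)
b.set_xlabel("Region", fontsize=12)
b.set_ylabel("Signal", fontsize=12)
```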

8. Save the figure

Now that’s a beautiful box plot. When you’re happy with how it looks you can save the figure. Give it a name and lower/raise the dpi parameter to change the resolution of the saved image.
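A minimal sketch of saving, on synthetic stand-in data. The file name fmri_boxplot.png is a placeholder of my own; savefig() is a method of the Matplotlib Figure that the Axes belongs to.

```python
import os

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: 50 made-up signal values per region.
rng = np.random.default_rng(0)
mri = pd.DataFrame({
    "region": ["parietal"] * 50 + ["frontal"] * 50,
    "signal": np.concatenate([rng.normal(0.05, 0.1, 50),
                              rng.normal(0.0, 0.1, 50)]),
})

b = sns.boxplot(data=mri, x="region", y="signal")

# b.figure is the Figure that contains our Axes.
# dpi controls the resolution; bbox_inches="tight" trims extra whitespace.
output_path = "fmri_boxplot.png"  # hypothetical file name
b.figure.savefig(output_path, dpi=300, bbox_inches="tight")

print("Saved:", os.path.abspath(output_path))
```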

9. Final thoughts

We’ve covered some key points to consider when producing box plots.

  • If your data is a mix of categorical and continuous data, a box plot is a brilliant way to visualise it.
  • You can visualise both statistical measurements and the raw data on the same plot.
  • There are lots of ways to customise plots in Python with Seaborn.
  • A statistical test can help us draw conclusions from the data, and combining the statistical result with the plot labels adds value to our box plot.

Thank you for reading! If you have any questions or suggestions for other articles, you are more than welcome to comment. I hope you enjoyed it!

If you’d like to see how this is done in R, please head over to my good friend George Seed’s article on How to Make A Boxplot in R.

Lewis Gallagher
Save The Data

Writing bioinformatics software and statistical analysis for metagenomics and next-generation sequencing data.