Beautiful Boxplots With Statistical Significance Annotation
Super short tutorial for Boxplots With Significance Annotation in Python

Introduction & Motivation
A while ago, I kept coming across scientific publications with nice boxplots. In most of them, a statistical test had been used to determine whether the mean value of a specific feature differed significantly between groups.
I have now put together some custom Python code to do exactly this: produce beautiful boxplots with integrated statistical annotations. In this short article, I show how to create such boxplots in Python.
The dataset
We will use the Iris dataset, as in my previous posts. The dataset contains four features (length and width of sepals and petals) for 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). It is often used in data mining, classification and clustering examples and to test algorithms.
For reference, here are pictures of the three flower species:

For this short tutorial, we will use only 2 of the 3 classes, i.e. the setosa and versicolor classes. This is done only for the sake of simplicity.
Working example in Python
Step 1: Let’s load the data and sub-select the desired 2 flower classes:
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load the Iris dataset (once, instead of calling load_iris() repeatedly)
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
classes_names = iris.target_names

# Use only 2 classes for this example
mask = y != 2
X, y = X[mask, :], y[mask]

# Get the remaining class names
classes_names[[0, 1]]
# array(['setosa', 'versicolor'], dtype='<U10')
Step 2: We have now selected all the samples for the 2 classes, setosa and versicolor. We will put the data into a pandas DataFrame to make our lives easier:
df = pd.DataFrame(X,columns=feature_names)
df['Group'] = y
# Long format (one row per sample and feature); this is needed for the boxplots later on
df_long = pd.melt(df, 'Group', var_name='Feature', value_name='Value')

df.head()
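To make the reshaping step concrete, here is a minimal sketch of what `pd.melt` does, using a tiny made-up frame (the values and the column names `f1` and `f2` are illustrative, not taken from the Iris data). Each wide row is unpivoted into one row per feature, which is the layout used for the grouped boxplots later:

```python
import pandas as pd

# Toy wide-format frame: two hypothetical features plus a group label
toy = pd.DataFrame({"f1": [1.0, 2.0], "f2": [3.0, 4.0], "Group": [0, 1]})

# Unpivot: keep 'Group' as the identifier and stack the feature columns
toy_long = pd.melt(toy, "Group", var_name="Feature", value_name="Value")

print(toy_long)
```

The two wide rows become four long rows, one for each combination of sample and feature.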
Step 3: Let’s inspect the dataframe:

As we can see, we have 4 features, and the last column denotes the group membership of the corresponding sample.
The statistical tests
Step 4: Now it’s time to do the statistical tests. We will use a two-sample t-test (since our groups are independent) to test whether the mean value of each of these 4 features (i.e. sepal length, sepal width, petal length, petal width) differs significantly between the 2 groups of flowers (setosa and versicolor).
#* Statistical tests for differences in the features across groups
from scipy import stats

all_t = list()  # t-value for each feature
all_p = list()  # p-value for each feature
for case in range(len(feature_names)):
    # Select the rows of the current feature, then split them by group
    sub_df = df_long[df_long.Feature == feature_names[case]]
    g1 = sub_df[sub_df['Group'] == 0]['Value'].values
    g2 = sub_df[sub_df['Group'] == 1]['Value'].values
    t, p = stats.ttest_ind(g1, g2)
    all_t.append(t)
    all_p.append(p)
To do the statistical test we just used:
t, p = stats.ttest_ind(g1, g2)
Here we compare the mean of g1 (group 1: setosa) to the mean of g2 (group 2: versicolor) and we do that for all 4 features (using the for loop).
But how can we know whether the mean of g1 (group 1: setosa) was greater or smaller than the mean of g2 (group 2: versicolor)?
For this we need to look at the sign of the t-values.
print(all_t)
# [-10.52098626754911, 9.454975848128596, -39.492719391538095, -34.08034154357719]

print(feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Interpretation:
- If the t-value is positive (>0), the mean of g1 (group 1: setosa) is greater than the mean of g2 (group 2: versicolor).
- If the t-value is negative (<0), the mean of g1 (group 1: setosa) is smaller than the mean of g2 (group 2: versicolor).
- The sign only gives the direction of the difference; whether the difference is statistically significant is decided by the p-value, which we check in Step 5.
Reminder: feature_names = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'].
- Only sepal width has a positive t-value, so it is the only feature whose mean is greater in g1 (setosa) than in g2 (versicolor).
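The link between the sign of t and the direction of the difference follows from how the statistic is built. As a rough sketch, the function below computes the pooled-variance two-sample t statistic by hand, which is the form `stats.ttest_ind` uses with its default `equal_var=True`. The name `pooled_t` and the toy values are made up for illustration:

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (hypothetical helper)."""
    n1, n2 = len(a), len(b)
    # Pooled sample variance, weighted by each group's degrees of freedom
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    # The numerator is mean(a) - mean(b), so its sign gives the direction
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))


small = [1.0, 2.0, 3.0]  # toy group with the smaller mean
large = [2.0, 3.0, 4.0]  # toy group with the larger mean

print(pooled_t(small, large))  # negative: the first group has the smaller mean
print(pooled_t(large, small))  # positive: the first group has the larger mean
```

Swapping the two arguments flips the sign but leaves the magnitude unchanged, so the two-sided p-value stays the same.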
Step 5: Check the t-test results
# Count how many features have p < 0.05
print(np.count_nonzero(np.array(all_p) < 0.05))
# 4
Interpretation: We can see that there is a statistically significant difference (p < 0.05) in all 4 features between the setosa and versicolor classes.
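Instead of only counting the significant features, the p-values can also be turned into labels for a plot. Below is a small sketch using the common star convention. The helper name `p_to_label` and the exact thresholds are assumptions (conventions vary between fields), not part of the code in this article:

```python
def p_to_label(p):
    """Map a p-value to a significance label (hypothetical helper)."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"  # not significant at the 0.05 level


# Made-up p-values, just to show the mapping
for p in (0.0004, 0.02, 0.3):
    print(p, p_to_label(p))
```

With the `all_p` list from Step 4, `[p_to_label(p) for p in all_p]` would give one label per feature.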
Step 6: Here is the magic. Let’s create some beautiful boxplots and annotate them with the estimated statistical significance.
# renaming so that class 0 will appear as setosa and class 1 as versicolor
df_long['Group'] = df_long['Group'].map({0: classes_names[0], 1: classes_names[1]})

# Boxplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10), dpi=100)
axes = axes.flatten()

for idx, feature in enumerate(feature_names):
    ax = sns.boxplot(x="Feature", hue="Group", y="Value",
                     data=df_long[df_long.Feature == feature],
                     linewidth=2, showmeans=True,
                     meanprops={"marker": "*", "markerfacecolor": "white", "markeredgecolor": "black"},
                     ax=axes[idx])
    #* tick params
    axes[idx].set_xticklabels([str(feature)], rotation=0)
    axes[idx].set(xlabel=None)
    axes[idx].set(ylabel=None)
    axes[idx].grid(alpha=0.5)
    axes[idx].legend(loc="lower right", prop={'size': 11})
    #* set edge color = black
    # (on recent matplotlib versions the boxes may be listed in ax.patches rather than ax.artists)
    for b in range(len(ax.artists)):
        ax.artists[b].set_edgecolor('black')
        ax.artists[b].set_alpha(0.8)
    #* statistical tests: draw a bracket above the two boxes and label it
    x1, x2 = -0.20, 0.20
    y, h, col = df_long[df_long.Feature == feature]["Value"].max() + 1, 2, 'k'
    axes[idx].plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
    axes[idx].text((x1+x2)*.5, y+h, "statistically significant", ha='center', va='bottom', color=col)

fig.suptitle("Significant feature differences between setosa and versicolor classes/groups", size=14, y=0.93)
plt.show()
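In the plotting code above, the bracket height `h` and the offset above the maximum are fixed numbers, even though the four features have quite different ranges. One possible refinement, sketched here with a hypothetical helper `bracket_coords` and arbitrary default fractions, is to derive the bracket position from the range of the plotted values:

```python
def bracket_coords(values, x1=-0.20, x2=0.20, pad_frac=0.05, height_frac=0.03):
    """Return (xs, ys) for a significance bracket above the data (hypothetical helper).

    pad_frac and height_frac are fractions of the data range; the defaults
    are arbitrary choices, not values from the article's code.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    y = hi + pad_frac * span  # bottom of the bracket, just above the data
    h = height_frac * span    # height of the bracket's vertical ticks
    return [x1, x1, x2, x2], [y, y + h, y + h, y]


# Toy values spanning 0 to 10
xs, ys = bracket_coords([0.0, 4.0, 10.0])
print(xs)
print(ys)
```

The returned lists could be passed straight to `axes[idx].plot(xs, ys, lw=1.5, c='k')`, with the text label placed at height `ys[1]`.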

Conclusions
The statistical tests show a significant difference between setosa and versicolor in all 4 features. Only the mean sepal width of group 1 (setosa) was significantly greater than that of group 2 (versicolor).
On the other hand, the mean sepal length, petal length and petal width of the setosa group were significantly smaller than those of the versicolor group.
These observations can also be verified by looking at the boxplots.
That’s all folks! Hope you liked this article!
Stay tuned & support this effort
If you liked and found this article useful, follow me to be able to see all my new posts.
Questions? Post them as a comment and I will reply as soon as possible.
Get in touch with me
- LinkedIn: https://www.linkedin.com/in/serafeim-loukas/
- ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas
- EPFL profile: https://people.epfl.ch/serafeim.loukas
- Stack Overflow: https://stackoverflow.com/users/5025009/seralouk