Percentiles, Box Plots & their intersection with Data Science

Ameya Shukla
4 min read · Jul 14, 2021

In this article, I would like to explain the statistical concept of the percentile and how it relates to the box plot in the context of data analytics.

PERCENTILE

A percentile describes where a specific value stands relative to the other values in a dataset. The value at a given percentile is the point at or below which that percentage of the values lies.
For example, in the dataset [10, 30, 40, 70, 95], four of the five values (80%) are at or below 70, so 70 can be called the 80th percentile. Similarly, two of the five values (40%) are at or below 30, so 30 is the 40th percentile. Note that this example counts values “at or below” the point, whereas the formula that follows counts only the values strictly below it (which would give 60 for the value 70). Both conventions are in use, and they can give different answers on small datasets.
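The “at or below” reading used in this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `percent_at_or_below` is made up for illustration and is not part of any library.

```python
def percent_at_or_below(values, x):
    """Percentage of entries in `values` that are less than or equal to `x`."""
    count = sum(1 for v in values if v <= x)
    return 100 * count / len(values)


data = [10, 30, 40, 70, 95]
print(percent_at_or_below(data, 70))  # 80.0
print(percent_at_or_below(data, 30))  # 40.0
```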

The formula for calculating percentile is as follows:

Percentile(X) = (Number of Values Less than “X” in the dataset / Total Number or Total Count of Values in the dataset) × 100

where,
X = the value for which the percentile is to be calculated.

Let us consider a small dataset for illustration:

data = [55, 43, 60, 68, 22, 15, 76, 88, 92, 96]

First, we sort this data from the smallest value to the largest, which makes the values below X easy to count. The sorted data is:

sorted_data = [15, 22, 43, 55, 60, 68, 76, 88, 92, 96]

Now, suppose we want to find the percentile of the value 88.

X = 88
Number of Values less than X in the dataset(sorted_data) = 7
Total Number of Values in the dataset(sorted_data) = 10

Therefore, according to the formula for percentile above,

Percentile(88) = (7 / 10) × (100)
Percentile(88) = 70

So, 88 is the 70th percentile of the dataset. This means that 70% of the values in the dataset are below 88.
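The worked example above can be reproduced in Python. The sketch below applies the formula as stated (counting values strictly less than X); the function name `percentile_rank` is an illustrative choice, not a library function. Sorting is shown only to mirror the manual steps, since counting does not depend on the order.

```python
def percentile_rank(values, x):
    """Percentile(X) = (number of values less than X / total count) * 100."""
    count_below = sum(1 for v in values if v < x)
    return 100 * count_below / len(values)


data = [55, 43, 60, 68, 22, 15, 76, 88, 92, 96]
sorted_data = sorted(data)

print(sorted_data)                # [15, 22, 43, 55, 60, 68, 76, 88, 92, 96]
print(percentile_rank(data, 88))  # 70.0
```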

BOX PLOT

A box plot can be considered a visual representation of the percentiles of a dataset. It conveys two valuable pieces of information about the data: the median and the spread (range). Outliers can also be seen on the plot.
A box plot summarises the distribution of the data points with five values, called “minimum”, “first quartile (Q1)”, “median”, “third quartile (Q3)” and “maximum”. Q1, the median and Q3 are the 25th, 50th and 75th percentiles respectively.
The Interquartile Range (IQR) is the distance between Q1 and Q3 (IQR = Q3 − Q1), and the interval from Q1 to Q3 contains the middle ≈50% of the data. On the plot, this interval is drawn as the “Box”, with a line inside it marking the median.
About 25% of the data lies between “minimum” and Q1, ≈25% between Q1 and the median, ≈25% between the median and Q3, and ≈25% between Q3 and “maximum”. This is how the data is distributed in the plot.
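The five values of a box plot can be computed directly. The sketch below uses Python's standard `statistics.quantiles` on the small dataset from the percentile section. The choice of `method="inclusive"` is an assumption made here (it interpolates linearly between data points); other quartile conventions exist and give slightly different Q1 and Q3 on small datasets.

```python
import statistics

data = [55, 43, 60, 68, 22, 15, 76, 88, 92, 96]

# n=4 splits the data into four parts and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("minimum:", min(data))  # 15
print("Q1:     ", q1)         # 46.0
print("median: ", median)     # 64.0
print("Q3:     ", q3)         # 85.0
print("maximum:", max(data))  # 96
print("IQR:    ", iqr)        # 39.0
```

Here the plain minimum and maximum of the data are used. Plotting libraries often draw the whiskers only as far as 1.5 × IQR beyond the box and mark anything further out as an outlier.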

Below is a box plot that I created using Seaborn with a Stroke dataset. It shows how age is distributed, in percentiles, for people with and without a history of stroke.
The dataset is available on Kaggle here. However, if you do not have a Kaggle account, you can download the dataset from my GitHub here.
You can also find this Jupyter Notebook file in my GitHub repository here.

This image is the output of the box plot plotted above, shown separately for a better view.

For a description of how a box plot is constructed from a mathematical point of view, I recommend watching the YouTube video below from “Khan Academy”.

I also found the article below on box plots extremely valuable, as it gives detailed information about the box plot and explains every aspect of it. I highly recommend reading it.

Source

The image above compares a box plot with the Probability Density Function (PDF) of a Standard Normal Distribution and shows where the IQR, median, Q1, Q3, “minimum” and “maximum” fall.
One important thing to notice here is the outliers. Together, the outliers make up around 0.7% (0.35% + 0.35%) of the total dataset.
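The 0.35% on each side can be checked numerically. The sketch below assumes the common convention that points lying more than 1.5 × IQR beyond Q1 or Q3 are treated as outliers (an assumption here, since the convention is not stated in the text), and uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

q1 = z.inv_cdf(0.25)  # 25th percentile
q3 = z.inv_cdf(0.75)  # 75th percentile
iqr = q3 - q1

# Assumed outlier cut-offs: 1.5 * IQR beyond the box on each side.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

below = z.cdf(lower_fence)      # probability mass left of the lower cut-off
above = 1 - z.cdf(upper_fence)  # probability mass right of the upper cut-off

print(f"below: {100 * below:.2f}%")            # below: 0.35%
print(f"above: {100 * above:.2f}%")            # above: 0.35%
print(f"total: {100 * (below + above):.2f}%")  # total: 0.70%
```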

CONCLUSION

In this article, I covered the concepts of the percentile and the box plot, and how to read a box plot visually. I also described the relation between the two: a box plot shows how the data is distributed in terms of a few chosen percentiles (the quartiles).

I would like to hear your feedback on this article. You can reach out to me on LinkedIn or in the comments.
