Box Plot using Python: Data Summary by 5 Numbers

Ravish Kumar
Published in EnjoyAlgorithms
5 min read · Feb 19, 2024

Tabular data often contains hundreds or thousands of rows, with multiple columns holding the attribute values for each row. Remembering all of it is impossible, so we summarize the data using various techniques and reason about the whole dataset through those summaries. In this article, we will learn one of the most popular summarization techniques in Data Science for structured/tabular data: the box plot.

What is a Box Plot?

A box plot is a data visualization technique that summarizes a numerical feature of structured data using five numbers calculated from the data. These five numbers are:

  1. Minimum
  2. First quartile
  3. Median
  4. Third quartile
  5. Maximum

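It is also known as a box-and-whisker plot in some literature. Let’s understand these five numbers in detail with the help of a dummy example. We explain the median before the quartiles because the quartiles are defined in terms of the median.

Example 1: Calculating the five numbers to summarize a dataset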

Let’s say we have collected the maths marks (out of 100) of 10 students from a class, and the records are 11, 07, 35, 55, 64, 90, 86, 88, 95, and 97. To calculate the five numbers, it is handy to first arrange the records in ascending order: 07, 11, 35, 55, 64, 86, 88, 90, 95, 97.

1. Minimum

Here, the minimum mark is 07.

2. Median

The median is the middle element of the data once it is arranged in ascending or descending order. If “n” is the total number of samples in the data, there are two scenarios: n is odd or n is even.

  • Odd number of samples: If n is odd, the observation at position (n+1)/2 is the middle of the sorted data and hence the median. For example, with five samples in ascending order, the middle sample is at position (5+1)/2 = 3, because two samples lie to its left and two to its right.
  • Even number of samples: If n is even, there is no single middle position. For example, with six samples, the 3rd and 4th samples are both in the middle. In this case, we take the average of the values at the two middle positions, n/2 and (n/2 + 1).

The example data above has 10 samples, so the median is the average of the 5th and 6th elements of the rearranged samples (07, 11, 35, 55, 64, 86, 88, 90, 95, 97): (64 + 86)/2 = 75.

So the median is 75.
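The two cases above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original walkthrough; the function name `median` and the variable `marks` are chosen here for the example.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2                      # 0-based index of the (upper) middle
    if n % 2 == 1:
        # Odd n: single middle element, the (n+1)/2-th in 1-based counting
        return ordered[mid]
    # Even n: average of the n/2-th and (n/2 + 1)-th elements (1-based)
    return (ordered[mid - 1] + ordered[mid]) / 2


marks = [11, 7, 35, 55, 64, 90, 86, 88, 95, 97]
print(median(marks))  # 75.0
```

Python's standard library also offers `statistics.median`, which follows the same odd/even rule.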

3. First quartile

Quartiles divide the sorted data samples (100%) into four subgroups, each containing roughly 25% of the data.

The first quartile is the median of the samples that lie to the left of the median of the complete dataset, i.e., the lower half. In our example, the samples on the left side of the median are 07, 11, 35, 55, and 64. As the number of samples is odd, the median of this half is its middle element, 35, which is the first quartile.

4. Third Quartile

The third quartile is the median of the samples that lie to the right of the median of the complete dataset, i.e., the upper half. In our example, the samples on the right side of the median are 86, 88, 90, 95, and 97. As the number of samples is odd, the median of this half is its middle element, 90, which is the third quartile.

5. Maximum

The maximum in the dataset is 97.

This concludes the calculation of the five numbers (07, 35, 75, 90, and 97), which together summarize the complete dataset.
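The whole calculation can be reproduced with a short script. The sketch below follows the "median of each half" approach used in this example; the function name `five_number_summary` is hypothetical, and leaving the median out of both halves when n is odd is an assumed convention that the example (with even n) does not cover.

```python
from statistics import median


def five_number_summary(values):
    """Five-number summary using the median-of-halves method (needs >= 2 values)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                       # samples left of the median
    # Assumption: for odd n, the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "minimum": ordered[0],
        "q1": median(lower),
        "median": median(ordered),
        "q3": median(upper),
        "maximum": ordered[-1],
    }


marks = [11, 7, 35, 55, 64, 90, 86, 88, 95, 97]
print(five_number_summary(marks))
# {'minimum': 7, 'q1': 35, 'median': 75.0, 'q3': 90, 'maximum': 97}
```

Libraries such as NumPy and pandas compute quartiles by linear interpolation by default, so they can report slightly different values for Q1 and Q3 on small datasets (40 and 89.5 for these marks).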

Box Plot

Stage 1: Now, we need to represent the entire dataset through the box plot. For that, we first draw an axis covering the range of the data samples. In our case, an axis ranging from 0 to 100 covers all the samples.

Stage 2: Once the axis is drawn, we draw a rectangular box that starts at the first quartile and ends at the third quartile. Inside the box, a line is drawn at the median, parallel to the two quartile edges.

Stage 3: Finally, one whisker is drawn from the first quartile to the minimum and another from the third quartile to the maximum. Let’s illustrate this box plot using the Seaborn Python library.
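Before moving to Seaborn, the three stages can be sketched for the marks example with Matplotlib's `Axes.bxp`, which draws a box plot from precomputed statistics. This is only an illustrative sketch: the five values are the ones calculated by hand above, and the output file name `marks_boxplot.png` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Five numbers calculated by hand for the marks example.
five_numbers = {
    "whislo": 7,    # minimum: end of the lower whisker
    "q1": 35,       # first quartile: lower edge of the box
    "med": 75,      # median: line inside the box
    "q3": 90,       # third quartile: upper edge of the box
    "whishi": 97,   # maximum: end of the upper whisker
    "fliers": [],   # no separately drawn points in this sketch
}

fig, ax = plt.subplots()
ax.set_ylim(0, 100)                   # Stage 1: axis covering the data range
artists = ax.bxp([five_numbers])      # Stages 2 and 3: box, median line, whiskers
ax.set_ylabel("Maths marks")
fig.savefig("marks_boxplot.png")      # hypothetical output file name
```

Passing the statistics explicitly keeps the picture consistent with the hand calculation, since plotting libraries may compute quartiles with a different rule.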

Make a Box Plot on the IRIS dataset

We will use the well-known IRIS dataset, which contains four features (“Sepal Length, Sepal Width, Petal Length, Petal Width”) and a corresponding label with three flower categories: “Setosa, Versicolor, Virginica”.

Import Libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Read the dataset

df = pd.read_csv(
    'https://raw.github.com/pandas-dev/'
    'pandas/main/pandas/tests/io/data/csv/iris.csv'
)
print(df.head())

'''

SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
'''

Extract the features for which we want the plot

df_box = df[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

Plot the Box plot for all the features

plt.figure()
# Pass the DataFrame by keyword; Seaborn draws one box per numeric column
sns.boxplot(data=df_box)
plt.show()

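From this box plot, we can quickly summarize each individual feature and, at the same time, compare all the features in the dataset to learn more about the underlying patterns. Note that Seaborn, by default, ends each whisker at the farthest data point within 1.5 times the interquartile range from the box and draws anything beyond that as individual points, so a whisker may stop short of the true minimum or maximum.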
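To read the numbers behind each box, pandas' `describe()` can be used. The sketch below is self-contained, so it rebuilds a tiny `df_box` from the five rows printed by `df.head()` earlier; with the full dataset loaded, the last two statements apply unchanged. Keep in mind that pandas computes quartiles by linear interpolation, which can differ slightly from the median-of-halves method used in the worked example.

```python
import pandas as pd

# First five rows of the IRIS data, copied from the df.head() output above.
df_box = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidth": [3.5, 3.0, 3.2, 3.1, 3.6],
    "PetalLength": [1.4, 1.4, 1.3, 1.5, 1.4],
    "PetalWidth": [0.2, 0.2, 0.2, 0.2, 0.2],
})

# The rows "min", "25%", "50%", "75%", "max" hold the five numbers per feature.
summary = df_box.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(summary)
```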

Conclusion

A box plot provides a five-number summary of a numerical feature from a structured dataset. These five numbers are the minimum, first quartile, median, third quartile, and maximum. It is one of the most frequently used techniques in data visualization.

Enjoy Learning.
