All About the Box and Whisker Plot

(Noobs, I got your back!)

The picture was taken from Google

A whisker boxplot (also called a box-and-whisker plot) serves much the same purpose as the standard deviation, but visually: it shows the overall spread, or dispersion, of the data without having to inspect the dataset closely.

To understand the whisker boxplot, we need to be familiar with an important term widely used in statistics: the 5-number-summary of a dataset. The five numbers are the minimum value, the quartiles (all three of them) and the maximum value. I will explain each of them as simply as possible and lead you comfortably to today's agenda: the whisker boxplot.

Maximum and minimum value: Simply the largest and the smallest values in a dataset.

Quartiles: There are only three quartiles: Q1, Q2 and Q3. Please note that quarters and quartiles are different things. Unlike quarters, quartiles have no Q4, because the quartiles are the three cut points that divide the ordered data into four regions.

This picture will feel very familiar once you have finished reading the whole article.

To plot a whisker boxplot manually, you must first find all the quartiles. Finding the quartiles is very easy. Q2 is simply the median of the dataset. I will demonstrate the whole process.

Finding quartiles for an ODD number of data within a dataset:

Data: 8,9,2,10,3,5,7,12,15
Order: 2,3,5,7,8,9,10,12,15
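So, the median is 8 (the middle one). This is Q2.
Q1 is the median of the lower half of the data.
Now the lower half: 2,3,5,7,8 (with an odd number of data, I count the median itself in both halves; some textbooks leave it out, which can give slightly different quartiles)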
So, Q1 is 5.
Q3 is the median that comes from the upper half of the data.
Now, the upper half: 8,9,10,12,15
So, Q3 is 10.
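
The steps above can be sketched in a few lines of Python. This is only a minimal sketch, assuming the convention used in this walkthrough (the median is counted in both halves when the number of data is odd). The function name `quartiles` is my own choice, and other tools may follow a different convention and return slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves steps shown above."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd count: the middle value is kept in both halves (assumed convention).
        lower = ordered[:half + 1]
        upper = ordered[half:]
    else:
        # Even count: the data split cleanly into two halves.
        lower = ordered[:half]
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

scores = [8, 9, 2, 10, 3, 5, 7, 12, 15]
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3)  # 5 8 10
```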

Finding quartiles for an EVEN number of data within a dataset:

Data: 10,12,14,15,14,16,17,18,10,19,17,17
Order: 10,10,12,14,14,15,16,17,17,17,18,19
As the number of data is even, the median is the average of the middle two values.
Average of 15 and 16 is 15.5.
So, Median is 15.5. This is Q2.
Q1 is the median that comes from the lower half of the data.
Now the lower half: 10,10,12,14,14,15
Average of 12 and 14 is 13.
So, Q1 is 13.
Q3 is the median that comes from the upper half of the data.
Now, the upper half: 16,17,17,17,18,19
Average of 17 and 17 is 17.

So, Q3 is 17.
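
To double-check the arithmetic, here is a small Python sketch that builds the whole 5-number-summary for this even-sized dataset. The helper name `five_number_summary` is hypothetical, and the sketch assumes the same median-of-halves convention as the manual steps above, so a spreadsheet or statistics package using another quartile convention may report slightly different Q1 and Q3.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) for a list of numbers."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    # For an odd count, n % 2 is 1, so the middle value stays in both halves
    # (assumed convention, matching the manual walkthrough).
    lower = ordered[:half + n % 2]
    upper = ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

marks = [10, 12, 14, 15, 14, 16, 17, 18, 10, 19, 17, 17]
print(five_number_summary(marks))  # (10, 13.0, 15.5, 17.0, 19)
```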

Now that we all understand the 5-number-summary of a dataset, let's take on a real-world challenge that can be easily solved and visualized with a whisker boxplot.

Example scenario:

Professor Jarvis gave an exam to two sections in his department (let us name them Section A and Section B) using exactly the same question paper. He wanted to determine which section did better in the exam. Being pretty wise, like us, he wanted to visualize the performance with a whisker boxplot. He therefore collected the 5-number-summary of both sections for that exam: the maximum, the quartiles and the minimum score in each section.

5-number-summary collected from both the classes
We need to find out which class did better, which is a very common kind of comparison in daily life. We may not face this exact example, but I strongly suspect we all do something very similar.

To compare the overall performance of the two sections visually, we need to draw a whisker boxplot. Plot the 5-number-summary of both sections A and B along either axis (I have chosen the Y-axis here), as in the picture below. Draw a rectangle (the BOX) from Q1 to Q3 with a line inside it at the median (Q2), keep the maximum and the minimum values outside the box, and join them with straight lines (the WHISKERS) to Q3 and Q1 respectively.
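
If you would rather see the construction in code than on paper, here is a rough Python sketch that renders a 5-number-summary as a one-line text boxplot. Everything in it is my own illustration: the function name `text_boxplot`, the characters used and the scale are assumptions, and the numbers fed in are the summary of the ODD example from earlier (2, 5, 8, 10, 15), not the scores of Professor Jarvis's sections.

```python
def text_boxplot(minimum, q1, q2, q3, maximum, scale_min, scale_max, width=41):
    """Render a 5-number-summary as one line of text: |-----[==M==]-----|"""
    def col(value):
        # Map a data value onto a column index of the text scale.
        return round((value - scale_min) / (scale_max - scale_min) * (width - 1))

    line = [" "] * width
    # Whiskers: straight lines from the minimum to Q1 and from Q3 to the maximum.
    for i in range(col(minimum), col(q1)):
        line[i] = "-"
    for i in range(col(q3) + 1, col(maximum) + 1):
        line[i] = "-"
    # Box: the rectangle that spans Q1 to Q3.
    for i in range(col(q1), col(q3) + 1):
        line[i] = "="
    # Markers for the five numbers themselves.
    line[col(minimum)] = "|"
    line[col(maximum)] = "|"
    line[col(q1)] = "["
    line[col(q3)] = "]"
    line[col(q2)] = "M"
    return "".join(line)

# Summary of the odd example dataset, drawn on a scale from 0 to 20.
row = text_boxplot(2, 5, 8, 10, 15, scale_min=0, scale_max=20)
print(row)
```

Printing one such row per section on the same scale gives a crude version of the hand-drawn comparison.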

(I didn't have any drawing tools at hand while writing this story, so I tried the simplest way possible, on a plain piece of white paper. Excuse me for that. Thanks to my beloved colleagues for managing at least a scale for me!)

From the chart above, it is pretty clear that section B did better: its box is more compact, and the middle 50% of the marks (inside the box) sit higher than section A's, even though both sections have the same maximum value. In contrast, section A's boxplot is more dispersed, which means some of its students cut a poor figure in the exam (statistically, 25% of the students in section B scored below 24, whereas 50% of the students in section A scored below 22).
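
The same reading of the chart can be put into numbers. The sketch below is a hedged illustration with a helper name of my own (`spread`): it measures how tall the box is (Q3 minus Q1, usually called the interquartile range) and how long the whole plot is (maximum minus minimum). Since the sections' scores are not reproduced in this text, it uses the two worked examples from earlier as stand-ins, not the real scores of sections A and B.

```python
def spread(summary):
    """Return (range, box height) for a (minimum, Q1, Q2, Q3, maximum) tuple."""
    minimum, q1, _q2, q3, maximum = summary
    return maximum - minimum, q3 - q1

# Stand-in summaries taken from the two worked examples above.
odd_example = (2, 5, 8, 10, 15)
even_example = (10, 13, 15.5, 17, 19)

for name, summary in (("odd example", odd_example), ("even example", even_example)):
    full_range, box_height = spread(summary)
    print(f"{name}: median={summary[2]}, box height={box_height}, range={full_range}")
```

A smaller box height with a higher median is the numeric version of "more congested and sitting higher".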

For demonstration purposes, I drew the chart by hand. However, this is a popular chart, and almost all BI tools (including Microsoft Excel) have built-in support for the whisker boxplot, so building the same chart digitally takes only a few clicks.

I hope all of the above makes sense and helps you understand and apply this knowledge wherever you might need it.
If you liked it, please clap and follow me for further reads on statistics and Business Intelligence. I promise to produce better reads ahead.