Published in TheLeanProgrammer

Interpreting data visualizations-II

Understanding the good old “Whisker and the Box” plot…

In part I, we saw how we can interpret plots properly to get insights, and how the pair plot of the iris dataset shows the correlation among all 4 variables, which paves the path towards classifying the species.

But that plot does not really tell us much about the dataset itself. Having a good sense of what the dataset looks like makes the data storytelling even better!

Hence, in today’s article, we are going to talk about the Whisker and the Box Plot (ring any bells? Yes, the boring statistics class again!).

Source: https://www.simplypsychology.org

Well, it’s one of the most descriptive plots, as it directly gives the 5-number summary of the data you are plotting, namely:

  • Minimum score,
  • First Quartile,
  • Median,
  • Third Quartile, and
  • Maximum Score.

The box spans the first to the third quartile, with the median marked inside it, whereas the whiskers extend to the minimum and maximum scores (refer to the figure below).
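To make the 5-number summary concrete, here is a minimal sketch using only Python's standard `statistics` module. The helper name `five_number_summary` and the sample values are made up for illustration, and the quartiles use the "inclusive" interpolation method, which is only one of several conventions (other tools may give slightly different quartiles).

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # n=4 asks for the three cut points that split the data into quarters.
    # method="inclusive" interpolates between the smallest and largest values.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical sepal lengths in cm, invented for this example.
sample = [4.6, 4.9, 5.0, 5.1, 5.4, 5.8, 6.3, 6.4, 7.0]
minimum, q1, median, q3, maximum = five_number_summary(sample)
print("min:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "max:", maximum)
```

Here the minimum and maximum are simply the smallest and largest values in the sample.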

Nerd talk (brief definitions of the measures)…

Let me brief you about the details of the box plot so that we can interpret it easily. I am pretty sure everyone knows that the median lies in the middle of the dataset, i.e. it divides the data in half. It is usually not affected by outliers (extreme values) and hence is considered an important parameter.

But the median just gives us the mid-value; we don’t really get an idea of how far the data is spread before and after the median. Quartiles come to our rescue here!

Figure 1: Whisker and Box Plot

Quartiles divide the dataset such that 25% of the data is below the first quartile, 50% of the data is below the median (second quartile), and 75% of the data lies below the third quartile. The interquartile range (IQR) is calculated by subtracting the first quartile from the third quartile. Why did I mention IQR? The minimum score, i.e. the lower limit for the whisker, is calculated like this:

(first quartile-1.5*IQR)

and the maximum score, i.e. the upper limit for the whisker, is calculated similarly:

(third quartile+1.5*IQR)

The IQR also tells us about the variability of the data around the median of the dataset.

Minimum and maximum scores are important for filtering out the extreme values in the dataset (yes, talking about outliers; they can add bias to the analysis you are carrying out!).

Now that we have understood all the statistical measures, let us dive into the box plot below. Since we already got the insights from the pair plot for the iris dataset, we’ll just use this box plot to walk through interpreting a box plot for any given parameters and talk about the possible interpretations.
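The two limits can be sketched in a few lines of Python. This is a minimal illustration, assuming the IQR is the third quartile minus the first quartile and using the "inclusive" quartile method from the standard library. The function names and the scores are hypothetical.

```python
import statistics

def whisker_limits(values):
    """Return (lower_limit, upper_limit) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # third quartile minus first quartile
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def split_outliers(values):
    """Separate the values inside the limits from the ones outside them."""
    low, high = whisker_limits(values)
    inside = [v for v in values if low <= v <= high]
    outside = [v for v in values if v < low or v > high]
    return inside, outside

# Hypothetical scores; 40 is deliberately far away from the rest.
scores = [10, 12, 13, 14, 15, 16, 18, 40]
print(whisker_limits(scores))     # (7.125, 22.125)
print(split_outliers(scores)[1])  # [40]
```

Whether you drop, cap, or keep the flagged values depends on the analysis; the sketch only separates them.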

Unboxing the box plot...

Let us take a moment to observe all the box plots given in Figure 2. What do you see? Did you notice that the first box plot sits the lowest of all? That observation is a start, but it is not the whole story. Rather than looking at how low or high a box plot sits, look at where its median line lies relative to the boxes (IQRs) of the other box plots.

If the median line of one box lies outside the box of another, it is highly likely that there is a major difference between the two groups. And in this case, we see that all the box plots differ considerably from each other.

The takeaway from this?

The medians of the species’ box plots do not overlap, hence there is a considerable difference among the species in the sepal length data collected.

Now, this was with respect to measured data. What if we had a similar box plot for opinions collected from people about some issue? Then we could say something like: people from group 1 feel very differently from groups 2 and 3.
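This median comparison can be sketched as a small check in Python. It is a rule of thumb rather than a statistical test, and the helper names and the three groups of numbers below are hypothetical, not taken from the iris dataset.

```python
import statistics

def box(values):
    """Return (Q1, median, Q3), the three lines that make up the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3

def medians_outside_boxes(group_a, group_b):
    """True when each group's median lies outside the other group's box."""
    q1_a, med_a, q3_a = box(group_a)
    q1_b, med_b, q3_b = box(group_b)
    a_outside_b = med_a < q1_b or med_a > q3_b
    b_outside_a = med_b < q1_a or med_b > q3_a
    return a_outside_b and b_outside_a

# Hypothetical sepal lengths (cm) for three made-up groups.
group_1 = [4.8, 4.9, 5.0, 5.1, 5.2]
group_2 = [5.6, 5.8, 6.0, 6.2, 6.4]
group_3 = [4.9, 5.0, 5.1, 5.2, 5.3]

print(medians_outside_boxes(group_1, group_2))  # True: likely a real difference
print(medians_outside_boxes(group_1, group_3))  # False: the boxes overlap the medians
```

A `True` here only says the difference is worth a closer look; a proper significance test would still be needed to confirm it.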

Figure 2

Figure 2 has quite uniform box plots, with the median located roughly in the middle of each box. This tells us that the data is roughly symmetric, i.e. the mean is close to the median in such cases.

Now you may ask, what does it mean if the median is not exactly in the middle, like in the blue and purple plots in Figure 1? And why does the shape of the boxes vary so much? Why is the red one so compressed while the blue one is elongated? Also, why does the length of the whiskers vary so much?

Let us get all of these questions cleared by diving deeper into Figure 1. It’s okay if you can’t swim in this case (never-ending bad jokes it is, sigh).

Diving deeper…

The shape of the box plot gives us a good sense of the dispersion of the dataset. Dispersion is another name for the variability or spread of the data (how far the distribution of the data is stretched or squeezed). Hence we conclude that the longer the box, the greater the spread, and a shorter or squeezed box means the data is less dispersed.

From Figure 1, we can conclude that the petal length data is more dispersed than any of the other parameters. This ‘dispersion of data’ makes more sense in cases like emotion scores or opinions, because it tells us that on certain topics people tend to have similar opinions (hence the squeezed box plot), while on other topics people have varied and biased opinions (an elongated box plot with some outliers).

The varying length of the whiskers just means that one segment (the 25% we talked about) has more dispersed data than the other parts of the box plot: the longer the whisker, the greater the spread, and vice versa.
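As a rough illustration of squeezed versus elongated boxes, the sketch below compares the box length (the IQR) of two sets of opinion scores. The scores are invented for the example and `box_length` is a hypothetical helper name.

```python
import statistics

def box_length(values):
    """Length of the box, i.e. the interquartile range Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical opinion scores on a 1-10 scale for two topics.
topic_similar = [6, 6, 7, 7, 7, 7, 8, 8, 8]   # people mostly agree
topic_varied = [1, 2, 4, 5, 7, 8, 9, 10, 10]  # opinions are all over the place

print(box_length(topic_similar))  # 1.0 -> squeezed box
print(box_length(topic_varied))   # 5.0 -> elongated box
```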

Figure 3

Notice in Figure 3, given above, that the blue box plot has its median quite far from the center (towards the third quartile). Does this mean more of the sample lies below the median? But didn’t we just say each section holds a neat 25% of the sample? These are valid questions, and this is the most common mistake made while interpreting a box plot.

If one quartile section is longer than the other, it DOES NOT mean that section contains more data points! It just means that the data points there are more spread out, thus covering a wider range. A box plot CANNOT tell us the sample size we are using. So, the 25% rule stands NO MATTER WHAT!

Also, the reason the median is away from the center is that the data is skewed. Now the above point will make more sense than ever; this is the beauty of statistics.

Coming back to the point: if the longer part of the box is situated below the median, the data is skewed to the left, and if it is above, the data is skewed to the right. Refer to the figure below for a better understanding of data being skewed to the right and to the left.
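The 25% rule is easy to check numerically. The sketch below, with a hypothetical helper `quarter_report` and made-up numbers, reports how wide each of the four sections is and how many points fall in it. With small samples or tied values the counts are only approximately equal.

```python
import statistics

def quarter_report(values):
    """For each of the four sections, return (width, number_of_points)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    edges = [data[0], q1, median, q3, data[-1]]
    report = []
    for i in range(4):
        low, high = edges[i], edges[i + 1]
        # Count the points in (low, high]; the first section also
        # includes the minimum itself.
        count = sum(1 for v in data if low < v <= high)
        if i == 0:
            count += sum(1 for v in data if v == low)
        report.append((high - low, count))
    return report

# Hypothetical right-skewed sample with 8 points.
values_example = [1, 2, 3, 4, 5, 8, 14, 30]
for width, count in quarter_report(values_example):
    print("width:", width, "points:", count)
```

The last section is far wider than the first, yet every section holds the same two points.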
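One way to turn this reading of the box into a number is the quartile (Bowley) skewness, which compares the upper and lower parts of the box. It is a swapped-in measure for illustration, not something the plot itself reports, and the function names and samples below are hypothetical.

```python
import statistics

def quartile_skew(values):
    """Bowley skewness: (upper box part - lower box part) / box length."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    if q3 == q1:
        # The box has no length; this sketch simply treats that as no skew.
        return 0.0
    return ((q3 - median) - (median - q1)) / (q3 - q1)

def skew_direction(values):
    """Name the direction of skew suggested by the box."""
    s = quartile_skew(values)
    if s > 0:
        return "right"  # longer part of the box is above the median
    if s < 0:
        return "left"   # longer part of the box is below the median
    return "symmetric"

# Hypothetical samples, invented for this example.
print(skew_direction([1, 2, 3, 4, 5, 8, 14, 30]))        # right
print(skew_direction([1, 17, 23, 26, 27, 28, 29, 30]))   # left
print(skew_direction([1, 2, 3, 4, 5, 6, 7, 8]))          # symmetric
```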

Source: https://www.statisticshowto.com

Figure 3 explains how the three parameters are situated in the case of skewed distributions. But Bilwa, what is the reasoning behind this? Well, it is highly contextual. Say we are collecting people’s sentiment about a political party. If I consider the blue box plot as A and the red box plot as B in the figure, I can say that people have varied opinions about A and some are highly biased, whereas in the case of B, almost everyone has the same opinion except for a few extreme opinions (yes, the little circles, our outliers).

More skewness…

Let us look at the context of finance, where the concept of skewness is used in the distribution and prediction of returns on certain investments (stocks, property, bitcoin, etc.). Although many financial models assume that returns follow a normal distribution, in reality, returns are usually skewed.

The positive skewness of a returns distribution can imply that the investor may expect frequent small losses and a few large gains from the investment (depends on the years, of course!).

Positively skewed distributions of investment returns are generally preferred by investors, as there is a fair probability of gaining huge profits that can cover the frequent small losses. Now we know what to say if we see a box plot of returns, or what to tell the broker about the prices of a certain property increasing over time, etc.

The skewness of the data can also tell us whether the data collected for our analysis is suitable or not. For example, if you are training a model to detect chest diseases from X-rays, we can’t feed it X-rays of one disease over and over again just because that is the data we have. Skewness thus guides us to clean and preprocess the data.
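To see what "frequent small losses and a few large gains" looks like in numbers, here is a small sketch with entirely made-up yearly returns (in percent). It computes the moment-based skewness, the third central moment divided by the second central moment raised to the power 1.5, which is one common definition among several. The helper name is hypothetical.

```python
import statistics

def moment_skewness(values):
    """Skewness as m3 / m2**1.5, using population (divide-by-n) moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical, so there is nothing to skew
    return m3 / m2 ** 1.5

# Invented returns: eight small losses and two large gains.
returns = [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 6.0, 12.0]

print("mean:", statistics.fmean(returns))      # pulled up by the big gains
print("median:", statistics.median(returns))   # stays with the small losses
print("skewness:", moment_skewness(returns))   # positive for this sample
```

The mean sitting above the median is the same right skew that shows up on a box plot as a longer upper part.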
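For labelled data like the X-ray example, a quick sanity check is to look at the share of each class before training. The sketch below is only an illustration: the label names and counts are invented, and `class_shares` is a hypothetical helper.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples that belongs to each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels for 10 X-rays; one disease dominates the set.
labels = ["pneumonia"] * 8 + ["normal"] + ["tuberculosis"]

for label, share in class_shares(labels).items():
    print(label, share)
```

A set this lopsided is a hint to collect more examples of the rare classes or rebalance before training.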

Looking out for outliers

This part becomes super convenient with the box plot, as the outliers are plotted as cute circles outside the whiskers. That makes them easy to spot and handle, so they do not add unnecessary bias to our analysis.

Final conclusion…

Box plots give great insight into the center, dispersion, symmetry, and skewness of a dataset, which can be turned into stories according to the given context (dataset, finance, emotion scores, polls, etc.). This plot is best used to compare groups, or to order the groups with the help of any statistical measure.

YESS! REMEMBER GARFIELD??

Phew, way too much for one plot, Bilwa. Yes, this plot looks way too simple, but it carries a good amount of information about the data plotted. Next time you see a box plot, you will know what secrets it is carrying :))

Stay safe and take care of your loved ones in these tough times! Feel free to reach out to me in case of any queries on LinkedIn.

Don’t forget to follow The Lean Programmer Publication for more such articles, and subscribe to our newsletter tinyletter.com/TheLeanProgrammer

Bilwa Gaonker
