Interpreting data visualizations- III

Bilwa Gaonker
Published in TheLeanProgrammer
5 min read · May 13, 2021

In the last article, we dived deep into interpreting box plots, which gave us insights into the median, skewness, and dispersion of the given data. But we did face one issue with them: we couldn’t tell where the frequency of the data points is highest, basically the shape of the distribution. “But Bilwa, since we know what the skewness and dispersion of the data look like, can’t we just draw conclusions directly from that?” Fair question, I must say, but there’s one thing we are quietly assuming here that can be quite wrong (considering data can be of any type).

All Joey fans, gimme a cheer! Now, let’s get back to our talk :))

For the whole article I blabbered about the Median and Mean of the data, and we forgot to invite Mode to our party. Well, is it kinda the box plot’s fault? No, it is still the most useful plot when it comes to giving out insights, and the invitations were kind of our responsibility (sigh, so irresponsible!). Okay, relax now. Let us study Mode a bit so that we can figure out what to do, okay?

Let’s take a swim in the statistical waters

We just assumed that our dataset follows a bell-shaped curve like the normal distribution, i.e. that the data has only one mode. The proper mathematical term for this is a unimodal distribution.

Source: Scribbr

It is highly possible that the distribution of data can be bimodal or multimodal too.

A distribution with two peaks is called a bimodal distribution, and a distribution with two or more peaks is multimodal.

What does this mean then? Simply put, such data does not follow the normal distribution curve, so it calls for a different, often more complex, method of statistical analysis. And in most cases, the analyst needs to take a closer look at the data to check whether multiple distributions are overlapping.
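To see why this matters, here is a minimal sketch in Python using only the standard library. The sample is made up for illustration (two groups centred at 2 and 8), and the helper names `five_number_summary` and `histogram` are hypothetical, not from any library. It prints the five numbers a box plot would draw, followed by a rough text histogram.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical bimodal sample: two groups centred at 2 and 8.
bimodal = [random.gauss(2, 0.7) for _ in range(500)] + \
          [random.gauss(8, 0.7) for _ in range(500)]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)


def histogram(values, bins=10):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


summary = five_number_summary(bimodal)
print("min, Q1, median, Q3, max:", [round(v, 2) for v in summary])

counts = histogram(bimodal)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * (c // 10)}")
```

In this sketch the median lands in the valley between the two peaks, where almost no points live, and nothing in the five numbers hints that there are two peaks at all. The histogram shows them right away.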

Wait, but how to invite Mode now?

Mode definitely seems to be all elegant, classy, and really a VIP (Very Important Parameter) from what we saw above. I think I know what would make Mode happy once we invite it over! I read in its interview that Mode loves Violin music! Voila, guess who’s coming to our rescue this time? IT IS THE VIOLIN PLOT 😮

Who is this now? Why is it so special? And will the box plot get along with this unknown plot? Let us finally start with the introduction and interpretation of the Violin plot. Shall we?

Introduction to Violin plot…

A violin plot is a viz very similar to a box plot, with the probability density of the data added to it, mirrored on each side. Now some stats buff might ask how you can possibly plot a probability density function without assuming a parametric form. Well, this plot uses Kernel Density Estimation (KDE) to estimate this PDF (probability density function). Wikipedia defines it as:

KDE is a non-parametric way to estimate the PDF of a random variable. It is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample.
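The idea can be sketched in a few lines of plain Python: put a small bump (a kernel) on every data point and average the bumps. This is a minimal sketch assuming a Gaussian kernel and a hand-picked bandwidth. The sample values, the bandwidth of 0.4, and the function names `gaussian_kernel` and `kde` are all made up for illustration; plotting libraries choose the bandwidth for you with their own rules.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(sample, x, bandwidth):
    """Estimate the density at x by averaging kernels centred on each point."""
    n = len(sample)
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in sample)
    return total / (n * bandwidth)


# Hypothetical toy sample with two clusters (made up for illustration).
sample = [1.0, 1.2, 1.4, 1.5, 1.7, 4.8, 5.0, 5.1, 5.3, 5.6]
bandwidth = 0.4  # assumed value, picked by hand for this sketch

grid = [i * 0.5 for i in range(14)]  # 0.0, 0.5, ..., 6.5
density = [kde(sample, x, bandwidth) for x in grid]

# A rough text rendering: longer bars mean higher estimated density.
for x, d in zip(grid, density):
    print(f"{x:4.1f} {'#' * round(d * 60)}")
```

The bars bulge around the two clusters and nearly vanish in between. Mirror that outline around an axis and you have the body of a violin. A larger bandwidth smooths the two bumps together, while a smaller one makes the curve spikier.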

Figure 1: FACE REVEAL!! (Source: Mode)

The plots are as elegant as the violin in real life. And yes, you observed it right! A violin plot is basically a box plot along with a PDF of the data. Figure 1 (left) shows the box plot features (which we already know!), and Figure 1 (right) shows the distribution: a broader section means a higher probability that our variable takes those values, i.e. that section of the violin holds a higher frequency of data points. This takes care of all the modal properties we researched before. (Now you know why Mode loves Violin music!)
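If you want to draw one yourself, here is a rough sketch using matplotlib’s `Axes.violinplot` with a slim `Axes.boxplot` overlaid on top. The data and group names are made up for illustration (they are not the iris measurements), the output filename is arbitrary, and the figures in this article may well have been produced with a different library.

```python
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical data, made up for illustration.
groups = {
    "one peak": [random.gauss(5, 1) for _ in range(300)],
    "two peaks": [random.gauss(3, 0.6) for _ in range(150)]
                 + [random.gauss(7, 0.6) for _ in range(150)],
}
data = list(groups.values())

fig, ax = plt.subplots(figsize=(6, 4))

# Each violin body is a kernel density estimate mirrored around its position.
parts = ax.violinplot(data, showmedians=True, showextrema=True)

# Overlay a slim box plot so the quartile box sits inside each violin.
ax.boxplot(data, widths=0.08, showfliers=False)

ax.set_xticks([1, 2])
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("value")
fig.savefig("violin_sketch.png")
```

The two groups are built to have similar centres, so their boxes alone look much the same kind of thing; the violin outline is what reveals that the second group has two bulges instead of one.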

Then why not just plot the probability distributions with density plots? Density plots for several groups tend to overlap most of the time, and you don’t want to keep going back to check which color depicts which set of data! Hence, violin plots are very convenient (especially for comparing groups and getting insights; look at Figure 2 below).

Interpretation of Violin plots-

Figure 2: Species vs Petal Length

It elegantly shows us the distribution of the petal lengths in our iris dataset. One thing is clear: most setosa petal lengths are around 1.5 cm. Violin plots also make ranking easier since we know the distributions well! For instance, here we can say that Virginica has the largest petal lengths among the 3 species.

For better interpretation, and since all of us are used to looking at PDFs horizontally, we can plot the violins horizontally too.

Figure 3: Petal Width vs Species

From Figures 2 and 3, we can conclude that the setosa species has much smaller petal features than Versicolor and Virginica, while Virginica has the largest petal features among them. We can also observe that the petal width values of Virginica and Versicolor are spread out, so we cannot really classify them based on petal width alone.

Figure 4 below follows a similar trend to Figures 2 and 3. The most peculiar thing to see here is that Virginica’s sepal length overlaps with the others.

Figure 4: Sepal Length vs Species

Figure 5 shows that setosa has a noticeably larger sepal width than the other two species, which contrasts with the trend we saw before, right? This is how the violin plot brings some really good insights to light with the PDF feature added to it, and we already know how insightful the box plot is by itself!

Figure 5: Sepal Width vs Species

See the magic that the violin plot exerts over the visualization? I mean who wouldn’t want to blow away their audiences with such a beautiful plot, right? We also learnt how to interpret it if we are given one. Stay tuned for more interpretations like this! You can connect with me on LinkedIn if you have any queries related to my articles.

Don’t forget to follow The Lean Programmer Publication for more such articles, and subscribe to our newsletter tinyletter.com/TheLeanProgrammer
