Interpreting the data visualizations - Part I

It is said that people who analyze data should be good storytellers; the better term for this skill is ‘Data Storytelling’.

Bilwa Gaonker
TheLeanProgrammer
5 min read · Apr 21, 2021

Just look at the pair plot below. Looks beautiful, right? Look at those blue, red, and green points nicely clustered to form groups…

Figure: pair plot of the iris dataset (Kaggle, Bilwa Gaonker)

Wait! Say this in an interview and, sorry bud, you ain’t fit for the job. Now you may ask, “Bilwa, what did you mean by good storytelling then?” Definitely not describing the physical appearance of the plots, because that is already visible to your audience. You need to tell them the significant and relevant insights from the data points plotted in the graph, i.e. interpret the data visualizations.

Interpreting the data

Let us cut to the chase and interpret the pair plot, shall we?

  • My first go-to while looking at a plot is the axes. Our math teacher was right after all to cut marks when we didn’t write the scale and axis names. Yes! The labels on the x-axis and y-axis tell us what we are trying to compare or relate.
    For instance, in the above pair plot, the x-axis and y-axis each carry the same 4 variables, namely “sepal_length”, “sepal_width”, “petal_length”, and “petal_width”, and the points are color-mapped to the different species of iris.
  • A pair plot depicts the pairwise relationships in a dataset: every variable is plotted against every variable. Hence, with 4 variables, there are 4 × 4 = 16 plots inside this pair plot (plot twist! sorry, bad joke).
  • You must have noticed that the primary diagonal has univariate distribution plots, whereas the rest are scatterplots. If you look closely enough, our friend from statistics, correlation, says hello!
    Correlation can be of three types: negative, positive, and no correlation (shown in the figure below).
    Wait! What do these three types of correlation signify? Positive means that the two variables move in the same direction, i.e. they increase and decrease together. Negative means that they move in opposite directions, i.e. one increases while the other decreases. Lastly, no correlation means that no linear relationship is visible, or “can’t say anything”.
Source: https://www.latestquality.com
  • Lastly, look at the color mapping and jot down the conclusions for our story.
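To put a number on the three types of correlation from the list above, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The helper name `pearson_r` and the toy values are my own assumptions for illustration; they are not taken from the iris data or from the figure.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means, and each variable's own spread.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, invented purely for illustration (NOT taken from the iris data).
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]     # tends to increase as x increases
falling = [6, 4, 5, 4, 2]    # tends to decrease as x increases
unrelated = [3, 5, 2, 5, 3]  # no linear trend with x

for name, y in [("rising", rising), ("falling", falling), ("unrelated", unrelated)]:
    print(f"{name}: r = {pearson_r(x, y):+.2f}")
```

A value close to +1 corresponds to the "positive" picture, a value close to -1 to the "negative" one, and a value near 0 to "no correlation". Keep in mind that r only measures a straight-line relationship, so a curved pattern can still score near 0.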

Yes, I think this background is enough to start comprehending the pair plot. So, let’s begin! As the pair plot is based on the iris dataset, we need to narrow down the parameters that are useful for predicting the species of a flower.
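If you want to follow along with a similar figure of your own, the sketch below is one possible way to do it. It assumes seaborn and matplotlib are installed and that seaborn can download its example iris dataset; it is not necessarily how the figure in this article was produced. The helper names `panel_kinds` and `draw_pair_plot` are hypothetical, made up for this example.

```python
from itertools import product

# Column names as they appear on the axes of the pair plot.
FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

def panel_kinds(features):
    """Map each (row, column) variable pair to the kind of panel drawn there."""
    return {
        (y, x): "univariate distribution" if x == y else "scatterplot"
        for y, x in product(features, repeat=2)
    }

def draw_pair_plot():
    """Render a pair plot colored by species.

    Assumption: seaborn and matplotlib are installed, and seaborn can fetch
    (or has cached) its example iris dataset.
    """
    import matplotlib.pyplot as plt
    import seaborn as sns

    iris = sns.load_dataset("iris")     # the four features plus a "species" column
    sns.pairplot(iris, hue="species")   # color-map the points by species
    plt.show()

panels = panel_kinds(FEATURES)
scatter_count = sum(kind == "scatterplot" for kind in panels.values())
print(f"{len(panels)} panels, {scatter_count} of them scatterplots")
# draw_pair_plot()  # uncomment to render the figure
```

The plotting call is left commented out so the snippet runs even without the plotting libraries; the part that always runs just counts the panels in the grid.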

Further interpretation…

Let us first look at the primary diagonal, which has the univariate distributions. The “sepal_length” and “sepal_width” distributions of the different species overlap one another, so we can’t say that these parameters alone would be enough for classifying the species.

In the “petal_length” and “petal_width” distributions, on the other hand, the univariate distribution of iris-setosa stands out in both cases, which gives us our first conclusion: “The setosa species can be differentiated from the other two with the help of two parameters: petal length (0–2 cm) and petal width (0–0.8 cm)”.
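The first conclusion can be written down as a tiny rule. This is only a sketch: the function name `looks_like_setosa` is hypothetical, the thresholds are the ones read off the diagonal panels above, and the sample rows are a few hand-typed, iris-style measurements used for illustration rather than the full dataset.

```python
# A few iris-style measurements in cm, typed in by hand for illustration only;
# treat them as assumed sample values, not as the full 150-row dataset.
SAMPLE = [
    # (petal_length, petal_width, species)
    (1.4, 0.2, "setosa"),
    (1.3, 0.2, "setosa"),
    (1.5, 0.2, "setosa"),
    (4.7, 1.4, "versicolor"),
    (4.5, 1.5, "versicolor"),
    (4.0, 1.3, "versicolor"),
    (6.0, 2.5, "virginica"),
    (5.1, 1.9, "virginica"),
    (5.9, 2.1, "virginica"),
]

def looks_like_setosa(petal_length, petal_width):
    """Rule read off the diagonal panels: short, narrow petals suggest setosa."""
    return petal_length <= 2.0 and petal_width <= 0.8

for petal_length, petal_width, species in SAMPLE:
    flagged = looks_like_setosa(petal_length, petal_width)
    print(f"{species:<10} length={petal_length} width={petal_width} -> setosa? {flagged}")
```

On these few rows the rule flags only the setosa flowers. Before trusting it, you would want to check it against every row of the real dataset.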

Look at the figure above: I have crossed out the plots that didn’t show any correlation (for your understanding, please try comparing the first and second figures). In the first row, sepal_length vs petal_length, we see that for both species there is a positive correlation between the variables.

Similarly, in the other rows we can see a positive correlation between sepal_width and petal_width, petal_length and petal_width, petal_length and sepal_length, and petal_width and sepal_width. So how do I conclude in this case? Well, to get a proper analytical insight we need higher-level machine learning methods such as classification algorithms.

But we can still get a useful insight from this one. Observe the petal_length vs petal_width plot closely for versicolor and virginica (the lines I have drawn are just to give you a better understanding of how it might look if we were to plot a line passing through each cluster of the scatterplot). The two clusters aren’t overlapping, and you could conclude that “the slope of the lines passing through the petal length vs petal width plot would be the ratio required to distinguish versicolor from virginica”.

Similarly, the scatterplot of petal_width vs petal_length has a positive correlation only for the versicolor species, so our third conclusion is that “the slope of the line passing through the scatter plot of versicolor’s petal width and length would be a distinguishing characteristic for it.”**
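To make "the slope of the line" concrete, here is a rough sketch that fits a least-squares line through each cluster. The function name `fit_line` is hypothetical, and the points are a few hand-typed, iris-style (petal_length, petal_width) pairs used only for illustration, so the slopes it prints should not be read as the slopes of the real clusters.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hand-typed, iris-style (petal_length, petal_width) pairs in cm.
# Assumed sample values for illustration, NOT the full dataset.
PETALS = {
    "versicolor": [(4.7, 1.4), (4.5, 1.5), (4.9, 1.5), (4.0, 1.3), (4.6, 1.5)],
    "virginica": [(6.0, 2.5), (5.1, 1.9), (5.9, 2.1), (5.6, 1.8), (5.8, 2.2)],
}

fits = {}
for species, points in PETALS.items():
    lengths, widths = zip(*points)
    fits[species] = fit_line(lengths, widths)
    slope, intercept = fits[species]
    print(f"{species}: width = {slope:.2f} * length + {intercept:.2f}")
```

With only five points per species the fitted slopes are noisy. Running the same function on all the rows of each species would be the fairer way to compare the two lines.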

Yes, this is how we can interpret any given pair plot and extract the conclusions required for the ‘storytelling’ part. As I said earlier, the deeper we dive into the data, the more analytical insight we get from the pair plot. But the whole motive of this article was to convey how a pair plot works, what you can possibly interpret from it, and why you need one. I hope to interpret more types of plots in my next articles. Feel free to leave any feedback/suggestions for this article!

I have been saying “plot the line” and “slope of the line” quite frequently in the paragraph marked **; yes, I am talking about ‘Linear Regression’ (spoiler for more upcoming articles!).

Feel free to reach out to me on LinkedIn in case of any queries :) Stay safe and take care during these tough times!
