Data Visualization with Python and Seaborn — Part 5: Scatter Plot & Joint Plot

Random Nerd
7 min read · Feb 4, 2019

This article continues from our previous article on Linear Regression plots, where we worked extensively with various Scatter plots. The new concept that we shall look into today is the Joint plot, and along the way we shall revisit Scatter plots with a few examples.

Scatter plots are similar to Line graphs in that they show how one variable changes with another; the strength of this relationship between two variables is statistically termed their correlation. As we’ve previously observed, the closer the plotted data points come to forming a straight line, the higher the correlation between the two variables and the stronger their relationship.

So, if the data points form a straight line rising from the origin out to high x and y values, the variables are said to have a positive correlation; if the line instead falls from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation.

The correlation coefficient ranges from -1 to 1, and its sign tells us whether the relationship is positive or negative. The best-fitting line is commonly found using LSR, i.e. Least Squares Regression, which minimizes the sum of the squared vertical distances between the data points and the line. Apart from correlation, another statistically relevant concept to be noted with Scatter plots is (Linear) Interpolation and Extrapolation.

Just to give you an overview, Interpolation is where we estimate a value inside the range of our available data points, whereas Extrapolation is where we estimate a value outside that range. Hence, it is always advised to be careful with Extrapolation: we are in uncharted territory, relying on the assumption that the observed trend continues, so the results may be misleading.
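To make the difference concrete, here is a minimal sketch using NumPy only. The five observations are made-up numbers, and predict is a hypothetical helper introduced just for this illustration; the line itself is fitted by least squares with np.polyfit.

```python
import numpy as np

# Hypothetical sample: five known (x, y) observations (made-up numbers).
x_known = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_known = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares straight line: y = slope * x + intercept.
slope, intercept = np.polyfit(x_known, y_known, deg=1)


def predict(x):
    """Evaluate the fitted line at x (hypothetical helper)."""
    return slope * x + intercept


x_inside = 3.5    # lies within the observed range [1, 5]: interpolation
x_outside = 12.0  # lies beyond the observed range: extrapolation

# Check which case each query point falls into.
is_interpolation = x_known.min() <= x_inside <= x_known.max()
is_extrapolation = not (x_known.min() <= x_outside <= x_known.max())

print("interpolated:", predict(x_inside), is_interpolation)
print("extrapolated:", predict(x_outside), is_extrapolation)
```

The extrapolated value is only as trustworthy as the assumption that the straight-line trend continues well past the last observed point.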

Coming back to Correlation, in simpler words it is:

  • Positive when the values increase together.
  • Negative when one value decreases as the other increases.
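The sign of the correlation can be checked numerically as well. The sketch below assumes synthetic data (a straight line plus random noise, not the article's dataset) and uses np.corrcoef to compute the correlation coefficient for a rising and a falling relationship.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a line plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
noise = rng.normal(scale=1.0, size=x.size)

y_up = 2.0 * x + noise     # values increase together
y_down = -2.0 * x + noise  # one value decreases as the other increases

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(f"r_up = {r_up:.3f}, r_down = {r_down:.3f}")
```

The closer the points sit to a straight line, the closer the coefficient gets to 1 or -1; increasing the noise scale pulls it towards 0.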

Strictly speaking, these are statistical concepts and I don’t really have to cover them in a Seaborn series, but I know how important it is to understand them well, and not everyone comes from a statistical background. You will often find me trying to get you at least acquainted with such terms, so that you know what is being discussed and how to draw inferences.

Let us revisit the Scatter plot with a dummy dataset to quickly visualize these two mathematical terms on a plot; note that these concepts remain the same, be it Seaborn, Matplotlib, Bokeh, Plotly, or whichever package you’re using to plot your data. The dataset that I will be using for this demonstration is a random one, freely available online:

Let us now try to visualize Birth Rate against Average annual income using our Scatter plot:

Before I discuss the statistical terms further, let me share something that will be useful for you even in production work. This is a list of Tableau colors that is available in Matplotlib and can be used in Seaborn as well, as covered earlier in the Aesthetics lecture. Let us set it up:
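One way to set this up is sketched below. It assumes that tableau_20 is meant to hold the standard Tableau 20 palette, which Matplotlib ships as the "tab20" colormap; the variable name follows the one used later in this article.

```python
import matplotlib.pyplot as plt

# "tab20" is Matplotlib's built-in version of the Tableau 20 palette.
# Each entry is an (r, g, b) tuple with components scaled to the 0-1 range.
tableau_20 = list(plt.get_cmap("tab20").colors)

print(len(tableau_20))
print(tableau_20[7])
```

Any entry, e.g. tableau_20[7], can then be passed as the color argument of a Seaborn plotting function.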

Looking at this plot, we can say that, in general, as Annual income increases in a country, the corresponding Birth rate decreases. There are certainly a few outliers (like Bangladesh, etc.) in our dataset, visible at the top left, where the income is low but the corresponding Birth Rate is exceptionally high. Note that I have fitted a straight line to this dataset, so the fit isn’t the best. Maybe a curved line would cover the data points better (for example, via the order parameter of regplot or lmplot); you may experiment with that as your homework.

In terms of values, we may say that the mean lies at an Annual Income of around $39,000.00 with a Birth Rate close to 7.5. Comparing this with the summary statistics from the .describe() method on our dummy dataset, we do see a close match. In terms of parameters, scatter_kws controls the appearance of the scattered data points on our plot, using the keyword arguments of Matplotlib’s plt.scatter. Similarly, line_kws does the same for the line or curve passing through these data points, using the keyword arguments of plt.plot. Additionally, the color parameter assigns the color and alpha determines transparency.
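Here is a minimal sketch of how these parameters fit together in sns.regplot. The original dataset is not reproduced here, so the DataFrame below is a synthetic stand-in with made-up values, and the column names income and birth_rate are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the article's dataset (values are made up).
rng = np.random.default_rng(42)
income = rng.uniform(1_000, 80_000, size=60)
birth_rate = 12.0 - 0.0001 * income + rng.normal(scale=1.0, size=60)
df = pd.DataFrame({"income": income, "birth_rate": birth_rate})

# Tableau 20 palette as shipped with Matplotlib.
tableau_20 = list(plt.get_cmap("tab20").colors)

fig, ax = plt.subplots(figsize=(8, 5))
sns.regplot(
    x="income",
    y="birth_rate",
    data=df,
    ax=ax,
    color=tableau_20[0],
    # Forwarded to Matplotlib's scatter: marker size and transparency.
    scatter_kws={"s": 40, "alpha": 0.6},
    # Forwarded to Matplotlib's plot: width and color of the fitted line.
    line_kws={"linewidth": 2, "color": tableau_20[6]},
)
ax.set_xlabel("Average annual income")
ax.set_ylabel("Birth rate")

# Summary statistics to compare against what the plot suggests.
print(df.describe())
```

In a script, calling plt.show() afterwards displays the figure; in a notebook it renders on its own.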

Well, that pretty much ends our discussion of the Scatter plot in particular, though we shall keep revisiting it as and when required. Let us now move on to another type of plot, one that presents the relationship between two variables, along with the distribution of each, quite well. As mentioned earlier, this is going to be the Joint Plot, and this time let us begin with the implementation before we delve into discussing its various aspects:

We are already aware of the scattered distribution here, so let us focus on what we have along the top and right spines. By the way, if we wish to reduce the size of the scattered dots in our plot, we may pass a marker size that is forwarded to the underlying scatter call, e.g. s=20, like sns.jointplot(x="total_bill", y="tip", data=tips, color=tableau_20[7], kind="scatter", s=20). Now, the plots along these spines show the marginal distribution of each variable as a histogram, which in our case covers the Total bill and the associated tips from our dataset. Let us fit the Density curve on top of them, as we had observed in the introductory lectures, and while we do that, we shall also switch from the scattered distribution to something different:

The density curve drawn over the histogram is an estimate of the PDF, i.e. the Probability Density Function: the area under the curve between two values gives the probability of an observation falling in that range. The highest peak of the curve marks the mode, which coincides with the mean for a symmetric distribution such as the Gaussian. The Gaussian, or Normal, distribution is an age-old statistical concept that requires a good amount of study to understand in every aspect, but what I have told you is roughly the gist of it.

Next, what we see above is just a variation in the type of plot, commonly termed hexagonal bins, which replace the scatter dots with hexagons shaded by the number of data points that fall inside each. We may also replace these with Contour maps like this:

Now let us try to plot two distributions in a Joint plot:

Please do note that jointplot is a figure-level function: it creates its own figure, so it can’t share a figure with other plots. We do, however, have the kdeplot function, which can draw a 2-D KDE onto a specific Axes.

Here is a link to an article on Gaussian Distribution for better insight into what gets plotted on the top and right axes, but do note that it involves extensive mathematical calculations and requires a background in Linear Algebra and Probability to understand the underlying computations.

And for color choices, here is an image of the available colors that you may use in all your Seaborn plots, since Seaborn is built on top of Matplotlib. The attachment also displays a shade preview for your convenience:

Okay! Finally, I would like to show another practical implementation of Joint plots, so let me plot it and then explain what I tried to achieve:

The only difference between this plot and what we have been learning so far is that you can also overlay specific dots on the joint plot and assign a name to each. This comes in quite handy when we need to point out specifics in a presentation or something similar. We’re now well equipped to handle pretty much any scenario that requires a Joint plot, or just a Scatter plot, to achieve our goal. In the next lecture, we shall take up a new type of plot and try to draw inferences from it. Meanwhile, if you have any doubts, feel free to let me know. Till then, Happy Visualizing!

Data Visualization with Python and Seaborn — Part 4

Data Visualization with Python and Seaborn — Part 6
