Q-Q Plots, Scatter Plots, Pair Plots — Where to use? How to use?

Nishesh Gogia
Published in Analytics Vidhya
6 min read · Jan 3, 2020

Data visualization is one of the most important steps in solving a problem as a data scientist. We get deeper insight into data when we see it visually. Whether we are looking at the correlation between variables or at their collinearity, visualization is a key step.

Data visualization largely means plotting the distribution of the data, and different questions call for different plots.

BASICS, HUH!

In this article we will look at different types of plots and ask: why do we need them? Why is one plot not sufficient to visualize a distribution? What made scientists come up with different plots?

TYPES OF PLOT

  1. Scatter Plot
  2. Strip Plot
  3. Pair Plot
  4. Q-Q Plot

There are many more, but in this article we are going to focus on these four.

SCATTER PLOT

A scatter plot uses Cartesian coordinates to display the values of, typically, two variables for a set of data.

or

The purpose is to identify the type of relationship between two or more quantitative variables.

For example, let’s say a doctor wants to see the relationship between a person’s breath-holding time and their lung capacity (here we assume that lung capacity is measured as a real number).

This is an example of a 2D scatter plot.
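A 2D scatter plot of this kind can be sketched with Matplotlib. The data below is synthetic and made up purely for illustration; the variable names, the units and the roughly linear relationship are assumptions, not real measurements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, made-up data for illustration only (not real measurements).
rng = np.random.default_rng(0)
lung_capacity = rng.uniform(3.0, 6.0, size=50)                      # litres (assumed unit)
breath_hold = 10.0 * lung_capacity + rng.normal(0.0, 5.0, size=50)  # seconds (assumed unit)

fig, ax = plt.subplots()
ax.scatter(lung_capacity, breath_hold)  # one point per person
ax.set_xlabel("Lung capacity (litres)")
ax.set_ylabel("Breath-holding time (seconds)")
ax.set_title("2D scatter plot")
```

Calling `plt.show()` afterwards displays the figure. Each point is one person, so an upward-sloping cloud of points would suggest that the two quantities increase together.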

3D Scatter Plot

Plotly has excellent built-in support for three-dimensional plots. It is not easy to cover everything about 3D plots in one article, but those who are interested in learning more can search Google for CONTOUR PLOTS.

That will give you an idea about 3D scatter plots.
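For a quick feel of a 3D scatter plot, here is a minimal sketch. Note that it uses Matplotlib's 3D axes rather than Plotly, simply to keep the example small, and the three features are random numbers made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Three made-up features; in practice these would be columns of your data set.
rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 100))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # Matplotlib 3D axes (not Plotly)
ax.scatter(x, y, z)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.set_zlabel("feature 3")
```

Calling `plt.show()` opens a view that can be rotated with the mouse in an interactive session, which is usually needed to read a 3D plot at all.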

4D/5D/6D Scatter Plots

Now imagine visualizing something in 4D or 5D. We would not be able to retrieve any information from the plot because it would be very complex to understand, and as we know, in the role of a data scientist “model interpretability” is very important.

So, to solve this problem, some smart people introduced PAIR PLOTS.

What is Pair Plot?

Rather than defining it, let me show a picture; you will easily relate to it and understand how a pair plot solves the problem of higher-dimensional scatter plots.

Exactly!!!! A pair plot is a matrix of plots showing the scatter of data points for every possible pair of features.

Personally, I remember it as a scatter plot of every pair of features.

That means if there are n dimensions or features, a pair plot gives us an n*n matrix of plots (in Seaborn, the diagonal cells show each feature’s own distribution).

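Seaborn has a very simple one-line call for pair plots:

import seaborn as sns

# df and "label_column" are placeholders: your DataFrame and the column used to color the points
sns.pairplot(data=df, hue="label_column", height=3)  # height was called size in older Seaborn versions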

FAIR ENOUGH!!!!!

Let me give you a situation: imagine you have 100 features and you want their pair plot. Now 100*100 = 10,000 will be a lot of plots.

It would be really difficult to go through every plot and make sense of it.

So we have a problem, right!!!!

So to solve this problem we have something called dimensionality reduction, which includes techniques such as PCA and t-SNE for visualizing a high-dimensional data set.

I can’t cover dimensionality reduction here because it is a very big topic in itself, but I have already written an article on the geometric intuition of PCA and t-SNE.

You will easily be able to connect the dots after reading that.

For now just remember one simple thing

Low-dimensional data (up to 10 features): use PAIR PLOTS

High-dimensional data (100 features or more): use DIMENSION REDUCTION
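Just to give a flavour of the second case, here is a rough sketch that reduces 100 features to 2 so they fit in a single scatter plot. It computes PCA by hand through a singular value decomposition in NumPy; that is an assumption made to keep the example dependency-free, and in practice a library implementation would normally be used. The data is random and made up.

```python
import numpy as np

# Made-up data: 200 rows, 100 features.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 100))

X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project every row onto the first two principal directions.
X_2d = X_centered @ Vt[:2].T

# Fraction of the total variance kept by each of the two components.
explained = (S[:2] ** 2) / np.sum(S ** 2)
```

`X_2d` has two columns, so it can be passed straight to an ordinary 2D scatter plot. `explained` tells you how much of the original variance survived, which is worth checking before trusting the picture.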

Now, what is a Q-Q plot and why do we need it?

It stands for Quantile-Quantile plot.

Before going further, I am assuming that you know what a Gaussian (normal) distribution is. If not, just know some simple facts about it:

  1. The mean, median and mode are the same in this distribution.
  2. It is bell shaped and symmetric about the mean.
  3. The standard normal distribution has mean 0 and standard deviation 1; a general normal distribution can have any mean and any positive standard deviation.
  4. This distribution has been widely studied by scientists, and we know enough about it to help shape our model into a good machine learning model.
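These facts are easy to check numerically. The sketch below draws a large sample from a normal distribution with made-up parameters (mean 50, standard deviation 10, chosen only for illustration) and then standardizes it.

```python
import numpy as np

# Made-up parameters, chosen only for illustration.
rng = np.random.default_rng(4)
sample = rng.normal(loc=50.0, scale=10.0, size=100_000)

mean = sample.mean()
median = np.median(sample)
std = sample.std()

# Standardizing (subtract the mean, divide by the standard deviation)
# turns any normal sample into one with mean 0 and standard deviation 1.
z = (sample - mean) / std

# Share of values within one standard deviation of the mean (about 68% for a normal).
within_one = np.mean(np.abs(z) < 1.0)
```

The mean and median come out almost identical, and `z` has mean 0 and standard deviation 1. That is why a Q-Q plot can compare any normal sample against the standard normal.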

So if we somehow get to know that our distribution is Gaussian, we can use these well-known properties when building a machine learning model.

BUT HOW TO DETERMINE WHETHER A DISTRIBUTION IS GAUSSIAN OR NOT?

Two common methods to determine that are:

  1. Q-Q Plot
  2. K-S test (we’ll study this later)
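As a small preview of the second method, SciPy offers `scipy.stats.kstest`. The sketch below is only a rough illustration on made-up samples. The helper name `ks_against_normal` is hypothetical, and because the mean and standard deviation are estimated from the same data, the p-values should be read as approximate.

```python
import numpy as np
from scipy import stats

# Made-up samples: one normal, one clearly skewed.
rng = np.random.default_rng(5)
normal_sample = rng.normal(loc=50.0, scale=10.0, size=500)
skewed_sample = rng.exponential(scale=10.0, size=500)

def ks_against_normal(x):
    """Standardize x, then compare it with the standard normal (hypothetical helper)."""
    z = (x - x.mean()) / x.std(ddof=1)
    return stats.kstest(z, "norm")

ks_normal = ks_against_normal(normal_sample)
ks_skewed = ks_against_normal(skewed_sample)
```

A large p-value (as in `ks_normal.pvalue`) means the test found no evidence against normality, while a tiny one (as in `ks_skewed.pvalue`) means the sample is unlikely to be normal.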

So we have the intuition for why we need a Q-Q plot.

How does a Q-Q plot determine this?

So let’s assume we have a random variable X and we take 500 observations of it, say x1, x2, …, x500.

HERE WE DO NOT KNOW THE DISTRIBUTION OF X. AT THE END OF THE Q-Q PLOT WE SHOULD KNOW WHETHER IT IS NORMALLY DISTRIBUTED OR NOT.

STEPS TO FOLLOW

1. Sort the xi’s in ascending order and find the percentiles

If you don’t know how to find a percentile or what exactly a percentile is, assume I have 100 values and I sort them in ascending order.

X = {x1, x2, x3, …, x100}, where x1 ≤ x2 ≤ … ≤ x100

In this set, say I rank each value from 1 to 100, so the first value gets rank 1 and the last value gets rank 100.

I can say that x10, the value with rank 10, is the 10th percentile: about 10% of the values lie at or below it and about 90% of the values lie above it.

That is the meaning of percentile.

So we will get 100 percentile values for the original 500 samples:

x5, x10, x15, …, x500 (these are the percentile values)

x5 is the 1st percentile, the value at or below which only 1% of the values lie (because here the sample size is 500, not 100)

x10 is the 2nd percentile, the value at or below which only 2% of the values lie

x25 is the 5th percentile, the value at or below which only 5% of the values lie
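Step 1 can be sketched in a few lines of NumPy. The 500 observations here are made up (drawn from a normal distribution purely for illustration). The slicing follows the x5, x10, …, x500 convention described above, and `np.percentile` is shown next to it as the usual library route; it interpolates between neighbouring values, so its numbers differ slightly.

```python
import numpy as np

# 500 made-up observations standing in for x1, ..., x500.
rng = np.random.default_rng(6)
x = rng.normal(loc=50.0, scale=10.0, size=500)

x_sorted = np.sort(x)

# Every 5th sorted value marks one percentile: x5, x10, ..., x500
# (1-based in the text, so index 4, 9, ..., 499 in Python).
x_percentiles = x_sorted[4::5]

# The library route: 1st to 100th percentile, with interpolation.
np_percentiles = np.percentile(x, np.arange(1, 101))

# Share of observations at or below the 10th percentile value (x50).
share_below_p10 = np.mean(x <= x_percentiles[9])
```

`x_percentiles` holds exactly 100 values, and `share_below_p10` comes out as 0.10, matching the definition of the 10th percentile.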

2. The second step is to create a random variable Y which has a normal distribution with mean = 0 and standard deviation = 1.

Again we take 500 observations, sort them and find their percentiles.

So let’s say we have the 100 percentile values y1, y2, y3, …, y100, computed the same way as for our original variable X.

LET ME REMIND YOU THAT WE DON’T KNOW WHAT THE DISTRIBUTION OF X IS. THAT’S WHY WE ARE USING THE Q-Q PLOT TO DETERMINE WHETHER THE DISTRIBUTION OF X IS GAUSSIAN/NORMAL OR NOT.

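3. The third step is to draw the Q-Q plot of X against Y

Relabeling the 100 percentile values of X and of Y as x1, …, x100 and y1, …, y100, we have the pairs {x1,y1}, {x2,y2}, {x3,y3}, …, {x100,y100}.

We plot these pairs, and if the points lie approximately on a straight line, it suggests that X is NORMALLY DISTRIBUTED, though it need not have mean = 0 and standard deviation = 1 (the slope and intercept of the line depend on the standard deviation and mean of X).

If the points clearly do not lie on a straight line, it suggests that X is not NORMALLY DISTRIBUTED.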
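The three steps above can be put together by hand as follows. This is only a sketch on made-up data: `x` is drawn from a normal distribution and `x_skewed` from an exponential one, so we know in advance what the answer should be. The helper `qq_points` and the use of a correlation coefficient as a rough "straightness" score are my own assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
probs = np.arange(1, 101)  # 1st to 100th percentile

def qq_points(x, y):
    """Return the 100 percentile values of x and of y (hypothetical helper)."""
    return np.percentile(x, probs), np.percentile(y, probs)

# Step 1: the sample under test (made up; pretend we do not know it is normal).
x = rng.normal(loc=50.0, scale=10.0, size=500)
# Step 2: a reference sample from the standard normal distribution.
y = rng.normal(loc=0.0, scale=1.0, size=500)
# Step 3: pair up the percentiles and plot them.
xq, yq = qq_points(x, y)

fig, ax = plt.subplots()
ax.scatter(yq, xq)
ax.set_xlabel("Percentiles of Y (standard normal)")
ax.set_ylabel("Percentiles of X")

# Rough straightness score: correlation between the two sets of percentiles.
r_normal = np.corrcoef(xq, yq)[0, 1]

# A clearly non-normal sample for comparison.
x_skewed = rng.exponential(scale=10.0, size=500)
xq_skewed, _ = qq_points(x_skewed, y)
r_skewed = np.corrcoef(xq_skewed, yq)[0, 1]
```

For the normal sample the points hug a straight line and `r_normal` is very close to 1. For the exponential sample the points bend away from the line and `r_skewed` is noticeably lower.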

In the picture below the points deviate from the line at the ends, which suggests that the sample is not exactly normally distributed.

CODE TO PLOT

SciPy has a built-in tool to draw a Q-Q plot:

import scipy.stats as stats
import matplotlib.pyplot as plt

# X is the sample whose distribution we want to check
stats.probplot(X, dist="norm", plot=plt)
plt.show()
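`probplot` also returns the numbers behind the plot, which is handy when you want a quick check without looking at a figure. Unlike the manual steps above, it compares the sample against theoretical normal quantiles rather than a second random sample. The sample below is made up (normal with mean 50 and standard deviation 10, chosen only for illustration).

```python
import numpy as np
from scipy import stats

# Made-up sample; pretend we do not know its distribution.
rng = np.random.default_rng(8)
X = rng.normal(loc=50.0, scale=10.0, size=500)

# Without plot=..., probplot only computes the values:
# the theoretical quantiles, the sorted sample, and a least-squares line fit.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(X, dist="norm")
```

Here `r` is the correlation of the fitted line and is close to 1 when the points are nearly straight. `slope` and `intercept` come out close to the standard deviation and the mean of the sample.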

In the next article we will cover why histograms and the PDF (probability density function) are necessary, what a kernel density function is, and why the CDF (cumulative distribution function) is the most important plot when solving a machine learning problem.

THANKS FOR READING…

Nishesh Gogia
