What are QQ Plots?

Published in Analytics Vidhya · 5 min read · Oct 5, 2020

Suppose you are given a few observations (samples) of a random variable X and are asked one of the following two questions:

Is the random variable X normally distributed?

or

Which family of distributions does the random variable X belong to?

This is where QQ Plots come into the picture and help us answer the above questions.

There are also formal statistical tests for this, including the KS (Kolmogorov–Smirnov) test and the AD (Anderson–Darling) test, but QQ plots are one of the simplest graphical methods to answer these questions.
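For comparison, here is a minimal sketch of how the two tests mentioned above could be run with SciPy. The seed, sample size and variable names are assumptions made for illustration only; they are not part of the original example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
sample = rng.normal(loc=10, scale=5, size=200)

# Standardize with the sample mean and standard deviation so that the KS test
# compares against N(0, 1). Estimating these parameters from the same data
# makes the KS p-value only approximate.
z = (sample - sample.mean()) / sample.std(ddof=1)
ks_stat, ks_p = stats.kstest(z, "norm")

# Anderson-Darling test for normality (larger statistic = stronger evidence
# against normality).
ad = stats.anderson(sample, dist="norm")

print(f"KS statistic = {ks_stat:.3f}, p-value = {ks_p:.3f}")
print(f"AD statistic = {ad.statistic:.3f}")
```

These tests return numbers rather than a picture, which is why a QQ plot is often used alongside them.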

“In statistics, a Q–Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.” — Wikipedia

QQ plots have two main applications, which correspond to the two questions above.

That raises a question: why is the Normal distribution singled out as a separate question and treated as a fundamental application of QQ plots?

Why Normal Distributions?

QQ plots can be used to check a random variable against any type of distribution, be it Exponential, Pareto, Uniform, etc. What makes the Normal distribution special is that it is one of the most common distributions occurring in nature. Also, the empirical rule (more commonly known as the 68–95–99.7 rule), which tells us the share of the data that lies within 1, 2, and 3 standard deviations of the mean, makes day-to-day work more efficient.

“In statistics, the 68–95–99.7 rule, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations, respectively; more precisely, 68.27%, 95.45% and 99.73% of the values lie within one, two and three standard deviations of the mean, respectively.” -Wikipedia
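The percentages in the rule can be checked directly from the standard normal CDF. This is a small sketch using `scipy.stats.norm`; the dictionary name `coverage` is just an illustrative choice.

```python
from scipy.stats import norm

coverage = {}
for k in (1, 2, 3):
    # Probability mass of a normal distribution within k standard deviations of the mean
    coverage[k] = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage[k]:.2%}")

# within 1 standard deviation(s): 68.27%
# within 2 standard deviation(s): 95.45%
# within 3 standard deviation(s): 99.73%
```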

Steps to plot QQ (Theoretical)

Suppose we are given a random variable X with 500 observations/samples. Sort all of these samples and compute their percentiles.

A percentile is the value at or below which a given percentage of the observations fall. For example, the 75th percentile is the value below which 75% of the observations may be found.

Quartiles are a special case: the three values Q1, Q2, and Q3, which are the 25th, 50th, and 75th percentiles, divide the data into four groups containing approximately equal numbers of observations. Percentiles and quartiles are both examples of quantiles, which is where the QQ plot gets its name.
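As a quick illustration of percentiles and quartiles, here is a sketch on a toy data set (the integers 1 to 100, an assumption chosen so the results are easy to check). `np.percentile` uses linear interpolation between data points by default, so the values are not always members of the data set.

```python
import numpy as np

data = np.arange(1, 101)  # toy data: the integers 1..100

# The three quartiles are the 25th, 50th and 75th percentiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)  # 25.75 50.5 75.25

# Share of observations at or below the 75th percentile
print(np.mean(data <= q3))  # 0.75
```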

Next, we will consider a random variable Y which has a Gaussian distribution. Let’s take 1000 samples of Y and, as above, sort them and find their percentiles. These percentiles are called theoretical quantiles.

We can take Y to be any kind of distribution and follow the same procedure as mentioned.

After this, we plot the percentiles of the random variable X on the y-axis and the percentiles of Y on the x-axis, thus forming the Quantile-Quantile plot.

Each point on the plot corresponds to a percentile of Y versus the same percentile of the random variable X.

Conclusion: If all these points lie roughly on a straight line, then X and Y can be said to come from the same family of distributions; since Y is Gaussian here, X is approximately Gaussian too.
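The steps above can be sketched by hand in a few lines. One deliberate swap, stated as an assumption: instead of drawing 1000 samples of Y and computing their percentiles, this sketch takes the exact N(0,1) quantiles from `scipy.stats.norm.ppf`, which is what the sampled percentiles approximate. The seed and variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed, for reproducibility only
x = rng.normal(loc=10, scale=5, size=500)  # the sample under test (X)

probs = np.arange(1, 100) / 100           # 1st, 2nd, ..., 99th percentiles
sample_q = np.percentile(x, probs * 100)  # percentiles of X (y-axis)
theoretical_q = stats.norm.ppf(probs)     # exact N(0,1) quantiles (x-axis)

# If the points (theoretical_q, sample_q) lie on a straight line,
# their correlation is close to 1.
r = np.corrcoef(theoretical_q, sample_q)[0, 1]
print(f"correlation between the two sets of quantiles: {r:.4f}")
```

Plotting `theoretical_q` on the x-axis against `sample_q` on the y-axis gives the QQ plot itself; the correlation is just a one-number summary of how straight that plot is.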

Let’s see this practically!

In this example, we will verify whether random variable X has a Gaussian distribution.

Here, X is a random variable with mean (loc) 10 and standard deviation (scale) 5, with 100 observations.

import numpy as np

# generate 100 samples from N(10, 5)
X = np.random.normal(loc=10, scale=5, size=100)

We plot a QQ plot by comparing X with the standard normal variable N(0,1), using the pylab library for plotting the graph.

import pylab
import scipy.stats as stats

# QQ plot of X against the standard normal distribution
stats.probplot(X, dist="norm", plot=pylab)
pylab.show()

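Though the mean and standard deviation of the variables on the x and y axes are different, they come from the same family of distributions, i.e. the Gaussian distribution, so almost all the points lie on a straight line. Note that the red line drawn by probplot is a least-squares fit through the points rather than the 45-degree line y = x; the points would fall on the 45-degree line only if X were itself N(0,1).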
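When `probplot` is called without the `plot` argument it only returns the numbers behind the figure, including the slope, intercept and correlation of the fitted line. A small sketch (the seed is an assumption for reproducibility) shows that for Gaussian data the slope is close to the standard deviation and the intercept is close to the mean:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)  # fixed seed, for reproducibility only
X = rng.normal(loc=10, scale=5, size=100)

# (theoretical quantiles, ordered sample), (slope, intercept, correlation r)
(theoretical_q, ordered_x), (slope, intercept, r) = stats.probplot(X, dist="norm")

print(f"slope     = {slope:.2f}  (roughly the standard deviation, 5)")
print(f"intercept = {intercept:.2f}  (roughly the mean, 10)")
print(f"r         = {r:.4f}")
```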

Now, let’s see what happens if the number of observations in the random variable X is increased.

# generate 1000 samples from N(10, 5)
X = np.random.normal(loc=10, scale=5, size=1000)
stats.probplot(X, dist="norm", plot=pylab)
pylab.show()

Now we see that more points fall on the red line compared to the previous plot.

Let’s see what happens if we decrease the number of observations/samples.

# generate 50 samples from N(10, 5)
X = np.random.normal(loc=10, scale=5, size=50)
stats.probplot(X, dist="norm", plot=pylab)
pylab.show()

The points deviate a lot more from the red line than in the first plot.

Now, plotting the QQ plot with 50,000 samples!

# generate 50000 samples from N(10, 5)
X = np.random.normal(loc=10, scale=5, size=50000)
stats.probplot(X, dist="norm", plot=pylab)
pylab.show()

We see that as the number of samples increases, more and more points lie closer to the line.

Limitation of QQ plot

One of the major limitations of the QQ plot is that as the number of observations or samples decreases, the plot becomes more difficult to interpret.
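One way to see this limitation numerically is to repeat the experiment many times and look at how well the points follow the fitted line at each sample size. The helper `r_values`, the seed and the number of trials below are hypothetical choices made for this sketch.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)  # fixed seed, for reproducibility only

def r_values(n, trials=200):
    """Correlation r of the QQ points with the fitted line,
    for `trials` independent N(10, 5) samples of size n."""
    return np.array([
        stats.probplot(rng.normal(loc=10, scale=5, size=n), dist="norm")[1][2]
        for _ in range(trials)
    ])

for n in (20, 100, 1000):
    r = r_values(n)
    print(f"n={n:5d}  min r={r.min():.4f}  mean r={r.mean():.4f}")
```

Even though every sample here is truly Gaussian, the small samples give noticeably lower and more variable correlations, so a wobbly QQ plot from a handful of points is weak evidence either way.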

Another example

Here we generate 100 samples from a uniform distribution and plot a QQ plot against Y, which has a Gaussian distribution.

# generate 100 samples from a uniform distribution
X = np.random.uniform(low=-1, high=1, size=100)
# plot the samples against a Gaussian distribution
stats.probplot(X, dist="norm", plot=pylab)
pylab.show()

As the distributions on the two axes are different (x-axis: Gaussian, y-axis: uniform), the points do not lie on the line. They curve away from it, and they diverge the most at the extreme ends of the graph.

To see the difference more clearly, we can use a larger number of samples.

# generate 5000 samples from a uniform distribution
X = np.random.uniform(low=-1, high=1, size=5000)
# plot the samples against a Gaussian distribution
stats.probplot(X, dist="norm", plot=pylab)
pylab.show()
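Since Y can be any distribution, the same uniform samples can also be compared against a uniform reference by passing `dist="uniform"` to `probplot`. The sketch below (seed and variable names are assumptions) compares the correlation of the fitted line for the two reference distributions instead of drawing the plots:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)  # fixed seed, for reproducibility only
X = rng.uniform(low=-1, high=1, size=5000)

# Same data, two different theoretical distributions on the x-axis
_, (_, _, r_norm) = stats.probplot(X, dist="norm")
_, (_, _, r_unif) = stats.probplot(X, dist="uniform")

print(f"r against Gaussian quantiles: {r_norm:.4f}")
print(f"r against uniform quantiles:  {r_unif:.4f}")
```

The correlation against the uniform quantiles should be much closer to 1, which is the numeric counterpart of the points lining up when the right family is chosen.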

Conclusion

Thus, if most of the points lie on a straight line, we can conclude that the distributions on the x-axis and y-axis are from the same family; if they don’t, then the random variable X belongs to a different distribution from the one we are comparing it with.