Crash Course in Data — Seeing is Believing: Exploring Data Distributions with Q-Q Plots

Published in AI Skunks

10 min read · Mar 14, 2023

In statistics, a Q–Q plot (quantile-quantile plot) is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.
A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate).
This defines a parametric curve where the parameter is the index of the quantile interval.

In a Q-Q plot, the quantiles of the first data set are plotted against the quantiles of the second data set. A quantile is the value below which a given fraction of the points fall. In other words, the 0.3 (or 30%) quantile is the value at which 30% of the data are below it and 70% are above it.
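As a minimal sketch of this definition, the snippet below uses NumPy's np.quantile() on a made-up sample (the integers 1 to 100, chosen only for illustration) and checks what share of the points lie below the 0.3 quantile.

```python
import numpy as np

# Hypothetical sample: the integers 1..100, used only for illustration
values = np.arange(1, 101)

# The 0.3 quantile is the value below which about 30% of the data fall
q30 = np.quantile(values, 0.3)

# Fraction of points strictly below that value
share_below = np.mean(values < q30)

print(q30, share_below)
```

The share below the 0.3 quantile comes out at roughly 30%, with the remaining 70% of the points above it.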

A 45-degree reference line is also plotted. The points should roughly lie along this reference line if the two sets are drawn from a population with the same distribution. The further the points depart from this reference line, the more evidence there is that the two data sets came from populations with different distributions.

import numpy as np

#create dataset with 1000 values that follow a normal distribution
np.random.seed(0)
data = np.random.normal(0,1, 1000)

#view first 10 values
data[:10]

array([ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ,  1.86755799,
       -0.97727788,  0.95008842, -0.15135721, -0.10321885,  0.4105985 ])

import statsmodels.api as sm
import matplotlib.pyplot as plt

#create Q-Q plot with 45-degree line added to plot
fig = sm.qqplot(data, line='45')
plt.show()
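The 45-degree line is only a fair yardstick when the sample and the theoretical quantiles are on the same scale. The sketch below illustrates this with NumPy and SciPy only: it assumes a hypothetical sample drawn from a normal distribution with mean 5 and standard deviation 2, and measures the average vertical gap between the Q-Q points and the line y = x before and after standardizing. The helper mean_gap_from_45 is an illustrative name, not part of any library.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)
raw = rng.normal(5, 2, 1000)  # hypothetical sample, normal but not standard normal

# Standard normal quantiles at the plotting positions (i - 0.5) / n
n = raw.size
theoretical = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

def mean_gap_from_45(sample):
    """Average vertical distance between the Q-Q points and the line y = x."""
    return np.mean(np.abs(np.sort(sample) - theoretical))

# Standardize: subtract the sample mean, divide by the sample standard deviation
standardized = (raw - raw.mean()) / raw.std(ddof=1)

gap_raw = mean_gap_from_45(raw)
gap_std = mean_gap_from_45(standardized)
print(gap_raw, gap_std)
```

The raw sample sits far from the 45-degree line even though it is perfectly normal, while the standardized sample hugs it. If your data is not already on a standard normal scale, standardize it first or judge straightness against a fitted line rather than the 45-degree line.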

The q-q plot is formed by:

Vertical axis: Estimated quantiles from data set 1
Horizontal axis: Estimated quantiles from data set 2

Both axes are in the units of their respective data sets. In other words, the actual quantile level is not plotted: for any given point on the Q-Q plot, we know that the quantile level is the same for both coordinates, but not what that level is.

If the data sets are the same size, the Q-Q plot is essentially a plot of sorted data set 1 against sorted data set 2. If they are not the same size, the quantiles are usually chosen to match the sorted values of the smaller data set, and the quantiles of the larger data set are then interpolated.
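The pairing rule described above can be sketched as follows. This is one reasonable way to do it, not the only one: the function name qq_pairs and the plotting positions (i - 0.5) / n are assumptions made for illustration, and the interpolation uses np.interp().

```python
import numpy as np

def qq_pairs(sample_a, sample_b):
    """Return matched quantiles (a_q, b_q) for a two-sample Q-Q plot."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))

    # Make `a` the smaller sample; remember whether we swapped
    swapped = len(a) > len(b)
    if swapped:
        a, b = b, a

    # Probability levels attached to the sorted values of each sample
    probs_a = (np.arange(1, len(a) + 1) - 0.5) / len(a)
    probs_b = (np.arange(1, len(b) + 1) - 0.5) / len(b)

    # Interpolate the larger sample's quantiles at the smaller sample's levels
    b_q = np.interp(probs_a, probs_b, b)

    # Return the quantiles in the same order as the arguments
    return (b_q, a) if swapped else (a, b_q)

rng = np.random.default_rng(0)
small = rng.normal(0, 1, 200)    # hypothetical smaller sample
large = rng.normal(0, 1, 1000)   # hypothetical larger sample
x_q, y_q = qq_pairs(small, large)
print(len(x_q), len(y_q))
```

When both samples have the same size, the interpolation step returns the sorted values unchanged, so the result reduces to sorted data set 1 against sorted data set 2.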

The Q-Q plot helps answer the following questions:

Do the two data sets come from populations with a common distribution?
Do the two data sets have the same location and scale?
Do the two data sets have similar distributional shapes?
Do the two data sets have similar tail behavior?
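The location and scale question can be made concrete with a small sketch. It assumes two hypothetical normal samples of equal size, the second shifted by 3 and stretched by a factor of 2, and fits a straight line through the Q-Q points with np.polyfit(). When two distributions share a shape, the slope of that line approximates the ratio of their scales and the intercept approximates the shift in location.

```python
import numpy as np

rng = np.random.default_rng(1)
first = rng.normal(0, 1, 5000)    # hypothetical reference sample
second = rng.normal(3, 2, 5000)   # same shape, shifted by 3 and scaled by 2

# Equal sizes, so the Q-Q points are sorted `first` against sorted `second`
slope, intercept = np.polyfit(np.sort(first), np.sort(second), 1)

print(slope, intercept)
```

Here the points still fall on a straight line, but not on the 45-degree line: the slope is close to 2 and the intercept close to 3. Curvature, rather than a shifted or tilted straight line, is what signals a difference in shape or tails.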

Usage of Q-Q plot

The primary use cases of Q-Q plots are:

Testing for normality:

Q-Q plots are commonly used to check whether a dataset follows a normal distribution. If the data points fall close to the straight line on the Q-Q plot, the data are consistent with a normal distribution. If the data points deviate systematically from the straight line, it suggests that the data do not follow a normal distribution.

Comparing distributions:

Q-Q plots can be used to compare two distributions to see if they are similar or different. If the two distributions are similar, the Q-Q plot will show the data points falling close to the straight line. If the two distributions are different, the Q-Q plot will show the data points deviating from the straight line.

Identifying outliers:

Q-Q plots can be used to identify outliers in a dataset. Outliers are data points that are significantly different from the other data points in the dataset. They typically show up at the ends of the Q-Q plot as isolated points lying far from the straight line that the rest of the data follow.

Checking for linearity:

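Q-Q plots can be used to check whether two distributions are linearly related, that is, whether they have the same shape and differ only in location and scale. If the data points fall close to a straight line on the Q-Q plot, even one that is not the 45-degree line, it suggests that the quantiles of one variable are a linear function of the quantiles of the other. Note that this says nothing about how individual paired observations relate; a scatter plot of the raw values is the tool for that.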

Assessing model assumptions:

Q-Q plots can be used to assess the assumptions of statistical models. If the residuals (the difference between the predicted values and the actual values) of a model follow a normal distribution, the Q-Q plot of the residuals will show the data points falling close to the straight line. If the residuals do not follow a normal distribution, the Q-Q plot will show the data points deviating from the straight line.
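As a rough sketch of the residual check, the snippet below fits a straight line to synthetic data with np.polyfit() and passes the residuals to scipy.stats.probplot(), which returns the Q-Q coordinates together with a least-squares line through them. The data (a line with slope 1.5, intercept 2 and standard normal noise) is invented for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 300)
y = 1.5 * x + 2.0 + rng.normal(0, 1, x.size)   # hypothetical data with normal noise

# Fit the model and compute the residuals (actual minus predicted)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Without a `plot` argument, probplot only returns the Q-Q coordinates and the fit
(theoretical_q, ordered_resid), (qq_slope, qq_intercept, r) = stats.probplot(
    residuals, dist="norm"
)

print(qq_slope, qq_intercept, r)
```

For well-behaved residuals the correlation r is close to 1, the intercept is close to 0, and the slope is close to the residual standard deviation. Passing plot=plt to probplot() draws the same points instead of only returning them.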

Overall, Q-Q plots are a useful tool for visualizing and comparing probability distributions, and can help identify patterns and anomalies in the data.

1. Testing for normality

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def qqplot_normality_test(data):
    """
    This function generates a QQ plot for testing normality of a given data sample.

    Parameters:
        data (numpy array or pandas series): The data sample to test for normality.

    Returns:
        None
    """
    # Sort the data in ascending order
    sorted_data = np.sort(data)

    # Compute the expected quantiles for a normal distribution,
    # one per data point, at the plotting positions (i - 0.5) / n
    n = len(sorted_data)
    expected_quantiles = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

    # Generate the QQ plot
    plt.figure(figsize=(8, 6))
    plt.scatter(expected_quantiles, sorted_data)
    plt.xlabel('Expected Quantiles (Normal Distribution)')
    plt.ylabel('Observed Quantiles (Data Sample)')
    plt.title('Q-Q Plot for Normality Test')
    plt.show()

This function takes a data sample as input and generates a QQ plot for testing the normality of the data.

First, the function sorts the data in ascending order using the np.sort() function. It then computes the expected quantiles for a normal distribution using the stats.norm.ppf() function, which computes the inverse of the cumulative distribution function for a normal distribution. We generate one expected quantile per data point, at the probabilities (i - 0.5) / n for i = 1, ..., n, so that the most extreme observations are compared with suitably extreme normal quantiles instead of being capped at the 1st and 99th percentiles.

Finally, the function generates the QQ plot using the plt.scatter() function and adds labels and a title to the plot using the plt.xlabel(), plt.ylabel(), and plt.title() functions. The resulting plot will show how closely the data sample follows a normal distribution. If the points fall along a straight line, it suggests that the data is normally distributed. If the points deviate from the straight line, it indicates that the data is not normally distributed.

To use this function, you simply need to pass in a numpy array or pandas series containing your data sample. For example, if you have a numpy array called my_data containing your data, you would call the function like this:

my_data = np.random.normal(0,1, 1000)
qqplot_normality_test(my_data)

# Set the parameters of the beta distribution
alpha = 2
beta = 5

# Generate a sample of 1000 random numbers from the beta distribution
sample = np.random.beta(alpha, beta, size=1000)

qqplot_normality_test(sample)

The first plot is close to a straight line, which is consistent with a normal distribution, while the second plot is clearly curved, showing that the beta sample is not normally distributed. This is how a QQ plot can be used as a visual check of normality.
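Judging straightness by eye can be backed up with a number. One option, sketched here under the assumption that SciPy is available, is the correlation coefficient between the sorted data and the normal quantiles, which scipy.stats.probplot() returns alongside its fitted line. The helper qq_correlation is an illustrative name, and the two samples are regenerated with a fixed seed rather than taken from the code above.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(0, 1, 1000)
beta_sample = rng.beta(2, 5, 1000)

def qq_correlation(sample):
    """Correlation between the sorted data and normal quantiles (closer to 1 = straighter)."""
    _, (_, _, r) = stats.probplot(sample, dist="norm")
    return r

r_normal = qq_correlation(normal_sample)
r_beta = qq_correlation(beta_sample)
print(r_normal, r_beta)
```

The normal sample gives a correlation very close to 1, and the skewed beta sample gives a visibly lower one. This is a summary of the plot, not a replacement for looking at it.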

2. Comparing Distributions

Q-Q plots can be used to compare any probability distribution to a reference distribution, including beta distributions.

To create a Q-Q plot for a beta distribution, you would first need to generate a random sample from the beta distribution that you want to compare to the reference distribution. Then, you would plot the quantiles of the sample against the quantiles of the reference distribution on a scatter plot.

The Q-Q plot will show how well the beta sample matches the reference distribution. If the data points on the Q-Q plot follow a straight line, it suggests that the sample is consistent with the reference distribution. If the data points deviate from the straight line, it indicates that the reference distribution is not a good fit for the sample.

It’s worth noting that beta distributions can take on different shapes depending on the values of the parameters alpha and beta. Therefore, it may be useful to create multiple Q-Q plots with different parameter values to compare how well the beta distribution fits the reference distribution under different scenarios.
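One way to run such a comparison without drawing every plot is sketched below. It assumes a Beta(2, 5) reference distribution and a handful of candidate parameter pairs picked only for illustration, and summarizes each Q-Q plot by the correlation that scipy.stats.probplot() reports between the sorted sample and the reference quantiles.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
reference_params = (2, 5)                            # reference distribution: Beta(2, 5)
candidates = [(2, 5), (5, 2), (0.5, 0.5), (5, 5)]    # hypothetical shapes to try

results = {}
for a, b in candidates:
    sample = rng.beta(a, b, size=1000)
    # Q-Q correlation of this sample against the reference Beta(2, 5) quantiles
    _, (_, _, r) = stats.probplot(sample, sparams=reference_params, dist="beta")
    results[(a, b)] = r

for params, r in results.items():
    print(params, round(r, 4))
```

The sample drawn with the same parameters as the reference gives the straightest Q-Q plot, while the other shapes score lower. Passing plot=plt inside the loop would draw each Q-Q plot as well.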

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Set the parameters of the beta distribution
alpha = 2
beta = 5

# Generate a sample of 1000 random numbers from the beta distribution
sample = np.random.beta(alpha, beta, size=1000)

# Generate the theoretical quantiles for a beta distribution with the same parameters
beta_dist = stats.beta(alpha, beta)
quantiles = beta_dist.ppf(np.linspace(0.01, 0.99, 100))

# Create the Q-Q plot
plt.figure(figsize=(8, 6))
stats.probplot(sample, dist=beta_dist, plot=plt)
plt.plot(quantiles, quantiles, color='red', linestyle='--')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')
plt.title('Q-Q Plot for Beta Distribution')
plt.show()

In this code, we first set the parameters alpha and beta for the beta distribution. We then generate a sample of 1000 random numbers from the beta distribution using the np.random.beta() function.

Next, we generate the theoretical quantiles for the beta distribution with the same parameters using the stats.beta.ppf() function, which computes the inverse of the cumulative distribution function for the beta distribution. We generate 100 quantiles between 0.01 and 0.99.

Finally, we create the Q-Q plot using the stats.probplot() function, which takes the sample and the reference distribution as arguments and draws the points together with its own least-squares fit line. We also plot the line of perfect fit (y = x) as a dashed red line using the plt.plot() function, and add labels and a title to the plot using the plt.xlabel(), plt.ylabel(), and plt.title() functions.

This code will create a Q-Q plot showing how well the sample of data fits a beta distribution with the given parameters. You can change the values of alpha and beta to create Q-Q plots for different beta distributions.

3. Identifying outliers

This function works in a similar way to the previous function, but with an additional step for identifying outliers. After sorting the data and computing the expected quantiles, the function computes the absolute differences between the observed and expected quantiles using np.abs(). It then takes the median of these absolute differences using np.median(), which we use as the median absolute deviation (MAD) of the points from the reference line.

The MAD is a robust measure of dispersion that is less sensitive to outliers than the standard deviation. We can use the MAD to flag outliers in the QQ plot by drawing a band at a distance of 1.5 times the MAD on either side of the reference line y = x. Any points that fall above or below this band may be considered outliers. Because the observed values are compared directly with standard normal quantiles, the data should be on a standard normal scale (standardize it first if it is not).

To generate the QQ plot with the outlier detection band, we simply add plt.plot() calls after the scatter plot. Each call specifies the endpoints of a line at the first and last expected quantiles: one for the reference line itself, and one each at 1.5 times the MAD above and below it. The resulting plot will show the data points along with the reference line and the outlier detection band.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def qqplot_outliers(data):
    """
    This function generates a QQ plot for identifying outliers in a given data sample.

    Parameters:
        data (numpy array or pandas series): The data sample to test for outliers.

    Returns:
        None
    """
    # Sort the data in ascending order
    sorted_data = np.sort(data)

    # Compute the expected quantiles for a normal distribution,
    # one per data point, at the plotting positions (i - 0.5) / n
    n = len(sorted_data)
    expected_quantiles = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

    # Compute the absolute differences between observed and expected quantiles
    absolute_diff = np.abs(sorted_data - expected_quantiles)

    # Compute the median of the absolute differences (used as the MAD)
    mad = np.median(absolute_diff)

    # Generate the QQ plot
    plt.figure(figsize=(8, 6))
    plt.scatter(expected_quantiles, sorted_data)
    plt.xlabel('Expected Quantiles (Normal Distribution)')
    plt.ylabel('Observed Quantiles (Data Sample)')
    plt.title('Q-Q Plot for Identifying Outliers')

    # Reference line y = x, with a band of 1.5 * MAD on either side of it
    ends = np.array([expected_quantiles[0], expected_quantiles[-1]])
    plt.plot(ends, ends, color='r')
    plt.plot(ends, ends + 1.5 * mad, color='r', linestyle='--')
    plt.plot(ends, ends - 1.5 * mad, color='r', linestyle='--')
    plt.show()

qqplot_outliers(my_data)

4. Checking for linearity

This function takes two arguments, x and y, which should be numpy arrays or pandas series containing the values of the two variables you want to compare. Note that a Q-Q plot compares the distributions of the two variables, not their paired values: a straight line means that the quantiles of y are a linear function of the quantiles of x, that is, the two distributions have the same shape and differ at most in location and scale. The function picks evenly spaced probability levels, one per observation of the smaller sample, and computes the quantiles of each variable at those levels using np.quantile().

The QQ plot is generated using plt.scatter(), with the quantiles of the first variable on the x-axis and the quantiles of the second variable on the y-axis. We add a red line to the plot using plt.plot(), which is the least-squares straight line through the quantile pairs, fitted with np.polyfit(). If the scatter plot of the quantiles closely follows the red line, it suggests that the quantiles of the two variables are linearly related.

To use this function, you simply need to pass in the numpy arrays or pandas series containing your two variables. The example call at the end of the code uses two numpy arrays called x1 and x2:

import numpy as np
import matplotlib.pyplot as plt

def qqplot_linearity(x, y):
    """
    This function generates a two-sample QQ plot to check whether the
    quantiles of two variables are linearly related.

    Parameters:
        x (numpy array or pandas series): The values of the first variable.
        y (numpy array or pandas series): The values of the second variable.

    Returns:
        None
    """
    # Evenly spaced probability levels, one per observation of the smaller sample
    n = min(len(x), len(y))
    probs = (np.arange(1, n + 1) - 0.5) / n

    # Compute the quantiles of each variable at those levels
    quantiles_x = np.quantile(x, probs)
    quantiles_y = np.quantile(y, probs)

    # Fit a straight line through the quantile pairs
    slope, intercept = np.polyfit(quantiles_x, quantiles_y, 1)

    # Generate the QQ plot
    plt.figure(figsize=(8, 6))
    plt.scatter(quantiles_x, quantiles_y)
    plt.xlabel('Quantiles (First Variable)')
    plt.ylabel('Quantiles (Second Variable)')
    plt.title('Q-Q Plot for Checking Linearity')
    plt.plot(quantiles_x, slope * quantiles_x + intercept, color='r')
    plt.show()

x1 = np.random.normal(1,2,1000)
x2 = np.random.normal(1,2,1000)
qqplot_linearity(x1, x2)