Numerically generate important probability distributions using random numbers

MCMC Addict
Dec 7, 2023

Understand where and when to use the distributions


When we first studied statistics, after we understood the histogram, we moved from discrete to continuous distributions. Then we came to understand the probability density function. Statistics involves many probability distributions. Some of them have an easy-to-imagine shape: the rectangular (uniform), triangular, or trapezoidal distribution.

For other distributions, the shape is not hard to remember, but the derivation is harder to grasp: for instance, the normal distribution, Student's t-distribution, the chi-squared distribution, and the F-distribution. In practice, even if we have forgotten their derivations, we should at least remember where and when they are used. In this article, I will show you how to generate such distributions from samples of a uniform distribution using Python code.

Generation of the normal distribution from two samples of a uniform distribution on [0, 1]

First, let me generate the most important distribution, the normal distribution, from two samples of a uniform distribution on [0, 1], for which we need the Box-Muller transformation. This transformation is similar to the coordinate transformation from Cartesian to polar coordinates: it takes two independent uniform random numbers and produces two output components, each of which follows a standard normal distribution and is uncorrelated with the other.
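In the notation used below, the transformation can be written as

z_1 = √(−2 ln u_1) · cos(2π u_2)
z_2 = √(−2 ln u_1) · sin(2π u_2),

where u_1 and u_2 are independent samples from the uniform distribution on (0, 1] and z_1, z_2 come out as independent standard normal variables. The following code implements this: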

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st

N = 1000000
bins = 100

# two samples from the uniform distribution;
# np.finfo(float).tiny replaces zero in u1 to avoid an error from log(0),
# since the logarithm below is applied to u1
u1 = np.random.uniform(np.finfo(float).tiny, 1, N)
u2 = np.random.uniform(size=N)

# the Box-Muller transformation makes rcos and rsin follow a standard normal distribution
rcos = np.sqrt(-2*np.log(u1)) * np.cos(2*np.pi*u2)
rsin = np.sqrt(-2*np.log(u1)) * np.sin(2*np.pi*u2)

# rcos and rsin are independent: the correlation coefficient is about zero
print(f'correlation coeff = {np.corrcoef(rcos, rsin)[0, 1]:.3f}')

fig, ax = plt.subplots(1, 2, figsize=(12, 4))
bins_array = np.linspace(-6, 6, bins)
r = [rcos, rsin]
title = ['using cosine component', 'using sine component']
for i in range(2):
    ax[i].hist(r[i], bins, density=True)
    ax[i].set_title(title[i])
    ax[i].plot(bins_array, st.norm.pdf(bins_array))
    ax[i].legend(('simulation', 'theory'))
plt.show()
Standard normal distributions obtained by the Box-Muller transformation
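As a quick sanity check, a sketch beyond the original code: a Kolmogorov-Smirnov test can compare the generated samples (reusing rcos and st from the snippet above) with the standard normal distribution. A large p-value is consistent with normality.

# added check: compare the Box-Muller samples with N(0, 1) using the
# Kolmogorov-Smirnov test (a subsample keeps the runtime small)
ks = st.kstest(rcos[:10000], 'norm')
print(f'KS statistic = {ks.statistic:.4f}, p-value = {ks.pvalue:.3f}')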

Generation of Student’s t-distribution from samples of normal distributions

Student’s t-distribution is widely used in statistical analyses such as assessing the statistical significance of the difference between two sample means, constructing confidence intervals, and linear regression. Let x_1, …, x_n be independent and identically distributed samples from a normal distribution with mean μ and standard deviation σ. The t-distribution is obtained by plotting the histogram of the t-statistic, which is given by

t = (x_bar − μ) / (s / √n) ~ t_(n−1),

where x_bar is the sample mean, s is the sample standard deviation, and t_(n−1) denotes the t-distribution with n−1 degrees of freedom. The following code reflects the above equation with μ = 0. Note that there are two possible methods. You can see the agreement between the simulation and the theory for different degrees of freedom.

dfs = [2, 10, 50, 100]                        # df: degrees of freedom
bins = 200
bins_array = np.linspace(-6, 6, bins)
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

legend = []
for df in dfs:
    # a 2-D array via reshape drastically speeds up the computation
    # Method 1: t-statistic from df+1 normal samples (df degrees of freedom);
    # ddof=1 gives the sample standard deviation with the n-1 denominator
    rv = np.random.normal(size=(df+1)*N).reshape(df+1, N)
    # histogram of the t-statistic below is plotted
    rvst = np.mean(rv, axis=0)/(np.std(rv, axis=0, ddof=1)/np.sqrt(df+1.))

    # Method 2: the two lines above can be replaced by the three lines below,
    # dividing a standard normal by the square root of a chi-squared over df
    # rv = np.square(np.random.normal(size=df*N).reshape(df, N))
    # rvs = np.sum(rv, axis=0)
    # rvst = np.random.normal(size=N)/np.sqrt(rvs/df)

    y, x = np.histogram(rvst, bins=bins_array, density=True)

    ax[0].plot(x[0:-1], y)
    ax[1].plot(bins_array, st.t.pdf(bins_array, df))
    legend.append(f'\N{GREEK SMALL LETTER NU}={df}')
ax[0].legend(legend); ax[1].legend(legend)
ax[0].set_title('simulation'); ax[1].set_title('theory')
plt.show()
Student’s t-distributions for different degrees of freedom
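As a cross-check, a small sketch beyond the original code: scipy can draw t-distributed samples directly, and their variance should match the theoretical value df/(df−2) for df > 2.

# added cross-check: draw t-distributed samples directly with scipy
df = 10
direct = st.t.rvs(df, size=100000)
print(f'direct variance = {direct.var():.3f}')
print(f'theory variance = {df/(df-2):.3f}')   # Var[t_df] = df/(df-2) for df > 2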

Generation of the chi-squared distribution from samples of normal distributions

The chi-squared distribution is used in chi-squared tests for the goodness of fit of an observed distribution to a theoretical one, and in constructing the confidence interval for the population standard deviation of a normal distribution from a sample standard deviation. If x_1, …, x_k are independent random variables from the standard normal distribution, then the sum of their squares, x_1² + … + x_k², follows the chi-squared distribution with k degrees of freedom. The following code plots the histogram of the values obtained by summing the squares of samples from the standard normal distribution for different degrees of freedom.

dfs = np.arange(1, 11)                        # degrees of freedom
N, nbins = 1000000, 200                       # nbins avoids shadowing the built-in bin()
bins = np.linspace(0, 15, nbins)

titles, legend = ['simulation', 'theory'], []
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

for df in dfs:
    rv = np.square(np.random.normal(size=df*N).reshape(df, N))  # squared
    rvst = np.sum(rv, axis=0)                                   # and summed
    y, x = np.histogram(rvst, bins=bins, density=True)

    ax[0].plot(x[:-1], y)
    ax[1].plot(bins, st.chi2.pdf(bins, df))
    legend.append(f'\N{GREEK SMALL LETTER NU}={df}')
for i in range(2):
    ax[i].legend(legend)
    ax[i].set_ylim(0, 0.4)
    ax[i].set_title(titles[i])
plt.show()
Chi-squared distributions for different degrees of freedom
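A quick numerical check, added here as a sketch: a chi-squared variable with k degrees of freedom has mean k and variance 2k, which the simulated sums should reproduce.

# added check of the first two moments of simulated chi-squared samples:
# the mean should be ~k and the variance ~2k for k degrees of freedom
k = 5
q = np.sum(np.square(np.random.normal(size=(k, 100000))), axis=0)
print(f'mean = {q.mean():.3f} (theory {k}), variance = {q.var():.3f} (theory {2*k})')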

Generation of the F-distribution from samples of normal distributions

The F-distribution with d_1 and d_2 degrees of freedom is the distribution of

F = (S_1 / d_1) / (S_2 / d_2),

where S_1 and S_2 are independent random variables with chi-squared distributions with d_1 and d_2 degrees of freedom, respectively. All analysis of variance relies on this distribution: the ratio of two independent chi-squared random variables, each divided by its degrees of freedom. The code for generating the F-distribution is similar to that for the chi-squared distribution:

df = [2, 10, 50, 100]
N = 100000
bins = 500
bins_array = np.linspace(np.finfo(float).tiny, 4, bins)
fig, ax = plt.subplots(2, 4, figsize=(16, 7))

# two independent sets of chi-squared sums: one for the numerator (i=0)
# and one for the denominator (i=1)
rvst = []
for i in [0, 1]:
    rvs = []
    for j in range(len(df)):
        # same construction as for the chi-squared distribution
        rv = np.square(np.random.normal(size=df[j]*N).reshape(df[j], N))
        rvs.append(np.sum(rv, axis=0))
    rvst.append(rvs)

for i in range(len(df)):
    legend = []
    for j in range(len(df)):
        # taking the respective degrees of freedom into account
        par = (rvst[0][i]/df[i])/(rvst[1][j]/df[j])

        y, x = np.histogram(par, bins=bins_array, density=True)
        ax[0][i].plot(x[0:-1], y)
        ax[1][i].plot(bins_array, st.f.pdf(bins_array, df[i], df[j]))
        legend.append(f'\N{GREEK SMALL LETTER NU}\N{SUBSCRIPT TWO}={df[j]}')
    ax[0][i].set_title(f'simulation (\N{GREEK SMALL LETTER NU}\N{SUBSCRIPT ONE}={df[i]})')
    ax[1][i].set_title(f'theory (\N{GREEK SMALL LETTER NU}\N{SUBSCRIPT ONE}={df[i]})')
for k in [0, 1]:
    for i in range(len(df)):
        ax[k][i].legend(legend)
        ax[k][i].set_xlim(0, 6)
plt.show()
F-distributions for different degrees of freedom
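One relationship worth remembering, illustrated here with a small added sketch: the square of a t-distributed variable with ν degrees of freedom follows an F-distribution with (1, ν) degrees of freedom.

# added illustration of the identity t(df)^2 ~ F(1, df):
# squared t samples should match the F(1, df) distribution
df = 10
t2 = st.t.rvs(df, size=100000)**2
print(f'squared-t sample mean = {t2.mean():.3f}')
print(f'F(1, {df}) theoretical mean = {st.f.mean(1, df):.3f}')   # = df/(df-2)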

Summary

Starting from a uniform distribution, we have studied the numerical generation of several distributions using Python code. As a metrologist, I would like to point out that the t-distribution describes the deviation of the mean, while the chi-squared distribution describes the deviation of the standard deviation. Note that the former is built from a sum of the samples, just like the mean, whereas the latter is built from a sum of the squared samples, just like the standard deviation. The F-distribution is essential for the analysis of variance. One more note: the t-distribution, the chi-squared distribution, and the F-distribution are valid only for populations following a normal distribution. Is it now easier to remember?

This article was written with reference to Wikipedia.
