Probability Distribution, Part-2

bhargavi sikhakolli · Published in AI Skunks · Mar 13, 2023

Topics Covered

  • Central Limit Theorem
  • T-test
  • Poisson Distribution
  • Exponential Distribution

Central Limit Theorem

The Central Limit Theorem (CLT) is a fundamental idea in statistics that states that, regardless of the shape of the original distribution, the average of a large number of independent and identically distributed random variables will tend to be normally distributed.

In simpler terms, this means that if you take the average of a large enough sample of data and repeat that process many times, the distribution of those averages will tend to look like a bell-shaped curve, even if the original data doesn’t look that way. The larger each sample, the more closely the distribution of the averages will resemble a normal distribution.

import numpy as np
import matplotlib.pyplot as plt

# generate random numbers from a non-normal distribution
np.random.seed(123)
data = np.random.exponential(size=1000)

# calculate the mean of the first n samples (where n = [5, 10, 50, 100, 500, 1000])
ns = [5, 10, 50, 100, 500, 1000]
means = []
for n in ns:
    means.append(data[:n].mean())

# plot the sample mean against the sample size
plt.plot(ns, means, 'o-')
plt.axhline(1.0, color='gray', linestyle='--', label='true mean of Exp(1)')
plt.xscale('log')
plt.xlabel('sample size n')
plt.ylabel('sample mean')
plt.legend()
plt.show()

Plotting the mean of progressively larger samples shows it settling toward the true mean of the exponential distribution (1.0). The Central Limit Theorem, however, is a statement about the distribution of many sample means, so the next example draws 100 samples of each size and plots histograms of their means.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)

# generate random numbers from a non-normal distribution
data = np.random.exponential(size=1000)

# draw 100 samples with replacement for each size n = 10, 50, 100, 500, 750, and 1000
n_sizes = [10, 50, 100, 500, 750, 1000]
samples = {}
for n in n_sizes:
    samples[n] = [np.random.choice(data, size=n, replace=True).mean() for i in range(100)]

# plot the histograms of the sample means
fig, ax = plt.subplots(2, 3, figsize=(15, 8))
ax = ax.ravel()
for i, n in enumerate(n_sizes):
    ax[i].hist(samples[n], bins=20, color='red', alpha=0.5)
    ax[i].set_title(f'n = {n}')

plt.tight_layout()
plt.show()

The resulting histograms show that as the sample size increases, the distribution of the sample means becomes increasingly close to a normal distribution, which is an example of the Central Limit Theorem in action.
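
To put a number on “increasingly close to a normal distribution”, here is a minimal sketch (not part of the original example) that runs a Shapiro-Wilk normality test on the simulated sample means; the seed and the number of repetitions are arbitrary, illustrative choices:

import numpy as np
from scipy.stats import shapiro

np.random.seed(123)                       # arbitrary seed, matching the example above
data = np.random.exponential(size=1000)   # non-normal population

for n in [10, 50, 100, 500, 1000]:
    # 100 sample means for each sample size, as in the example above
    means = [np.random.choice(data, size=n, replace=True).mean() for _ in range(100)]
    stat, p = shapiro(means)
    # a larger p-value means the test finds no evidence against normality
    print(f'n = {n:4d}: Shapiro-Wilk p-value = {p:.3f}')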

The Central Limit Theorem has many real-world applications, including:

1. Statistical inference: The Central Limit Theorem provides a foundation for many statistical inference techniques, such as hypothesis testing and confidence intervals. By assuming that the sample mean is approximately normally distributed, we can make predictions and inferences about the population mean (a small simulation after this list illustrates the idea).

2. Economics: The Central Limit Theorem is frequently used in economics to model market trends and make predictions about future market behavior. For example, stock prices and returns are often modeled as being normally distributed, and the theorem is used to estimate the expected returns and risks associated with investing in the stock market.
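
As a quick illustration of the first point, the sketch below (illustrative only; the seed, sample size, and number of trials are arbitrary choices) checks that a normal-approximation 95% confidence interval for the mean of a clearly non-normal, skewed population still covers the true mean about 95% of the time once the sample is reasonably large, which is exactly what the Central Limit Theorem promises:

import numpy as np

np.random.seed(0)
true_mean = 1.0      # mean of a standard exponential population
n = 200              # sample size per experiment
n_trials = 2000      # number of repeated experiments
z = 1.96             # normal critical value for a 95% interval

covered = 0
for _ in range(n_trials):
    sample = np.random.exponential(size=n)
    se = sample.std(ddof=1) / np.sqrt(n)              # estimated standard error of the mean
    lo, hi = sample.mean() - z * se, sample.mean() + z * se
    covered += (lo <= true_mean <= hi)

print('Empirical coverage of the 95% interval:', covered / n_trials)  # close to 0.95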

Let’s generate a random sample from a population and look at the sampling distribution of its mean.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Define the population parameters
mu = 100
sigma = 20

# Generate a sample of size n=30
n = 30
sample = np.random.normal(mu, sigma, n)

# Calculate the sample mean
x_bar = sample.mean()

# Plot the population distribution and the sampling distribution of the mean
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
population = stats.norm.pdf(x, mu, sigma)
sample_dist = stats.norm.pdf(x, x_bar, sigma / np.sqrt(n))

plt.plot(x, population, label='Population')
plt.plot(x, sample_dist, label='Sampling distribution of the mean')
plt.legend()
plt.show()

This code generates a random sample from a population with mean µ and standard deviation σ and calculates the sample mean x_bar. It then plots the population distribution alongside the sampling distribution of the mean, showing that x_bar is approximately normally distributed with mean µ and standard deviation σ / sqrt(n) (the standard error).

With this information, we can use hypothesis testing or confidence intervals to make inferences about the population mean. For example, we can use a t-test to determine if the sample mean x_bar is significantly different from the population mean µ, or we can use a confidence interval to estimate the range of values that the population mean is likely to fall within.

T-Test

A t-test is a statistical hypothesis test that is used to determine whether the mean of a sample is significantly different from a known or hypothesized population mean. The t-test is commonly used in situations where the population standard deviation is unknown or when the sample size is small.

import numpy as np
import scipy.stats as stats

# Define the population parameters
mu = 100
sigma = 20

# Generate a sample of size n=30
n = 30
sample = np.random.normal(mu, sigma, n)

# Calculate the sample mean and standard deviation (ddof=1 gives the sample estimate)
x_bar = sample.mean()
s = sample.std(ddof=1)

# Perform a t-test to determine if the sample mean is significantly different from the population mean
t_statistic, p_value = stats.ttest_1samp(sample, mu)
if p_value < 0.05:
    print("The sample mean is significantly different from the population mean (p = {:.4f})".format(p_value))
else:
    print("The sample mean is not significantly different from the population mean (p = {:.4f})".format(p_value))

# Calculate a 95% confidence interval for the population mean
alpha = 0.05
t_critical = stats.t.ppf(1-alpha/2, n-1)
margin_of_error = t_critical * s / np.sqrt(n)
confidence_interval = (x_bar - margin_of_error, x_bar + margin_of_error)
print("The 95% confidence interval for the population mean is ({:.2f}, {:.2f})".format(confidence_interval[0], confidence_interval[1]))

Output (from one random draw; without a fixed seed the exact numbers will vary):

The sample mean is not significantly different from the population mean (p = 0.7498)

The 95% confidence interval for the population mean is (94.55, 107.53)

This code performs a one-sample t-test and calculates a 95% confidence interval for the population mean.

The results of the t-test tell us whether the sample mean is significantly different from the population mean. If the p-value is less than 0.05, we reject the null hypothesis that the sample mean is equal to the population mean and conclude that the sample mean is significantly different from the population mean. On the other hand, if the p-value is greater than or equal to 0.05, we fail to reject the null hypothesis and conclude that the sample mean is not significantly different from the population mean.
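
For intuition, the same decision can be made by comparing the t statistic to a critical value instead of looking at the p-value. The minimal sketch below reuses the population parameters from the example above but adds an arbitrary seed (which the original code did not use), so its numbers will differ from the output shown earlier:

import numpy as np
import scipy.stats as stats

np.random.seed(0)                            # arbitrary, illustrative seed
mu, sigma, n = 100, 20, 30
sample = np.random.normal(mu, sigma, n)

x_bar = sample.mean()
s = sample.std(ddof=1)                       # sample standard deviation

t_stat = (x_bar - mu) / (s / np.sqrt(n))     # same statistic ttest_1samp computes
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 1)    # two-sided critical value at alpha = 0.05

# |t| > t_crit is equivalent to p < 0.05 for the two-sided one-sample t-test
print('t =', t_stat, ', critical value =', t_crit, ', reject H0:', abs(t_stat) > t_crit)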

The confidence interval gives us an estimate of the range of values within which the true population mean is likely to lie, with a confidence level of 95%. In other words, we can be 95% confident that the population mean lies within the calculated confidence interval.
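
scipy can also produce the confidence interval directly from the t distribution. The following sketch (again with an arbitrary, illustrative seed) should agree with the manual margin-of-error calculation above:

import numpy as np
import scipy.stats as stats

np.random.seed(0)                         # arbitrary seed for reproducibility
sample = np.random.normal(100, 20, 30)    # same population parameters as above

n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)

# 95% confidence interval for the mean, using the t distribution with n-1 degrees of freedom
ci = stats.t.interval(0.95, n - 1, loc=x_bar, scale=s / np.sqrt(n))
print("95% confidence interval:", ci)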

How these results help to move forward:

These results help us move forward by giving us information about the population mean based on the sample data. If the sample mean is significantly different from the population mean, we can conclude that the sample is not representative of the population and further investigation may be necessary. If the sample mean is not significantly different from the population mean, we can use the confidence interval to make inferences about the population mean.

Poisson Distribution

The Poisson distribution models the probability of a given number of events occurring in a fixed interval of time or space, given that these events occur at a constant average rate and independently of the time since the last event. It has one parameter, lambda, which represents the average rate of events per interval.

Suppose we are analyzing the number of customers who enter a store per hour. We observe that on average, 10 customers enter the store per hour. We can model the distribution of the number of customers using Poisson distribution, which is a discrete probability distribution that models the probability of a given number of events occurring in a fixed interval of time or space, given the average rate of occurrence of the event.

The Poisson distribution has a single parameter, lambda (λ), which represents the average rate of occurrence of the event. In this case, λ = 10, since on average, 10 customers enter the store per hour. The probability mass function (PMF) of the Poisson distribution is given by:

P(X = k) = (λ^k * e^(-λ)) / k!

where X is the random variable representing the number of customers, k is the number of customers, e is the mathematical constant approximately equal to 2.71828, and k! represents the factorial of k.

Using this PMF, we can calculate the probability of different numbers of customers entering the store per hour. For example, the probability of exactly 8 customers entering the store in one hour is:

P(X = 8) = (10^8 * e^(-10)) / 8! ≈ 0.113
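
As a quick sanity check (a small sketch, not in the original article), the same probability can be computed by hand and with scipy.stats.poisson:

import math
from scipy.stats import poisson

lam = 10
k = 8

by_hand = lam**k * math.exp(-lam) / math.factorial(k)   # (λ^k * e^(-λ)) / k!
via_scipy = poisson.pmf(k, lam)

print(by_hand)    # ~0.1126
print(via_scipy)  # ~0.1126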

Similarly, we can calculate the probabilities of other numbers of customers entering the store per hour. We can also calculate the mean and variance of the Poisson distribution, which are both equal to λ. In this case, the mean and variance are both equal to 10, so the distribution is centered around 10; for a λ this large it is roughly symmetric, although in general the Poisson distribution is right-skewed.

Overall, Poisson distribution is useful for modeling events that occur randomly in time or space, such as the number of customers entering a store, the number of defects in a product, or the number of calls to a customer service center per hour.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Generate Poisson distributed data
lam = 5 # average rate of events per interval
data = poisson.rvs(lam, size=1000)

# Plot histogram of the data
plt.hist(data, bins=20, density=True)

# Plot probability mass function
x = np.arange(0, 20)
pmf = poisson.pmf(x, lam)
plt.plot(x, pmf, 'ro', ms=8)

# Calculate mean and variance
mean, var = poisson.stats(lam, moments='mv')
print('Mean:', mean)
print('Standard deviation:', np.sqrt(var))

Insights: The Poisson distribution is often used to model rare events that occur independently of each other, such as the number of arrivals at a service center, the number of defects in a batch of products, or the number of accidents on a highway. The mean and variance of the Poisson distribution are both equal to lambda, which represents the average rate of events per interval.
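
A small illustrative check of that last point: drawing a large Poisson sample (the seed and sample size below are arbitrary choices) and comparing its sample mean with its sample variance, both of which should be close to lambda:

import numpy as np
from scipy.stats import poisson

np.random.seed(42)                     # arbitrary seed
lam = 5
data = poisson.rvs(lam, size=100_000)  # large sample so the estimates are stable

print('Sample mean:    ', data.mean())   # close to 5
print('Sample variance:', data.var())    # also close to 5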

Exponential Distribution

The exponential distribution models the probability of a given time interval between two successive events occurring in a Poisson process, given that these events occur at a constant average rate and independently of the time since the last event. It has one parameter, lambda, which represents the average rate of events per unit of time.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

# Generate exponential distributed data
lam = 0.5 # average rate of events per unit of time
data = expon.rvs(scale=1/lam, size=1000)

# Plot histogram of the data
plt.hist(data, bins=20, density=True)

# Plot probability density function
x = np.linspace(0, 10, 100)
pdf = expon.pdf(x, scale=1/lam)
plt.plot(x, pdf)

# Calculate mean and variance
mean, var = expon.stats(scale=1/lam, moments='mv')
print('Mean:', mean)
print('Standard deviation:', np.sqrt(var))

The exponential distribution is often used to model the time between two successive events in a Poisson process, such as the time between arrivals at a service center, the time between failures of a machine, or the time between earthquakes. The mean of the exponential distribution is 1/lambda, which represents the average time between events, and its variance is 1/lambda².
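
To make the connection between the two distributions concrete, the sketch below (illustrative only; the seed and the number of simulated gaps are arbitrary) simulates a Poisson process by accumulating exponential inter-arrival times and checks that the number of arrivals per unit interval behaves like a Poisson(lambda) count:

import numpy as np
from scipy.stats import expon

np.random.seed(42)
lam = 0.5                                    # average rate of events per unit time

# simulate arrival times by accumulating exponential inter-arrival gaps
gaps = expon.rvs(scale=1/lam, size=100_000)
arrival_times = np.cumsum(gaps)

# count how many arrivals fall in each unit-length interval [0,1), [1,2), ...
t_max = int(arrival_times[-1])
counts = np.histogram(arrival_times, bins=np.arange(0, t_max + 1))[0]

print('Mean gap (should be ~1/lambda = 2):', gaps.mean())
print('Mean count per unit interval (should be ~lambda = 0.5):', counts.mean())
print('Variance of counts (should also be ~lambda = 0.5):', counts.var())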
