Crash Course in Data: Probability Distribution, Part-1

Topics Covered

bhargavi sikhakolli
AI Skunks
8 min read · Mar 13, 2023


  • Probability
  • Random variables
  • CDF (Cumulative Distribution Function)
  • PDF (Probability Density Function)
  • Uniform Distribution
  • Normal Distribution

Suppose you have a fair six-sided die, and you want to know the probability of rolling a 4.

Probability = Favorable outcome / Total possible outcomes

Here, the probability is 1/6.
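This counting rule can be checked in a couple of lines of Python (the outcome list below is just the six faces of the die):

```python
# Probability = favorable outcomes / total possible outcomes
outcomes = [1, 2, 3, 4, 5, 6]                 # all faces of a fair die
favorable = [o for o in outcomes if o == 4]   # we only want a 4

probability = len(favorable) / len(outcomes)
print(probability)  # 0.1666... (i.e. 1/6)
```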

Why is Probability important?

Probability matters in many parts of daily life: making decisions, gaming, health and medicine, finance, weather forecasting, and more.

These are just a few examples of how probability plays a role in our daily lives. Probability provides a framework for quantifying and analyzing uncertainty, which enables us to make more informed decisions, predictions, and assessments.

Probability vs Probability Distribution:

Probability refers to the likelihood of a particular event occurring, expressed as a value between 0 (impossible) and 1 (certain) while Probability distribution refers to the set of all possible values of a random variable and the probabilities associated with each value. It describes the distribution of values that a random variable can take.

Note: The total probability of all events sums up to 1

Introduction to Probability Distribution

  • A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can take. It assigns probabilities to each possible outcome of a random event.
  • It is important because it allows us to make predictions and analyse uncertainty in the data by understanding the distribution of the data and its characteristics, such as the mean and variance.
  • The information provided by the probability distribution helps in decision making, hypothesis testing and modelling of random events.

Eg: Imagine you are flipping a fair coin. The possible outcomes are either heads or tails, and each outcome has an equal chance of occurring, so the probability of getting heads or tails is 0.5.

This idea can be extended to any random event, where the random variable can take different values.

Example

The height of people in a certain population is a random variable, and the probability distribution of the height can be modelled using a histogram or a function like a normal distribution.

In simple terms, probability distribution tells us the chances of different outcomes in a random event.

There are many types of probability distributions, with new ones being developed as needed to model different phenomena. Some common types include:

  • Normal distribution
  • Binomial distribution
  • Poisson distribution
  • Exponential distribution
  • Uniform distribution
  • Bernoulli distribution

The specific probability distribution used depends on the nature of the random variable being modelled.

Random Variables

A random variable is a variable that can take on different values based on the outcome of a random event. In probability theory, a random variable is a numerical outcome of a random event. There are two types of random variables: discrete and continuous.

Discrete random variables

  • can only take on specific values, such as the number of heads when flipping a coin once, e.g. {0, 1}

Continuous random variables

  • can take on any value within a specified interval, e.g. the height of a person in a range such as [160, 190] cm
  • probability density functions are used to represent the distributions of continuous random variables

Let’s understand probability density function (PDF) vs probability mass function (PMF)

PDF: The probability density function is used to represent the distribution of continuous random variables. It defines the probability density of a continuous random variable at a given value, which is the derivative of its cumulative distribution function. The PDF gives us information about the likelihood of observing values in a given interval, rather than specific values. The PDF is always positive and its integral over the entire range of the random variable must equal 1.

import numpy as np
import matplotlib.pyplot as plt

# Generate random ages data
np.random.seed(0)
mean_age = 30
std_age = 10
ages = np.random.normal(mean_age, std_age, 1000)

# Plot histogram of the ages data
plt.hist(ages, bins=20, density=True, color='blue', alpha=0.5)

# Plot the pdf using a Gaussian distribution
def pdf(x, mean, std):
    return (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-0.5 * ((x - mean) / std)**2)

x = np.linspace(0, 60, 100)
y = pdf(x, mean_age, std_age)
plt.plot(x, y, 'r')

plt.xlabel('Age')
plt.ylabel('Probability Density')
plt.title('Age Distribution of People')
plt.show()

Consider the distribution above. The code generates random ages and plots a histogram showing the frequency of ages in the data. The red curve is the PDF, which estimates the underlying distribution of the ages.

PMF: The probability mass function is used to represent the distribution of discrete random variables. It defines the probability of a discrete random variable taking on a specific value. The PMF assigns a probability to each possible value that the random variable can take on, and the sum of all the probabilities must equal 1.
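As a minimal sketch, the PMF of a fair six-sided die can be written as a plain dictionary: each face gets probability 1/6, and the probabilities sum to 1.

```python
# PMF of a fair six-sided die: each face has probability 1/6
faces = [1, 2, 3, 4, 5, 6]
pmf = {face: 1 / 6 for face in faces}

print(pmf[4])              # probability of rolling a 4: 1/6
print(sum(pmf.values()))   # total probability: 1 (up to float rounding)
```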

Cumulative Distribution Function (CDF)

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99]

def cumulative_density_function(x, data):
    return stats.norm.cdf(x, loc=np.mean(data), scale=np.std(data))

x = np.linspace(min(ages), max(ages), num=100)
y = [cumulative_density_function(i, ages) for i in x]

plt.plot(x, y)
plt.xlabel("Age")
plt.ylabel("Cumulative Distribution Function")
plt.title("Cumulative Distribution Function of Age")
plt.show()

The difference between a cumulative distribution function (CDF) and a probability density function (PDF) is that a PDF gives the probability density (probability per unit value) of a random variable at a certain point, whereas a CDF gives the cumulative (total) probability of the random variable taking a value up to that point.

For example, continuing with the age example, the PDF at 40 gives the probability density at that age, while the CDF at 40 gives the probability that a randomly chosen person is 40 or younger, i.e. the area under the PDF from the lowest age up to 40.
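The contrast can be seen directly with scipy.stats; the mean of 30 and standard deviation of 10 below are the illustrative values used in the earlier plot:

```python
from scipy import stats

age_dist = stats.norm(loc=30, scale=10)   # illustrative age distribution

density_at_40 = age_dist.pdf(40)   # probability density at age 40
prob_up_to_40 = age_dist.cdf(40)   # probability of being 40 or younger

print(density_at_40)   # ~0.0242 (a density, not a probability)
print(prob_up_to_40)   # ~0.8413 (about 84% are 40 or younger here)
```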

Uniform Distribution

A uniform distribution assigns equal probability to every value in its range. For example, if you roll a fair six-sided die, the possible values are 1 to 6, and each has an equal probability of 1/6 of being rolled.

The uniform distribution is used in many real-world applications, such as in simulations, random number generation, and statistical hypothesis testing.

f(x) = 1/(b - a), for a <= x <= b

f(x) = 0, otherwise

where f(x) is the probability density at x, and (b - a) is the length of the interval [a, b].

The uniform distribution has the following properties:

Mean: The mean (expected value) of the uniform distribution is given by: μ = (a + b)/2

Variance: The variance of the uniform distribution is given by: σ² = (b - a)² / 12

Median: The median of the uniform distribution is equal to the midpoint of the range, (a + b)/2.
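These formulas can be sanity-checked against scipy.stats.uniform; note that scipy parameterizes the interval as [loc, loc + scale], so scale must be b - a (the endpoints 2 and 10 below are arbitrary):

```python
from scipy import stats

a, b = 2.0, 10.0
dist = stats.uniform(loc=a, scale=b - a)   # uniform on [a, b]

print(dist.mean())     # (a + b)/2 = 6.0
print(dist.var())      # (b - a)**2 / 12 = 5.333...
print(dist.median())   # midpoint of the range = 6.0
```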

%matplotlib inline 
# %matplotlib inline is a magic function in IPython that displays images in the notebook
# Line magics are prefixed with the % character and work much like OS command-line calls
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas.testing as tm
from scipy import stats
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Make plots larger
plt.rcParams['figure.figsize'] = (10, 6)

#------------------------------------------------------------
# Define the distribution parameters to be plotted
W_values = [1.0, 3.0, 5.0]
linestyles = ['-', '--', ':']
mu = 0
x = np.linspace(-4, 4, 1000)

#------------------------------------------------------------
# plot the distributions
fig, ax = plt.subplots(figsize=(10, 5))

for W, ls in zip(W_values, linestyles):
    left = mu - 0.5 * W
    dist = stats.uniform(left, W)

    plt.plot(x, dist.pdf(x), ls=ls, c='black',
             label=r'$\mu=%i,\ W=%i$' % (mu, W))

plt.xlim(-4, 4)
plt.ylim(0, 1.2)

plt.xlabel('$x$')
plt.ylabel(r'$p(x|\mu, W)$')
plt.title('Uniform Distribution')

plt.legend()
plt.show()

# Adapted from http://www.astroml.org/book_figures/chapter3/fig_uniform_distribution.html

  • A single event such as one die roll follows a uniform distribution, i.e. every outcome is equally likely, but sums and averages of many independent events tend to follow a normal distribution.

Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetrical around its mean and characterized by its mean and standard deviation. It is one of the most widely used distributions in statistics and is commonly used to model real-world data.

Graphically, a normal distribution can be visualized as a bell-shaped curve with the mean (average) at the center and the standard deviation representing the spread of the data.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

mean = 0
standard_deviation = 1

x = np.linspace(mean - 3*standard_deviation, mean + 3*standard_deviation, 100)
y = (1 / (np.sqrt(2 * np.pi) * standard_deviation)) * np.exp(-0.5 * (x - mean) ** 2 / standard_deviation ** 2)

sns.lineplot(x=x, y=y)
plt.show()

This plot has a mean of 0 and a standard deviation of 1.

Why is Normal Distribution so popular?

Normal distribution is widely used in many fields, such as statistics, biology, economics, engineering, and psychology, due to several reasons:

Simplicity: Normal distribution is mathematically simple to describe and manipulate. This makes it easy to use in many different applications and to fit to a wide range of data sets.

Central Limit Theorem: The central limit theorem states that the sum of a large number of independent and identically distributed random variables is approximately normally distributed. This theorem has many practical applications in fields such as finance and engineering, and makes normal distribution a useful model in many real-world scenarios.
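A quick simulation illustrates the theorem: individual Uniform(0, 1) draws are flat, but averages of 50 of them pile up around the population mean of 0.5 with a spread close to σ/√n (the sample size, trial count, and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 10_000
samples = rng.uniform(0.0, 1.0, size=(trials, n))   # flat individual draws
sample_means = samples.mean(axis=1)                 # one average per trial

# Uniform(0, 1) has mean 0.5 and std sqrt(1/12) ≈ 0.2887
print(sample_means.mean())   # close to 0.5
print(sample_means.std())    # close to 0.2887 / sqrt(50) ≈ 0.0408
```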

Versatility: Normal distribution is widely observed in many natural and social phenomena, making it a useful model for a wide range of applications. Additionally, normal distribution is parameterized by just two parameters (mean and standard deviation), which makes it easy to fit to a data set and to make predictions about the probabilities of different outcomes.

Mathematical Tractability: Normal distribution has many important mathematical properties, such as its symmetry and the availability of closed-form expressions for probabilities and quantiles. This makes it a useful tool for many types of statistical analysis, such as hypothesis testing and regression analysis.

Common Assumptions: In many fields, normal distribution is often used as a default assumption for data that is approximately symmetrical and does not have strong outliers. This is because normal distribution provides a convenient starting point for many statistical analyses and because it is widely recognized and understood.

These are some of the reasons why normal distribution is so popular. While normal distribution is not always the best choice for a particular data set or application, its simplicity, versatility, and mathematical tractability make it a useful tool in many real-world scenarios.

Central Limit Theorem, T-test and more are covered in part-2 https://medium.com/@bhargavi.sikhakolli31/probability-distribution-part-2-e286cf7cce99
