Probability Distributions — Statistics for machine learning

Abdallah Ashraf
9 min read · Sep 30, 2023


Understanding Probability Distributions

A probability distribution is a mathematical function that describes the likelihood of different possible outcomes for a random variable. Probability distributions allow us to quantify uncertainty and make predictions about future events or measurements based on past data.

There are several common types of probability distributions used in statistics. The binomial distribution describes the probability of a certain number of successes in a fixed number of yes/no experiments, like coin flips.

The Poisson distribution models the number of rare, independent events occurring in a fixed time period or space, like phone calls arriving at a call center.

The uniform distribution assigns equal probability to all outcomes within a range.

Probability distributions are often depicted graphically using histograms or probability density functions. They can also be expressed numerically through probability mass functions or cumulative distribution functions.

Real-world data is usually summarized using empirical frequency distributions, which show the observed frequencies of different outcomes in a sample. However, probability distributions aim to describe the underlying population rather than a specific sample.

Probability distributions have many applications. They allow us to calculate probabilities, means, variances and other statistics for random variables.

In hypothesis testing, we use probability distributions like the normal, t and F distributions to determine p-values and evaluate whether sample data is consistent with a hypothesized population distribution. Probability distributions also play a key role in areas like reliability engineering, queuing theory, and risk analysis.

Understanding probability distributions is fundamental to statistical modeling and making inferences from data. They provide a framework for quantifying uncertainty and reasoning about future outcomes based on past observations and our knowledge of the processes that generated those observations. Mastering common probability distributions opens the door to more advanced statistical techniques across many domains.

Example

Suppose a shipping company wants to estimate delivery times for packages traveling between two cities. They record the delivery time of 100 random packages and construct a histogram showing the frequency of different time ranges (e.g. number of packages delivered in 1–2 days, 2–3 days, etc.).

This gives a rough idea of the probability distribution: most packages are delivered within 5 days, and very few take over a week. But estimates from a sample of only 100 packages aren't very precise.

The shipping company notices the delivery times appear symmetrically distributed around a central value, with thin tails on either side: a classic bell curve shape. This suggests the times may follow a normal distribution.

By fitting a normal distribution to the sample data, they can estimate the mean and standard deviation of delivery times. Even with a small initial sample, this fitted distribution allows much more accurate probability calculations.

For example, they can precisely determine the probability of a package arriving within 5 days, or taking longer than a week. They can also predict how delivery times may change with factors like package weight or weather conditions.

The normal distribution provides a convenient idealized model even when the true population is only approximated based on limited observations. This improves predictive power over simply examining frequency tables from small samples.
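
To make this concrete, here is a minimal sketch of that workflow in Python with NumPy and SciPy. The delivery times are simulated and the mean and standard deviation are invented for illustration, not taken from real shipping data.

```python
import numpy as np
from scipy import stats

# Simulated delivery times in days for 100 packages (illustrative numbers only)
rng = np.random.default_rng(42)
sample = rng.normal(loc=3.5, scale=1.2, size=100)

# Fit a normal distribution by estimating its mean and standard deviation
mu, sigma = stats.norm.fit(sample)

p_within_5 = stats.norm.cdf(5, loc=mu, scale=sigma)  # P(delivered within 5 days)
p_over_7 = stats.norm.sf(7, loc=mu, scale=sigma)     # P(takes longer than a week)

print(f"Estimated mean {mu:.2f} days, std dev {sigma:.2f} days")
print(f"P(within 5 days) = {p_within_5:.3f}, P(over 7 days) = {p_over_7:.3f}")
```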

Probability distributions in machine learning

Machine learning algorithms rely on probability distributions to model real-world data and make predictions. At their core, many machine learning techniques involve estimating probability distributions from sample data and using those distributions to generalize to new examples.

For example, Bayesian networks represent conditional dependencies between variables as probability distributions. By quantifying relationships in terms of probabilities, Bayesian models can infer likely outcomes even when some information is unknown or missing. Similarly, Naive Bayes classifiers assume independence between features and estimate each feature’s probability distribution conditioned on the class. Making predictions then involves calculating posterior probabilities based on these distributions.
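
As a rough illustration of the Naive Bayes idea, the sketch below uses scikit-learn's GaussianNB on a tiny made-up dataset; the feature values and labels are invented purely for demonstration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up dataset: two numeric features, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.2],   # class 0
              [3.0, 0.9], [3.2, 1.1], [2.9, 0.8]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# GaussianNB fits one normal distribution per feature per class,
# then combines them under the feature-independence assumption.
model = GaussianNB()
model.fit(X, y)

new_point = np.array([[1.1, 2.0]])
print(model.predict_proba(new_point))  # posterior probabilities for each class
print(model.predict(new_point))        # predicted class label
```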

In deep learning, neural networks often model input and output data as probability distributions rather than single values. For images, pixel intensities may be modeled as samples from a multivariate Gaussian distribution.

Recurrent neural networks represent sequences as joint probability distributions factored across timesteps. Generative adversarial networks pit a generator network against a discriminator, learning to approximate the data distribution through an adversarial training process.

Reinforcement learning agents also rely on probability distributions, both to represent environment dynamics and to optimize policies. Model-based RL approaches explicitly model state transition probabilities, while model-free methods implicitly learn value distributions to guide decisions. Probability distributions also allow RL agents to reason about uncertainty, generalize across states, and optimize exploration strategies.

For machine learning practitioners, a solid understanding of common probability distributions like multivariate Gaussian, Dirichlet, and exponential family distributions is valuable.

Properly modeling data as probability distributions opens up powerful techniques like Bayesian modeling, density estimation, and probabilistic programming. It also provides principled ways to quantify and reduce uncertainty, which is critical for many real-world machine learning applications.

Different Probability Distributions

1. Normal / Gaussian Distribution (the Probability Bell Curve)

There are many cases where the data tends to be around a central value with no bias left or right, and it gets close to a “Normal Distribution” like this:

The blue curve is a Normal Distribution. The yellow histogram shows some data that follows it closely, but not perfectly (which is usual).

Many things closely follow a Normal Distribution:

  • heights of people
  • marks on a test
  • Ratings in games/sports
  • Lifespans of mechanical/electronic parts
  • Temperature variations

Properties of the normal distribution:

  • mean = median = mode
  • symmetry about the center
  • 50% of values are less than the mean and 50% are greater than the mean

When we calculate the standard deviation we find that generally:

68% of values are within 1 standard deviation of the mean

95% of values are within 2 standard deviations of the mean

99.7% of values are within 3 standard deviations of the mean

It is good to know the standard deviation, because we can say that any value is:

  • likely to be within 1 standard deviation (68 out of 100 should be)
  • very likely to be within 2 standard deviations (95 out of 100 should be)
  • almost certainly within 3 standard deviations (997 out of 1000 should be)
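
You can check these percentages numerically. The following sketch simulates a large normal sample with NumPy (the mean and standard deviation chosen are arbitrary) and counts how many values fall within 1, 2, and 3 standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
mean, std = 100.0, 15.0                        # arbitrary illustrative parameters
values = rng.normal(mean, std, size=1_000_000)

for k in (1, 2, 3):
    within = np.mean(np.abs(values - mean) <= k * std)
    print(f"Within {k} standard deviation(s): {within:.1%}")
# Prints values close to 68.3%, 95.4%, and 99.7%
```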

The number of standard deviations from the mean is also called the “Standard Score”, “sigma” or “z-score”. Get used to those words!

Example

You scored 145 on a test where the mean score is 100. You can see on the bell curve that 145 is 3 standard deviations from the mean, so:

Your test score has a “z-score” of 3.0

It is also possible to calculate directly how many standard deviations 145 is from the mean.

How far is 145 from the mean?

It is 145 - 100 = 45 from the mean.

How many standard deviations is that? The standard deviation is 15, so:

45 / 15 = 3 standard deviations

So to convert a value to a Standard Score (“z-score”):

  • first subtract the mean,
  • then divide by the Standard Deviation

And doing that is called “Standardizing”. In symbols: z = (value - mean) / standard deviation.
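
As a small illustration, here is one way to standardize values in Python; the single-score case reuses the mean of 100 and standard deviation of 15 from the example above, and the array of scores is made up.

```python
import numpy as np

# Standardizing a single value with a known mean and standard deviation
mean, std = 100, 15          # values from the test-score example above
score = 145
z = (score - mean) / std
print(z)                     # 3.0

# Standardizing a whole set of (made-up) scores against their own mean and std
scores = np.array([86., 94., 70., 100., 78., 92.])
z_scores = (scores - scores.mean()) / scores.std()
print(z_scores.round(2))
```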

Why Standardize … ?

It can help us make decisions about our data. It also makes life easier because we only need one table (the Standard Normal Distribution Table), rather than doing calculations individually for each value of mean and standard deviation.

For Example

Here is the Standard Normal Distribution with percentages for every half of a standard deviation, and cumulative percentages:

Example: Your score in a recent test was 0.5 standard deviations above the average. How many people scored lower than you did?

  • Between 0 and 0.5 is 19.1%
  • Less than 0 is 50% (left half of the curve)

So the total less than you is:

50% + 19.1% = 69.1%

In theory 69.1% scored less than you did (but with real data the percentage may be different)
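
If you prefer to skip the table, SciPy's standard normal CDF gives the same answer:

```python
from scipy.stats import norm

# Proportion of a standard normal distribution below z = 0.5
print(f"{norm.cdf(0.5):.1%}")  # about 69.1%, matching the table-based answer
```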

2. Poisson Distribution: Modeling Count Data

The Poisson distribution is one of the most widely used probability distributions in statistics. It is used to model the number of times an event occurs in a fixed interval of time or space when the occurrences are independent and random. Some key characteristics of the Poisson distribution include:

  • The variable being modeled can only take non-negative integer values (0, 1, 2, etc.), as it represents counts of discrete events.
  • It is defined by one parameter, usually denoted by λ, which represents the expected number of occurrences in the given interval.
  • As λ increases, the distribution shifts to the right, with higher counts becoming more probable.
  • For small values of λ, the distribution is strongly right-skewed, with most of the probability mass near zero. For large λ, it approximates the normal distribution.
  • Events modeled by the Poisson must be independent — the occurrence of one event does not affect the probability of another. They must also occur at a constant average rate.

The Poisson distribution has many applications. It is commonly used to model counts like the number of phone calls to a call center per hour or the number of times a word appears in a document. The waiting times between such events follow the closely related exponential distribution.

In statistics, the Poisson arises as the limit of the binomial distribution when the number of trials grows large while the probability of success shrinks, with the expected number of successes held fixed. It plays an important role in probability theory.

In machine learning, the Poisson distribution is frequently used for predictive modeling of count data via Poisson regression. It also underlies techniques like topic modeling with LDA and recommendation systems. Understanding the properties of this distribution is valuable for working with count and event-based data.

For example,

Suppose the number of customers arriving at a fast-food restaurant during each 30-minute period follows a Poisson distribution with an average of 12 arrivals per half hour.

Although the average is 12 customers, the actual number could be any non-negative integer value.

The arrival of each customer is independent — one customer arriving does not affect the chances of another arriving.

The arrival rate can be assumed constant during each half-hour period: the probability of an arrival in the first 15 minutes is the same as in the last 15 minutes.

Of course, the restaurant cannot physically accommodate an infinite number of customers in 30 minutes. But practically, the Poisson distribution provides a good approximation of the random customer arrival process during discrete time periods.

It models the situation’s probabilistic behavior well despite some differences from theoretical assumptions, making it an appropriate choice to analyze questions like the likelihood of more than 15 arrivals in a given half hour.
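
For instance, a short SciPy sketch can answer that question directly, assuming arrivals really do follow a Poisson distribution with λ = 12:

```python
from scipy.stats import poisson

lam = 12  # average arrivals per half hour

# P(X > 15) = 1 - P(X <= 15), i.e. the chance of more than 15 arrivals
p_more_than_15 = poisson.sf(15, mu=lam)
print(f"P(more than 15 arrivals) = {p_more_than_15:.3f}")  # roughly 0.16
```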

Given that a situation follows a Poisson distribution, there is a formula for the probability of observing exactly k events in the interval, for any non-negative integer k:

P(X = k) = (λ^k · e^(−λ)) / k!

Where:

  • e is Euler’s number (e ≈ 2.71828)
  • k is the number of occurrences
  • k! is the factorial of k
  • λ is the average number of events in the interval
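
As a sanity check, the formula can be implemented directly and compared against SciPy's built-in Poisson distribution; the values of λ and k below are just illustrative.

```python
import math
from scipy.stats import poisson

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam, straight from the formula."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

lam = 12
for k in (8, 12, 15):
    print(k, round(poisson_pmf(k, lam), 4), round(poisson.pmf(k, mu=lam), 4))
# Both columns agree: the hand-written formula matches SciPy's implementation.
```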

What Makes a Probability Distribution Legitimate?

For a probability distribution to accurately model real-world phenomena, it must satisfy certain criteria that lend it legitimacy. Here are some key factors that determine whether a given distribution is appropriate for a particular application or dataset.

Empirical Fit — The distribution should provide a good fit to empirical frequency data from samples or observations. Quantile-quantile plots and goodness-of-fit tests assess how well theoretical quantiles match the sample. A poor fit calls the model into question.
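
As a rough sketch of such a check, the code below fits a normal distribution to simulated data and runs a Kolmogorov–Smirnov test with SciPy (strictly speaking, estimating the parameters from the same sample makes the test only approximate).

```python
import numpy as np
from scipy import stats

# Simulated stand-in for observed data (illustrative only)
rng = np.random.default_rng(7)
sample = rng.normal(loc=3.5, scale=1.2, size=200)

# Fit a normal distribution, then test how well the sample matches it
mu, sigma = stats.norm.fit(sample)
statistic, p_value = stats.kstest(sample, "norm", args=(mu, sigma))
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3f}")
# A very small p-value would be evidence against the fitted normal model.
```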

Independence — For many distributions, the probability of each outcome must be independent of past outcomes. Dependencies between observations violate this assumption, and independence can be assessed statistically.

Stationarity — The process generating the data should be time-invariant, such that the distribution would not change over repeated observations. Non-stationarities may require more complex models.

Sample Size — Distributions describe populations, so you need a sufficiently large sample to estimate one reliably. Small samples hamper claims of legitimacy.

Theoretical Basis — The distribution’s assumptions about the data-generating process should match empirical realities. For example, normally distributed variables often arise from the summation of small, independent factors.

Prior Knowledge — Existing domain expertise and theories can lend credibility or raise doubts about a distribution’s ability to characterize a process. Legitimacy comes from agreement between model and mechanisms.
