A Tour to Connect Popular Statistical Distributions…

Aniruddha Mitra
Analytics Vidhya
Dec 29, 2019

Assumption check: You know what probability is. You’ve come across a few distributions.

Though the mathematics we do around these distributions is fairly straightforward, sometimes you pause to scratch your head and ask where on earth these equations come from (remember the awful equation of the normal distribution?). Most of the standard books I've come across give a modular picture of each distribution. So here is a small effort to connect them through the path of history, so that you can build better intuition about them.

Ideally, before jumping into the depths of distributions, you need to confirm your check-in with an understanding of Random Experiment, Outcome, Event, Sample Space, basic probability, etc. I hope you'll allow me to skip that, to keep our tour short enough not to fatigue you. But one line that I'll be reiterating throughout this article is that a probability distribution means you distribute the total probability, i.e. 1 (some outcome we're surely going to get), over all the possible outcomes. Now, treating all the outcomes as equally likely is far from reality. So, which outcome gets how much probability is expressed by a mathematical equation. That equation is what we call a distribution. We aim to dive deep to get a feel for how these equations came into the picture.

The simplest one, which set the track for the other, more complicated distributions, is the 'Bernoulli Distribution'.

Let me explain:

  1. Random Experiment: One trial
  2. Random variable(X): Outcome of the experiment.
  3. Possible Outcomes: {Success, Failure} ~{1,0}
  4. Probability Distribution over possible outcomes: {P(Success), P(Failure)} or {P(X=1), P(X=0)} or {P(X=1), 1-P(X=1)}
  5. Parameter: This is something we define before constructing the distribution. Say the probability of success in each independent trial is p (0 < p < 1).
  6. Re-writing ‘Point 4’: Probability Distribution is {p,1-p}
  7. Rewriting 'Point 6' as a mathematical equation:
    P(X=x) = p^x * (1-p)^(1-x), x ∈ {0, 1}
  8. Demystifying 'Point 7': P(Success) = P(X=1) = …put x = 1 in the equation… = p; similarly, P(Failure) = P(X=0) = 1-p
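To make the equation concrete, here is a minimal sketch, assuming a Python setup with scipy available; the batsman framing and the value of p are purely illustrative:

    from scipy.stats import bernoulli

    p = 0.1                 # assumed probability that the batsman gets bowled out
    X = bernoulli(p)

    print(X.pmf(1))         # P(Success) = p^1 * (1-p)^0 = 0.1
    print(X.pmf(0))         # P(Failure) = p^0 * (1-p)^1 = 0.9
    print(X.rvs(size=10, random_state=0))   # ten simulated independent trials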

Looks silly, right? Actually, it's not. Some real-world phenomena, like whether a batsman is going to get bowled out, can very well be modeled with this. More than that, as we unfold the story, we'll see how complicated real-world distributions stem from this alone.
Now, the underlying theory being so straightforward, why does the name sound so heavy? It was named after the famous Swiss mathematician Jacob Bernoulli, who made some groundbreaking contributions in calculus, probability, etc.

From the Bernoulli Distribution we get the next one, probably the most used for discrete variables: the Binomial Distribution.

Again, let me set the context.

  1. Random experiment: n independent Bernoulli trials
  2. Random Variable (X): Number of success(es)
  3. Possible outcomes(x):{0,1,2,…,n}
  4. Probability Distribution over possible outcomes: P(X=x), x={0,1,..,n}
  5. Parameter: Number of Bernoulli trials i.e. n
    & Parameter of Bernoulli trials i.e. p
  6. Mathematical function: This is a witty mixture of the Bernoulli function with the number of possible ways (combinations) to get x successes.
  7. Omitting the math: …
  8. Final distribution:
    P(X=x) = C(n,x) * p^x * (1-p)^(n-x), x ∈ {0, 1, …, n}, where C(n,x) = n!/(x!(n-x)!)
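As a quick sanity check, here is a small sketch in the same spirit, again assuming Python with scipy; n and p are arbitrary illustrative values, and note how the probabilities over all possible outcomes add up to the total probability of 1:

    from scipy.stats import binom

    n, p = 60, 0.05         # e.g. 60 deliveries, an assumed 5% chance of a wicket each
    X = binom(n, p)

    probs = [X.pmf(x) for x in range(n + 1)]
    print(sum(probs))       # ~1.0: total probability distributed over {0, 1, ..., n}
    print(X.pmf(3))         # P(exactly 3 wickets in 60 deliveries)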

Now, the word 'Bi-nomial' sounds so mathematical, doesn't it? When you expand the above equation, it closely resembles the binomial expansion that you did in your +2-level maths. Nothing more than that.

Applications of the Binomial Distribution are ubiquitous. For example, in the previous context, the number of wickets that a bowler would get out of 60 independent deliveries.

Ready for a small tweak to the Binomial? In the Binomial, we performed n 'independent' Bernoulli trials. To make the scene more realistic, these trials may not be independent. (Of course, it's very rare for a bowler to get consecutive wickets.) Thus a small change to the binomial equation generates a new distribution, the Hypergeometric distribution. It's very much in use for acceptance sampling in product auditing, as sketched below.
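Here is a minimal acceptance-sampling sketch, assuming Python with scipy; the lot size, defect count, and sample size are made-up numbers:

    from scipy.stats import hypergeom

    M, K, N = 100, 10, 15   # lot of 100 items, 10 defective, inspect 15 without replacement
    X = hypergeom(M, K, N)  # number of defectives found in the sample

    print(X.pmf(0))         # P(the sample contains no defectives)
    print(1 - X.cdf(2))     # P(more than 2 defectives), e.g. a rejection criterion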

From n Bernoulli trials we get other distributions:

  1. Random Variable: Number of trials until you get the r-th success: Negative Binomial Distribution.
  2. Random Variable: Set r = 1 in point 1: Geometric Distribution.

Negative Binomial: P(X=x) = C(x-1, r-1) * p^r * (1-p)^(x-r), x ∈ {r, r+1, …}

The reason behind the names is the same kind of mathematical resemblance.
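A small sketch to see the r = 1 link, assuming Python with scipy; note that scipy's nbinom counts failures before the r-th success, while geom counts total trials, so the supports are shifted by r:

    from scipy.stats import geom, nbinom

    p, r = 0.3, 1
    for trials in range(1, 6):
        # geom: P(first success on this trial); nbinom: failures before the r-th success
        print(geom.pmf(trials, p), nbinom.pmf(trials - r, r, p))  # the two columns match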

Special Case: When n is very high (~ ∞) and p is very low (~ 0)

Say you're auditing a book to find spelling mistakes. This can be modeled as a Binomial random experiment, where you conduct n (the number of words in the book) independent Bernoulli trials, i.e. you check n words independently with p = P(Success) = P(a word is misspelled). Reasonably, n is of the order of thousands and p is tiny. In the pre-computer era, in the absence of computing advantages, mathematicians had to try their skills to calculate the Binomial probability. So, taking the limit of the Binomial distribution, they came up with a new distribution, the Poisson Distribution. I hope you can see the Bernoulli-Binomial-Poisson connection here.

We'll not be going into the limit theorem now, but it's amazing to see that all major count events (the number of accidents on a road over a given duration, the number of wars, the number of goals in a soccer match, etc.) follow it:

P(X=x) = e^(-λ) * λ^x / x!, x ∈ {0, 1, 2, …}

Note: λ becomes the parameter, which signifies the 'rate' of the event, i.e. number of events per unit time (or space).
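Here is a minimal sketch of that approximation, assuming Python with scipy; n and p are arbitrary choices in the spelling-mistake spirit, with λ = n*p:

    from scipy.stats import binom, poisson

    n, p = 10_000, 0.0003   # e.g. 10,000 words, a tiny chance each one is misspelled
    lam = n * p             # the Poisson rate parameter, λ = np = 3

    for x in range(6):
        print(x, binom.pmf(x, n, p), poisson.pmf(x, lam))  # nearly identical columns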

Why does the same distribution describe such different random variables? The answer is that the underlying physical conditions that produce those sets of measurements are much the same, despite how superficially different the resulting data may seem. Both phenomena are examples of a set of mathematical assumptions known as the Poisson model.

This above-mentioned limit trick was performed for the first time by the eminent French mathematician Siméon Denis Poisson (1781–1840). Thus the name follows…

Poisson gives birth to many 'so-called complicated' distributions, as we'll see.

Once the above-mentioned Poisson (count) events are taking place, we are so far restricted to modeling only the probability of the number of occurrences. As events take place sequentially, following the Poisson distribution, scientists were quick to characterize the 'waiting time' between such events.

The interval length between consecutive Poisson events…

Suppose a series of events satisfying the Poisson model is occurring at the rate of λ per unit time. Let the random variable Y denote the interval between consecutive events. Note that we're in continuous space now: Y can be anything in [0, ∞). So now we'll make an equation that spits out the probability of Y lying within a given range. Jumping directly to the conclusion, what we get is

P(Y ≤ y) = 1 - e^(-λy), i.e. the density f(y) = λ * e^(-λy), y ≥ 0

And? You just made Exponential Distribution!!!
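To watch the whole Bernoulli-Poisson-Exponential chain emerge, here is a simulation sketch, assuming Python with numpy; the rate λ and the step size are arbitrary illustrative choices. Chop time into tiny slices, run a Bernoulli trial in each, and inspect the gaps between successes:

    import numpy as np

    rng = np.random.default_rng(0)
    lam, dt = 2.0, 0.001                      # rate λ = 2 events per unit time, tiny step
    hits = rng.random(2_000_000) < lam * dt   # one near-Bernoulli trial per tiny interval
    times = np.flatnonzero(hits) * dt         # event times of this near-Poisson process
    gaps = np.diff(times)                     # waiting times between consecutive events

    print(gaps.mean())           # ~ 1/λ = 0.5
    print((gaps <= 1.0).mean())  # ~ P(Y <= 1) = 1 - e^(-λ) ≈ 0.865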

To generalize the Exponential distribution further, if we take the random variable to be the waiting time for the r-th Poisson event, we get another distribution, known as the Gamma. Let's skip the math part for the time being; a small simulation sketch follows.
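A quick check of that claim, assuming Python with numpy and scipy; r and λ are arbitrary: the sum of r independent Exponential waiting times should match a Gamma with shape r:

    import numpy as np
    from scipy.stats import gamma

    rng = np.random.default_rng(0)
    lam, r = 2.0, 3                           # rate λ, waiting for the 3rd Poisson event
    waits = rng.exponential(1 / lam, size=(100_000, r)).sum(axis=1)  # sum of r gaps

    print(waits.mean(), gamma(a=r, scale=1 / lam).mean())  # both ~ r/λ = 1.5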

Hence, Bernoulli — Binomial — (Negative Binomial, Geometric) — Poisson — Exponential — Gamma

What if p ~ 0.5?

Remember, all the post-Binomial distributions were built on the condition that p for each individual trial tends to 0. Obviously, the very next question that comes to mind: what happens when n is still close to ∞, but p is not around zero?

In the early 18th century, Abraham De Moivre proved that for a high value of n and p ~ 0.5, the Binomial probability (number of successes: {0, 1, …, n}) can be estimated by the area under the 'bell-shaped' curve written as

f(x) = (1/sqrt(2π)) * exp(-x²/2)

This particular equation was of little use, as it restricts the value of p to around 0.5 only. However, it was going to be the starting point of the most popular distribution. Any guess?? Yes, the Normal Distribution.

The French mathematician Pierre-Simon Laplace generalized De Moivre's original idea to binomial approximations for arbitrary p and brought this theorem to the full attention of the mathematical community by including it in his influential 1812 book, Théorie Analytique des Probabilités.

The math is a bit cumbersome. However, the point is a simple one: for very large n and any value of p, the required Binomial probability can be estimated by the area under the curve above, after standardizing with

z = (X - np) / sqrt(np(1-p))

This is the Normal Distribution!!!
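A quick sketch of the approximation, assuming Python with scipy; n, p, and the interval are arbitrary, and the ±0.5 is the usual continuity correction:

    from math import sqrt
    from scipy.stats import binom, norm

    n, p = 1_000, 0.4
    mu, sd = n * p, sqrt(n * p * (1 - p))     # np and sqrt(np(1-p))

    # P(380 <= X <= 420), exactly and via the z-standardized bell curve
    exact = binom.cdf(420, n, p) - binom.cdf(379, n, p)
    approx = norm.cdf((420.5 - mu) / sd) - norm.cdf((379.5 - mu) / sd)
    print(exact, approx)                      # the two numbers nearly coincide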

As this is a continuous curve, it perfectly models many real-world phenomena. If you've come across the 'Central Limit Theorem', you'd know that many other distributions converge to the Normal after suitable centering and scaling.

However, you may note that we get some other high-end distributions, namely 1. the Chi-Square distribution, 2. the t-distribution, and 3. the F-distribution, from the Normal distribution itself.
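For a taste of that, here is a minimal sketch assuming Python with numpy and scipy; k is an arbitrary degrees-of-freedom choice. The sum of k squared standard Normals behaves like a Chi-Square with k degrees of freedom:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    k = 4                                                  # degrees of freedom
    samples = (rng.standard_normal((100_000, k)) ** 2).sum(axis=1)

    print(samples.mean(), chi2(df=k).mean())               # both ~ k
    print(samples.var(), chi2(df=k).var())                 # both ~ 2k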

To summarise the thousand words in one picture…

Please feel free to reach out to me regarding this or any interesting Statistics or Machine Learning problem.
Email: aniruddha.mitra.am@gmail.com
LinkedIn : https://www.linkedin.com/in/aniruddhamitra/
