Statistics & Probability — Probability Distribution

The Normal & Binomial Distribution

Published in OmarElgabry's Blog, Feb 24, 2019

This series of articles is inspired by the Statistics with R Specialization from Duke University. The full series of articles can be found here.

Normal Distribution

Properties of normal distribution

It’s unimodal and symmetric (usually called a bell curve).
About 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean, respectively.
The normal distribution has two parameters: mean and standard deviation.
The curve may be narrow or wide, depending on the standard deviation.
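The 68%, 95%, 99.7% figures can be checked numerically. Here is a minimal sketch in Python using `NormalDist` from the standard library (the series itself uses R; the helper name `share_within` is made up for illustration):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```

The result does not depend on the particular mean and standard deviation, since the ranges are expressed in units of standard deviation.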

Standardize Scores (Z scores)

The Z score is the number of standard deviations an observation lies away from (above or below) the mean. By definition, the Z score of the mean is 0.

Unusual observations (outliers) are usually considered to be those 2 or more standard deviations above or below the mean.

The Z score can be calculated for any kind of distribution, not only the normal distribution.

(~) Which of two applicants scored better, given that they took different exams? Pam earned an 1800 on her SAT (mean = 1500, sd = 300), and Jim scored a 24 on his ACT (mean = 21, sd = 5).

We can’t compare the raw scores since the tests are on different scales. We can instead compare how many standard deviations above the mean each score is. In other words, their Z scores.

So, Z = (score - mean) / sd. Pam = (1800 - 1500) / 300 = 1, and Jim = (24 - 21) / 5 = 0.6. This means Pam has the better score, since hers is further above the mean.
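The same comparison as a small Python sketch (the function and variable names are made up for illustration). The larger Z score belongs to the score that is further above its own exam's mean:

```python
def z_score(observation: float, mean: float, sd: float) -> float:
    """Number of standard deviations the observation lies above (+) or below (-) the mean."""
    return (observation - mean) / sd

pam_z = z_score(1800, mean=1500, sd=300)  # SAT
jim_z = z_score(24, mean=21, sd=5)        # ACT

print(f"Pam: {pam_z:.2f}, Jim: {jim_z:.2f}")
print("Pam did better" if pam_z > jim_z else "Jim did better")
```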

Percentiles

Z scores can also be used to calculate percentiles, but only when the distribution is normal.

A percentile is the area below the probability distribution curve to the left of that observation (source).

The percentile is the rank of an observation given the data in order. In other words, it’s the percentage of all observations that fall below a given data point.

For example, a student’s score at the 40th percentile is higher than 40% of all the scores.

For a normal distribution, there is a one-to-one correspondence between Z score and percentile. Both describe where an observation sits relative to the rest of the distribution.

How do we calculate a percentile? Using R, an online calculator, or a Z table.
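As an alternative to R or a table, here is a hedged sketch in Python. It assumes the data are normal, and the helper names are made up for illustration; `NormalDist.cdf` and `NormalDist.inv_cdf` are from the standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sd 1

def percentile_from_z(z: float) -> float:
    """Area under the standard normal curve to the left of z."""
    return std_normal.cdf(z)

def z_from_percentile(percentile: float) -> float:
    """Inverse lookup: the Z score that has the given area to its left."""
    return std_normal.inv_cdf(percentile)

print(f"Z = 1.0 -> percentile {percentile_from_z(1.0):.4f}")
print(f"Z = 0.6 -> percentile {percentile_from_z(0.6):.4f}")
print(f"percentile 0.40 -> Z = {z_from_percentile(0.40):.2f}")
```

Because the mapping works in both directions, each Z score corresponds to exactly one percentile.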

Evaluating the Normal Distribution

How can we determine if the data plotted is normally distributed or not?

In a normal probability plot (Q-Q plot), the data are plotted on the y-axis and the quantiles of a normal distribution on the x-axis.
If the points follow a straight line, the data are approximately normally distributed.

Quantiles are cut points dividing the range of a distribution into intervals.

On the scatter plot: the closer the points are to the straight line, the closer the data are to a normal distribution.

On a histogram: the data should look unimodal and symmetric.
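The straight-line idea can be sketched without drawing anything: pair the sorted data with normal quantiles and measure how linear the relationship is. This is only an illustration. Both samples are made up, the plotting positions (i + 0.5) / n are one common convention among several, and the correlation is a rough summary rather than a formal normality test.

```python
from math import sqrt
from statistics import NormalDist, mean

def qq_correlation(data):
    """Pearson correlation between the sorted data and standard normal quantiles."""
    ys = sorted(data)
    n = len(ys)
    # Theoretical normal quantiles at plotting positions (i + 0.5) / n.
    xs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical samples, made up for illustration.
roughly_symmetric = [4.1, 5.0, 5.6, 6.1, 6.5, 6.9, 7.0, 7.4, 7.9, 8.3, 9.0, 9.8]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15, 30, 60]

r_symmetric = qq_correlation(roughly_symmetric)
r_skewed = qq_correlation(right_skewed)
print(f"roughly symmetric sample: r = {r_symmetric:.3f}")
print(f"right-skewed sample:      r = {r_skewed:.3f}")
```

A value close to 1 means the points hug the straight line. The skewed sample bends away from it, which lowers the correlation.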

— Using 68%, 95%, 99.7% to assess normality

We can also use the “68%, 95%, 99.7%” rule to assess normality.

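(~) The number of hours college students sleep has a mean of 7. The ranges below are 1, 2, and 3 hours either side of the mean, so the standard deviation is taken to be 1 hour.

Given that:
- 62% of students sleep between 6 and 8 hours
- 92% of students sleep between 5 and 9 hours
- 95% of students sleep between 4 and 10 hours

If the data were normal, we would expect 68%, 95%, and 99.7% of students in these ranges. For example, we would expect 99.7% of the data between 4 and 10 hours, but only 95% falls there. Since lower percentages than expected fall within these ranges, the data are more spread out (more variable; the curve is wider).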
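The comparison can be written out as a short Python sketch. It assumes a standard deviation of 1 hour, which is what the 6 to 8, 5 to 9, and 4 to 10 ranges imply; the variable names are made up for illustration.

```python
from statistics import NormalDist

# Assumption: sd = 1 hour, so the ranges are 1, 2 and 3 sd around the mean of 7.
sleep = NormalDist(mu=7.0, sigma=1.0)

# Observed shares quoted in the example.
observed = {(6, 8): 0.62, (5, 9): 0.92, (4, 10): 0.95}

# Share that a normal distribution would place in each range.
expected = {rng: sleep.cdf(rng[1]) - sleep.cdf(rng[0]) for rng in observed}

for rng, obs in observed.items():
    low, high = rng
    print(f"{low}-{high} hours: observed {obs:.2f}, expected {expected[rng]:.3f}")
```

Every observed share is below the share a normal distribution would give, which is what signals the extra spread.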

Working with the Normal Distribution

(~) The mean weight of airline baggage is 45 pounds, with a standard deviation of 3.2 pounds (assuming the weights are nearly normal). What percent of airline passengers are expected to have baggage in excess of 50 pounds?

Zscore = (50 - 45) / 3.2 = 1.56
percentile = 0.9406 (using R or online calculator)

The 0.9406 is the percentile (percentage) of baggage ≤ 50 pounds. The percentage of baggage > 50 pounds = 1 - 0.9406 = 0.0594, or about 5.9%.
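A minimal Python sketch of the same calculation, assuming baggage weights are nearly normal. The result may differ slightly from a table lookup because the Z score is not rounded to two decimals.

```python
from statistics import NormalDist

# Assumption: baggage weights are approximately normal.
baggage = NormalDist(mu=45.0, sigma=3.2)

z = (50 - baggage.mean) / baggage.stdev
share_at_most_50 = baggage.cdf(50)      # percentile of a 50-pound bag
share_over_50 = 1 - share_at_most_50    # upper tail

print(f"Z score:         {z:.2f}")
print(f"P(weight <= 50): {share_at_most_50:.4f}")
print(f"P(weight > 50):  {share_over_50:.4f}")
```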

(~) Among 3,000 songs, the mean length of songs is 3.45 minutes and the standard deviation is 1.63 minutes. Calculate the probability that a randomly selected song lasts more than 5 minutes.

You might be thinking about the Z score and percentile. But we can’t use them here, since the distribution is not normal. The distribution is right skewed: there are fewer and fewer songs as the length in minutes increases.

(~) The mean temperature is 77, with standard deviation 5. What’s the cutoff temperature for the coldest 20% of days?

So, given the percentile, what’s the corresponding temperature?

Zscore = -0.84 (using R or a Z table, given the 20th percentile)
Temperature (observation) = (Zscore x sd) + mean
                          = (-0.84 x 5) + 77
                          = 72.8
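Going from a percentile back to an observation is the inverse lookup. Here is a hedged Python sketch, assuming temperatures are nearly normal; `inv_cdf` is the standard-library name for the inverse of the cumulative distribution.

```python
from statistics import NormalDist

# Assumption: daily temperatures are approximately normal.
temperature = NormalDist(mu=77.0, sigma=5.0)

# Z score with 20% of the area to its left.
z_20 = NormalDist().inv_cdf(0.20)

# Two equivalent routes to the cutoff temperature.
cutoff_from_z = z_20 * temperature.stdev + temperature.mean
cutoff_direct = temperature.inv_cdf(0.20)

print(f"Z score for the 20th percentile: {z_20:.2f}")
print(f"Cutoff temperature: {cutoff_direct:.1f}")
```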

Binomial Distribution

When each trial of an experiment has only two possible outcomes, the number of successes across the trials is what we call a binomial random variable.

A coin flip can only result in heads or tails.
Eligible voters can either vote or not vote.
A patient can either test positive or negative for a disease.

These can all give binomial random variables, provided we have n trials, each with the same probability of success, called p.

Binomial conditions (characteristics)

The trials are independent.
The number of trials, n, is fixed.
Each trial outcome can be classified as a success or failure.
The probability of a success, p, is the same for each trial.

(~) Participants were asked to play a game, and their success rate was recorded. The result is that about 65% of people would fail, and 35% would pass.

If we randomly select 4 (n) individuals to participate in this experiment, what is the probability that exactly 1 (k) of them will succeed?

Since the trials are independent, we can use the multiplication rule for joint probabilities.

The probability of 1st scenario = 0.35 x 0.65 x 0.65 x 0.65 = 0.0961
The probability of 2nd scenario = 0.65 x 0.35 x 0.65 x 0.65 = 0.0961
… same for each scenario.

Each scenario has exactly one success, and the scenarios are mutually exclusive, so we add them. The probability that exactly 1 person will succeed = 0.0961 + 0.0961 + 0.0961 + 0.0961 = 0.0961 * 4 = 0.3844

Binomial distribution (probability)

What’s the probability of having exactly k successes in n independent binomial trials with probability of success p?

P(k successes) = P(success scenario (SS)) * N of scenarios

where,

P(SS) = (probability of success ^ number of successes) * (probability of failure ^ number of failures)
      = (p^k) * ((1-p) ^ (n-k))

and,

N of scenarios = Number of ways to choose k successes in n trials
               = n! / (k! * (n - k)!)

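In the previous example, we had only 1 success in 4 trials, so the number of scenarios = 4. But what if we have 20 trials with 3 successes? Listing the scenarios by hand is impractical, so we count them with the formula: 20! / (3! * 17!) = 1140.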
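The formula translates almost directly into code. Here is a minimal Python sketch; the function name `binomial_pmf` is made up for illustration, and `math.comb` from the standard library counts the scenarios.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, each with success probability p)."""
    scenarios = comb(n, k)                      # n! / (k! * (n - k)!)
    per_scenario = p ** k * (1 - p) ** (n - k)  # probability of one specific scenario
    return scenarios * per_scenario

print("scenarios, 1 success in 4 trials:   ", comb(4, 1))
print("scenarios, 3 successes in 20 trials:", comb(20, 3))
print("P(exactly 1 of 4 succeeds), p = 0.35:", round(binomial_pmf(1, 4, 0.35), 4))
```

Summing `binomial_pmf(k, n, p)` over every k from 0 to n should give 1, which is a quick sanity check on the function.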

(~) Only 13% of employees are committed at work. Among a random sample of 10 employees, what is the probability that exactly 8 of them are committed at work?

Given that: 
n = 10
k = 8
p(success) = 0.13 
p(failure) = 1 - 0.13 = 0.87.

So, P(k=8) = (0.13)⁸ * (0.87)² * (10! / (8! * (10 - 8)!)) ≈ 0.0000028 (unlikely to occur).

The mean and standard deviation of binomial distribution

(~) Among a random sample of 100 employees, how many would you expect to be engaged (committed) at work?

The mean of binomial distribution (average number of success) = n * p = 100 * 0.13 = 13.

But it doesn’t mean that every sample has exactly 13 committed at work, since it’s a sample taken from a population, and samples vary from one to another.

So how much would we expect this value to vary? As usual, we can quantify the variability using the standard deviation.

The standard deviation of binomial distribution = sqrt(n * p * (1-p)).
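For the employee example (n = 100, p = 0.13), the two formulas can be evaluated with a short Python sketch; the helper name is made up for illustration.

```python
from math import sqrt

def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Mean and standard deviation of the number of successes in n trials."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return mean, sd

mean, sd = binomial_mean_sd(100, 0.13)
print(f"expected committed employees: {mean:.0f}")
print(f"standard deviation:           {sd:.2f}")
```

So a sample of 100 would typically contain about 13 committed employees, give or take roughly 3.4.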

Normal Approximation to Binomial

— Visualize Binomial probability (Histogram)

Say we visualize the binomial probabilities as a histogram, with a probability of success of 0.25 and n = 10.

The x-axis has the possible numbers of successes, k = 0 to 10, so 11 bars.
The y-axis has the probability of each k.

As we increase n, the center of the distribution moves to the right (the mean n * p grows), and the shape becomes more symmetric and closer to a bell curve.
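One way to see the shape change without plotting is to compute the skewness of the distribution for growing n, where a value of 0 means perfectly symmetric. This is a sketch in Python with p = 0.25; the helper names and the choice of n values are made up for illustration.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def skewness(n: int, p: float) -> float:
    """Third standardized moment, computed directly from the bar heights."""
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(probs))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
    third = sum((k - mean) ** 3 * pk for k, pk in enumerate(probs))
    return third / var ** 1.5

for n in (10, 50, 245):
    print(f"n = {n:3d}: mean = {n * 0.25:6.2f}, skewness = {skewness(n, 0.25):.3f}")
```

The mean grows with n while the skewness shrinks toward 0, which matches the histogram becoming more symmetric.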

(~) Say 25% of Facebook users are considered power users, and the average user has 245 friends. What’s the probability that an average user has 70 or more power users among their friends?

Given that:
p = 0.25
n = 245
and assuming all binomial conditions apply here, what’s P(k >= 70)?

We know how to get P(k = 70), but what about P(k >= 70)?

It’s equal to P(k = 70) + P(k = 71) + ... + P(k = 245). But instead we can approximate the binomial distribution with the normal distribution.

— Resembling binomial distribution and the normal distribution

Since the binomial distribution here resembles the normal distribution, the area of interest (k >= 70) under the curve (histogram) can be calculated the same way we did for the normal distribution:

Get mean of binomial distribution = 245 * 0.25 = 61.25
Get sd = sqrt(245 * 0.25 * 0.75) = 6.78
Get Z score = (70 - 61.25) / 6.78 = 1.29 (~= 1.3 sd above the mean)
Get percentile = 0.9015 (area below k = 70)

So, probability of k ≥ 70 = 1 - 0.9015 = 0.0985. There is about a 9.85% chance that an average user will have 70 or more power users among their friends.

There is a minor difference when compared to the exact binomial result from R. A suggested solution is a continuity correction: subtract 0.5 from 70 (use 69.5) in the calculations.
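To see the size of that difference, here is a hedged Python sketch that computes the exact binomial sum alongside the normal approximation, with and without subtracting 0.5 (often called a continuity correction). The variable names are made up for illustration, and the sd is left unrounded.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 245, 0.25, 70

# Exact: P(k = 70) + P(k = 71) + ... + P(k = 245).
exact = sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Normal curve with the binomial mean and standard deviation.
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
plain = 1 - approx.cdf(k)            # area to the right of 70
corrected = 1 - approx.cdf(k - 0.5)  # area to the right of 69.5

print(f"exact binomial sum:        {exact:.4f}")
print(f"normal approximation:      {plain:.4f}")
print(f"with 0.5 subtracted (69.5): {corrected:.4f}")
```

Starting the area at 69.5 includes the whole bar for k = 70 rather than only half of it, which is why that version lands closer to the exact sum.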

— Approximate binomial and normal distribution

We can visually confirm that the binomial distribution looks unimodal and symmetric, roughly similar to a normal distribution.

But what are some guidelines to determine whether the sample size (number of trials) is large enough that we can be confident in approximating the binomial distribution with the normal?

The success-failure rule: n*p >= 10 AND n*(1-p) >= 10.

The larger the sample size, the better.

If so, we can approximate the binomial distribution with a normal distribution that has mean n*p and sd sqrt(n*p*(1-p)), using the steps and rules we discussed.
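The success-failure rule is easy to encode as a check. Here is a minimal Python sketch; the function name is made up for illustration, and the threshold of 10 is the one from the rule above.

```python
def success_failure_ok(n: int, p: float, threshold: float = 10) -> bool:
    """True when both the expected successes and expected failures reach the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold

for n, p in [(10, 0.25), (245, 0.25), (100, 0.56)]:
    verdict = "ok" if success_failure_ok(n, p) else "too small"
    print(f"n = {n:3d}, p = {p}: {verdict}")
```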

Working with the Binomial Distribution

(~) 56% of Americans plan to get health insurance. What is the probability that in a random sample of 10 people exactly 6 plan to get it?

P(k=6) = 0.243 (using an online calculator)

It means there is a 24.3% chance that in a random sample of 10 people exactly 6 plan to get insurance.

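(~) What is the probability that in a random sample of 1000 people, exactly 600 plan to get insurance?

We multiplied n and k by 100. If we recall from “Visualize Binomial probability”, the center of the distribution moves with n: the mean is now 1000 * 0.56 = 560.

The probability is also spread over many more possible values of k (0 to 1000 instead of 0 to 10), so the height of each individual bar decreases. In addition, 600 is about 2.5 standard deviations above the mean, whereas 6 out of 10 was only about 0.25 standard deviations above it.

So, the answer is definitely less than 0.243.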
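The claim can be checked by computing both probabilities exactly. This is a sketch in Python, with a made-up helper name; for much larger n the intermediate numbers could overflow a float, and a log-based version would be safer.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_6_of_10 = binomial_pmf(6, 10, 0.56)
p_600_of_1000 = binomial_pmf(600, 1000, 0.56)

print(f"P(exactly 6 of 10):      {p_6_of_10:.3f}")
print(f"P(exactly 600 of 1000):  {p_600_of_1000:.5f}")
```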

(~) Describe the probability distribution of Americans who plan to get insurance among a random sample of 100.

First, can it be approximated by a normal distribution?

Success = 100 * 0.56 = 56 >= 10
Failure = 100 * 0.44 = 44 >= 10

So, the shape of the distribution will be nearly normal.

It has two parameters: mean and sd. The mean = 100 * 0.56 = 56, and sd = sqrt(100 * 0.56 * 0.44) = 4.96.

(~) What is the probability that at least 60 out of a random sample of 100 uninsured Americans plan to get insurance? That is, P(k >= 60).

Zscore = (60 - 56) / 4.96 = 0.81 (or use 59.5 for better accuracy).
percentile = 0.7910. So, P(k ≥ 60) = 1 - 0.7910 = 0.209.

So, there is about a 21% chance that 60 or more out of 100 will plan to get insurance.
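The whole procedure (mean, sd, Z score, upper tail) can be wrapped in one reusable function. This is a hedged Python sketch with made-up names; the `continuity` flag switches between using 60 and 59.5, and the exact binomial sum is included only for comparison.

```python
from math import comb, sqrt
from statistics import NormalDist

def normal_upper_tail(k: int, n: int, p: float, continuity: bool = False) -> float:
    """Normal approximation to P(at least k successes) in n binomial trials."""
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    cutoff = k - 0.5 if continuity else k
    return 1 - approx.cdf(cutoff)

def exact_upper_tail(k: int, n: int, p: float) -> float:
    """Exact binomial P(at least k successes), summed term by term."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

plain = normal_upper_tail(60, 100, 0.56)
corrected = normal_upper_tail(60, 100, 0.56, continuity=True)
exact = exact_upper_tail(60, 100, 0.56)

print(f"normal approximation (60):   {plain:.3f}")
print(f"normal approximation (59.5): {corrected:.3f}")
print(f"exact binomial sum:          {exact:.3f}")
```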
