The Bernoulli and Binomial Distributions

Maryam Raji
Published in Analytics Vidhya · Jan 24, 2020 · 11 min read

Probability Distributions — Binomial Distributions

The probabilities of a discrete random variable's outcomes can be summarized with a discrete probability distribution.

Discrete probability distributions are used throughout machine learning: most notably in modeling binary and multi-class classification problems, but also in evaluating the performance of binary classification models (for example, when calculating confidence intervals) and in modeling the distribution of words in text for natural language processing.

Knowledge of discrete probability distributions is also required when choosing activation functions in the output layer of deep learning neural networks for classification tasks and when selecting an appropriate loss function.

Discrete probability distributions play an important role in applied machine learning and there are a few distributions that a practitioner must know about.

In this tutorial, you will discover two discrete probability distributions used in machine learning: the Bernoulli and Binomial distributions.

Let’s get started.

Random Variables

A random variable is a quantity produced by a random process.

A discrete random variable is a random variable that can have one of a finite set of specific outcomes. The two types of discrete random variables most commonly used in machine learning are binary and categorical.

Binary Random Variable: x in {0, 1}
Categorical Random Variable: x in {1, 2, …, K}

A binary random variable is a discrete random variable where the finite set of outcomes is {0, 1}. A categorical random variable is a discrete random variable where the finite set of outcomes is {1, 2, …, K}, where K is the total number of unique outcomes.

Each outcome or event for a discrete random variable has a probability.

The relationship between the events for a discrete random variable and their probabilities is called the discrete probability distribution and is summarized by a probability mass function or PMF for short.

For outcomes that can be ordered, the probability of an event equal to or less than a given value is defined by the cumulative distribution function or CDF for short.

The inverse of the CDF is called the percent-point function: given a probability, it returns the smallest discrete outcome whose cumulative probability is greater than or equal to that probability.

PMF: Probability Mass Function, returns the probability of a given outcome.
CDF: Cumulative Distribution Function, returns the probability of a value less than or equal to a given outcome.
PPF: Percent-Point Function, returns the smallest outcome whose cumulative probability is greater than or equal to the given probability.
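
As a quick sketch of how these three functions relate (using scipy.stats.binom, the Binomial distribution covered later in this post, with illustrative values n=10, p=0.5):

from scipy.stats import binom

n, p = 10, 0.5
print(binom.pmf(5, n, p))     # PMF: P(X = 5), about 0.246
print(binom.cdf(5, n, p))     # CDF: P(X <= 5), about 0.623
print(binom.ppf(0.62, n, p))  # PPF: smallest k with CDF(k) >= 0.62 -> 5.0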

There are many common discrete probability distributions.

The most common are the Bernoulli and Multinoulli distributions for binary and categorical discrete random variables respectively, and the Binomial and Multinomial distributions that generalize each to multiple independent trials.

In [3]:

# for inline plots in jupyter
%matplotlib inline
# import matplotlib
import matplotlib.pyplot as plt
# for latex equations
from IPython.display import Math, Latex
# for displaying images
from IPython.core.display import Image

In [4]:

# import seaborn
import seaborn as sns
# settings for seaborn plotting style
sns.set(color_codes=True)
# settings for seaborn plot sizes
sns.set(rc={'figure.figsize':(5,5)})

Bernoulli Distribution

The Bernoulli distribution is a discrete probability distribution that covers the case of a single event with a binary outcome: either 0 or 1.

x in {0, 1}

A “Bernoulli trial” is an experiment or case where the outcome follows a Bernoulli distribution. The distribution and the trial are named after the Swiss mathematician Jacob Bernoulli.

Some common examples of Bernoulli trials include:

The single flip of a coin that may have a heads (0) or a tails (1) outcome.
A single birth of either a boy (0) or a girl (1).

A common example of a Bernoulli trial in machine learning is the binary classification of a single example as the first class (0) or the second class (1).

A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial, for example, a coin toss. So the random variable X which has a Bernoulli distribution can take the value 1 with the probability of success, p, and the value 0 with the probability of failure, q = 1 − p. The probabilities of success and failure need not be equally likely. The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (n=1).

The distribution can be summarized by a single variable p that defines the probability of an outcome 1. Given this parameter, the probability for each event can be calculated as follows:

P(x=1) = p
P(x=0) = 1 – p
In the case of flipping a fair coin, the value of p would be 0.5, giving a 50% probability of each outcome.

Its probability mass function is given by:

f(k; p) = p^k * (1 − p)^(1 − k), for k in {0, 1}

which evaluates to p when k = 1 and to 1 − p when k = 0.
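
To make this concrete, here is a small sanity check of the PMF with scipy (p = 0.6 is an illustrative value, not from the original cells):

from scipy.stats import bernoulli

p = 0.6
print(bernoulli.pmf(1, p))  # P(x = 1) = p -> 0.6
print(bernoulli.pmf(0, p))  # P(x = 0) = 1 - p -> 0.4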

You can generate a Bernoulli-distributed discrete random variable using the scipy.stats module's bernoulli.rvs() method, which takes p (probability of success) as a shape parameter. To shift the distribution, use the loc parameter; size decides how many samples to draw. If you want to maintain reproducibility, pass a number to the random_state argument.

In [5]:

from scipy.stats import bernoulli
# draw 10,000 Bernoulli samples with success probability p = 0.6
# (pass random_state=<int> for reproducible draws)
data_bern = bernoulli.rvs(size=10000, p=0.6)

Visualizing the distribution, you can observe that there are only two possible outcomes:

In [6]:

ax = sns.distplot(data_bern,
                  kde=False,
                  color="skyblue",
                  hist_kws={"linewidth": 15, 'alpha': 1})
ax.set(xlabel='Bernoulli Distribution', ylabel='Frequency')

Out[6]:

[Text(0, 0.5, 'Frequency'), Text(0.5, 0, 'Bernoulli Distribution')]

Binomial Distribution

The repetition of multiple independent Bernoulli trials is called a Bernoulli process.

The outcomes of a Bernoulli process will follow a Binomial distribution. As such, the Bernoulli distribution would be a Binomial distribution with a single trial.

The performance of a machine learning algorithm on a binary classification problem can be analyzed as a Bernoulli process, where the prediction by the model on an example from a test set is a Bernoulli trial (correct or incorrect).

The Binomial distribution summarizes the number of successes k in a given number of Bernoulli trials n, with a given probability of success for each trial p.

A distribution where only two outcomes are possible in each trial, such as success or failure, gain or loss, or win or lose, and where the probability of success is the same for every trial, is called a Binomial distribution. Success and failure need not be equally likely, and each trial is independent of the others.

The parameters of a binomial distribution are n and p, where n is the total number of trials and p is the probability of success in each trial. Its probability mass function is given by:

P(X = k) = (n choose k) * p^k * (1 − p)^(n − k), for k = 0, 1, …, n

where (n choose k) = n! / (k! (n − k)!) counts the number of ways to arrange k successes among n trials.
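
A short sketch verifying this formula against scipy's binom.pmf (math.comb is in the Python 3.8+ standard library; n, p, and k are illustrative values):

import math
from scipy.stats import binom

n, p, k = 10, 0.5, 5
manual = math.comb(n, k) * p**k * (1 - p)**(n - k)
print(manual)              # 0.24609375
print(binom.pmf(k, n, p))  # the same value from scipy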

Summary Of Methods for Binomial Distribution

rvs(n, p, loc=0, size=1, random_state=None) ---> Random variates.
pmf(k, n, p, loc=0) ---> Probability mass function.
logpmf(k, n, p, loc=0) ---> Log of the probability mass function.
cdf(k, n, p, loc=0) ---> Cumulative distribution function.
logcdf(k, n, p, loc=0) ---> Log of the cumulative distribution function.
sf(k, n, p, loc=0) ---> Survival function (also defined as 1 - cdf, but sf is sometimes more accurate).
logsf(k, n, p, loc=0) ---> Log of the survival function.
ppf(q, n, p, loc=0) ---> Percent point function (inverse of cdf; percentiles).
isf(q, n, p, loc=0) ---> Inverse survival function (inverse of sf).
stats(n, p, loc=0, moments='mv') ---> Mean('m'), variance('v'), skew('s'), and/or kurtosis('k').
entropy(n, p, loc=0) ---> Entropy of the RV.
expect(func, args=(n, p), loc=0, lb=None, ub=None, conditional=False) ---> Expected value of a function (of one argument) with respect to the distribution.
median(n, p, loc=0) ---> Median of the distribution.
mean(n, p, loc=0) ---> Mean of the distribution.
var(n, p, loc=0) ---> Variance of the distribution.
std(n, p, loc=0) ---> Standard deviation of the distribution.
interval(alpha, n, p, loc=0) ---> Endpoints of the range that contains alpha percent of the distribution.
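
A few of these methods in action, as a quick sketch (n = 10 and p = 0.5 are illustrative values, not an exhaustive tour of the API):

from scipy.stats import binom

n, p = 10, 0.5
print(binom.mean(n, p))    # 5.0
print(binom.var(n, p))     # 2.5
print(binom.std(n, p))     # about 1.581
print(binom.sf(5, n, p))   # P(X > 5) = 1 - cdf(5), about 0.377
print(binom.median(n, p))  # 5.0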

Generating Random Variables for a Binomial Distribution

You can generate a binomially distributed discrete random variable using the scipy.stats module's binom.rvs() method, which takes n (number of trials) and p (probability of success) as shape parameters. To shift the distribution, use the loc parameter; size decides how many experiments to simulate. If you want to maintain reproducibility, pass a number to the random_state argument.

In [7]:

from scipy.stats import binom
# 10,000 experiments of n=10 trials each, with success probability p=0.5
data_binom = binom.rvs(n=10, p=0.5, size=10000)

In [8]:

ax = sns.distplot(data_binom,
                  kde=False,
                  color='skyblue',
                  hist_kws={"linewidth": 15, 'alpha': 1})
ax.set(xlabel='Binomial Distribution', ylabel='Frequency')

Out[8]:

[Text(0, 0.5, 'Frequency'), Text(0.5, 0, 'Binomial Distribution')]

Alternatively, the distribution object can be called (as a function) to fix the shape and location. This returns a “frozen” RV object holding the given parameters fixed.

Freeze the distribution and display the frozen pmf:

In [30]:

import numpy as np

fig, ax = plt.subplots(1, 1)
n = 10
p = 0.5
x = np.arange(0, 11)  # all 11 possible outcomes: 0..10 successes
ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='binom pmf')
ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
# plot the frozen binomial distribution
rv = binom(n, p)
ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1,
          label='frozen pmf')
ax.legend(loc='best', frameon=False)
plt.show()

In [31]:

x

Out[31]:

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

We can calculate the moments of this distribution, specifically the expected value or mean and the variance using the binom.stats() SciPy function.

mean: μ = n * p
variance: σ² = n * p * (1 − p)

For n = 10 and p = 0.5 the mean is 5, so 5 successes out of 10 trials is expected to have the highest probability under the binomial distribution above (the mode coincides with the mean here).

In [32]:

n, p = 10, 0.5
mean, var, skew, kurt = binom.stats(n, p, moments='mvsk')
print('Mean=%.3f,Variance=%.3f'%(mean,var) )
Mean=5.000,Variance=2.500

Check accuracy of cdf and ppf:

In [45]:

prob = binom.cdf(x, n, p)
np.allclose(x, binom.ppf(prob, n, p))

Out[45]:

True

Let's compute the probability of each number of successes manually using the binom.pmf function:

In [46]:

for k in range(11):
    print('P of %d successes in 10 trials: %.3f%%' % (k, binom.pmf(k, n, p) * 100))
P of 0 successes in 10 trials: 0.098%
P of 1 successes in 10 trials: 0.977%
P of 2 successes in 10 trials: 4.395%
P of 3 successes in 10 trials: 11.719%
P of 4 successes in 10 trials: 20.508%
P of 5 successes in 10 trials: 24.609%
P of 6 successes in 10 trials: 20.508%
P of 7 successes in 10 trials: 11.719%
P of 8 successes in 10 trials: 4.395%
P of 9 successes in 10 trials: 0.977%
P of 10 successes in 10 trials: 0.098%

This could be likened to a coin toss trial with n = 10, where success is defined as getting a head. The question we are asking above is "What is the probability of getting k heads in n trials, where k is 0, 1, 2, …, 10?" Note that P(k=0) = P(k=10) for a fair coin, where p = 0.5.
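
We can confirm that symmetry directly with a small check (a sketch; the fair-coin PMF is symmetric, so P(k heads) equals P(n − k heads)):

import numpy as np
from scipy.stats import binom

n, p = 10, 0.5
print(np.isclose(binom.pmf(0, n, p), binom.pmf(10, n, p)))  # True
print(np.isclose(binom.pmf(3, n, p), binom.pmf(7, n, p)))   # True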

In [47]:

for k in range(11):
    print('P of %d success in 10 trials: %.3f%%' % (k, binom.cdf(k, n, p) * 100))
P of 0 success in 10 trials: 0.098%
P of 1 success in 10 trials: 1.074%
P of 2 success in 10 trials: 5.469%
P of 3 success in 10 trials: 17.187%
P of 4 success in 10 trials: 37.695%
P of 5 success in 10 trials: 62.305%
P of 6 success in 10 trials: 82.812%
P of 7 success in 10 trials: 94.531%
P of 8 success in 10 trials: 98.926%
P of 9 success in 10 trials: 99.902%
P of 10 success in 10 trials: 100.000%

Running the code above defines the binomial distribution and calculates the probability for each number of successful outcomes using the binom.pmf function. Remember that x is an array of integers from 0 to 10 inclusive.

We can see that 5 successful outcomes has the highest probability, at about 24.609%.

Given that the probability of success is 50% for one trial, we would expect the probability of 10 or fewer successes out of 10 trials to be close to 100%. We can calculate this with the cumulative distribution function, as demonstrated above.

Repeating the Coin Toss experiment

In the above examples, we tossed a single coin n=10 times; that is our single experiment. To see how reliable our coin toss experiment is, we often want to repeat the experiment multiple times (or, equivalently, throw multiple coins). We can easily simulate multiple experiments with the size option of the numpy.random.binomial function or the scipy.stats.binom.rvs method.

Let us repeat our coin toss experiment 100 times, where in each experiment we toss a fair coin 10 times, and ask how many heads we see in each of the 100 experiments.

In [88]:

n = 10
p = 0.5
# let us repeat our experiment 100 times
size = 100
x = np.random.binomial(n, p, size=size)
x

Out[88]:

array([5, 5, 7, 3, 5, 4, 5, 5, 7, 7, 6, 6, 6, 5, 2, 2, 7, 4, 4, 5, 2, 8,
3, 8, 6, 7, 2, 5, 8, 6, 5, 6, 4, 6, 4, 4, 5, 2, 2, 2, 6, 2, 3, 5,
5, 4, 4, 7, 3, 7, 4, 5, 8, 6, 4, 8, 6, 4, 5, 5, 5, 3, 6, 4, 4, 5,
4, 7, 5, 3, 3, 6, 7, 3, 5, 6, 7, 2, 4, 3, 5, 4, 5, 6, 8, 3, 6, 7,
5, 6, 4, 2, 3, 3, 7, 8, 5, 6, 4, 5])

In [89]:

ax = sns.distplot(x,
                  kde=False,
                  color='skyblue',
                  hist_kws={"linewidth": 15, 'alpha': 1})
ax.set(xlabel='Binomial Distribution', ylabel='Frequency')

Out[89]:

[Text(0, 0.5, 'Frequency'), Text(0.5, 0, 'Binomial Distribution')]

Probability of seeing x heads out of n=10 coin tosses

We started with a simple experiment: tossing a fair coin 10 times. We then repeated the experiment 100 times and measured how many successes/heads we observed. We can use the number of successes (heads) observed in many ways to understand the basics of probability. For example, we can simply count how many times we see 0 heads, 1 head, 2 heads, and so on, with our fair coin tosses (see the counting sketch below).

We can see that, in our 100 experiments, we never saw all heads or all tails with our fair coin (the first and last elements are zero). We also observe 4, 5, or 6 heads more often than the other counts. The success counts sum to 100, our total number of experiments. We can use the observed successes to estimate the probability of getting x successes in n=10 coin tosses by dividing the counts by 100.
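
One convenient way to do that counting, as a sketch (this assumes the x and n defined in the cells above; np.bincount tallies occurrences of each integer value):

import numpy as np

# counts[i] = number of experiments (out of 100) that produced i heads
counts = np.bincount(x, minlength=n + 1)
print(counts)        # e.g. zeros at positions 0 and 10 for this sample
print(counts.sum())  # 100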

In [90]:

# estimate P(x = i): count how many of a fresh batch of 100 experiments
# produced exactly i heads, then divide by the number of experiments
probs_100 = [sum(np.random.binomial(n, p, size=100) == i) / 100 for i in range(n + 1)]
probs_100

Out[90]:

[0.0, 0.0, 0.07, 0.1, 0.22, 0.25, 0.18, 0.12, 0.03, 0.01, 0.0]

Let us plot the probability of x successes that we just computed.

In [91]:

plt.xticks(range(n+1))
plt.plot(list(range(n+1)), probs_100, color='blue', marker='o')
plt.xlabel('Number of Heads',fontsize=14)
plt.ylabel('Probability',fontsize=14)

Out[91]:

Text(0, 0.5, 'Probability')

We can see from the above plot that the probability of seeing 5 heads is the highest. Note that this observation might vary depending on the realization of our random simulation.

100000 repeated experiments

We know that it is a fair coin, so we would expect that if we repeat the experiment many more times, the probability of seeing 5 heads should be the highest. So, let us repeat our coin toss experiment 100,000 times and compute the probability of seeing x heads as above. Also, remember that the observations might vary, as this is a random process.

In [118]:

from ipywidgets import interact, IntSlider

def plot_probab(size=100):
    np.random.seed(42)
    n = 10
    p = 0.5
    # repeat the n-toss experiment `size` times and count the heads in each
    x = np.random.binomial(n=n, p=p, size=size)
    probs = [np.sum(x == i) / size for i in range(n + 1)]
    plt.xticks(range(n + 1))
    plt.plot(list(range(n + 1)), probs, color='blue', marker='o')
    plt.xlabel('Number of Heads', fontsize=14)
    plt.ylabel('Probability', fontsize=14)


size_slider = IntSlider(min=100, max=100000, step=100, value=100, description='size')
interact(plot_probab, size=size_slider);

Now we can see that the probability of seeing 5 heads is the highest, as we expected. What is neat is that even if we don't know whether the coin is fair, we can run repeated experiments like the ones above, observe the number of successes, and infer whether the coin is fair.
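
A sketch of that inference: the natural estimate of p is the average fraction of heads across experiments (n = 10 and 100,000 repetitions are assumed values for illustration):

import numpy as np

n, size = 10, 100000
heads = np.random.binomial(n, 0.5, size=size)  # pretend we don't know p
p_hat = heads.mean() / n                       # estimated probability of heads
print(p_hat)                                   # close to 0.5 for a fair coin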

Tossing Biased Coins

Let us do our experiment with biased coins. Say we have a coin and we suspect it is biased. Let us see what we can infer about how unfair the coin is through repeated experiments like before.

Just as described before, let us toss the unfair coin 10 times, repeat that 100,000 times, and count the number of successes. We then use the number of successes to get the probability of x successes and plot it.

In [119]:

def plot_probab_unfair(size=100):
    np.random.seed(42)
    n = 10
    p = 0.7
    # repeat the n-toss experiment `size` times with the biased coin
    x = np.random.binomial(n=n, p=p, size=size)
    probs = [np.sum(x == i) / size for i in range(n + 1)]
    plt.xticks(range(n + 1))
    plt.plot(list(range(n + 1)), probs, color='blue', marker='o')
    plt.xlabel('Number of Heads', fontsize=14)
    plt.ylabel('Probability', fontsize=14)


size_slider = IntSlider(min=100, max=100000, step=100, value=100, description='size')
interact(plot_probab_unfair,size=size_slider);

We can see from the above plot that the probability of success is highest when the number of successes/heads is 7. Since the mode of a binomial distribution sits near n*p, we can infer that the biased coin has a probability of success of roughly p = 0.7.
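
We can check that inference with the same mean-based estimate as before (a sketch; p = 0.7 is the bias we simulated):

import numpy as np

heads = np.random.binomial(10, 0.7, size=100000)
print(heads.mean() / 10)  # approximately 0.7, recovering the biased coin's p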

Application of the Binomial Distribution (Real-Life Examples)

A company drills 9 wild-cat oil exploration wells, each with an estimated probability of success of 0.1. All nine wells fail. What is the probability of that happening?

Let’s do 20,000 trials of the model, and count the number that generates zero positive results.

In [122]:

sum(np.random.binomial(9, 0.1, 20000) == 0) / 20000.
# simulation estimate; the exact answer is 0.9**9, about 0.387, or roughly 39%

Out[122]:

0.38495
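
For comparison, the exact probability can be computed directly: all nine wells fail with probability (1 − 0.1)^9, which the binomial PMF at k = 0 also gives:

from scipy.stats import binom

print(0.9 ** 9)              # 0.387420489, the exact probability
print(binom.pmf(0, 9, 0.1))  # the same value from the binomial PMF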

REFERENCES

  1. Jason Brownlee's Machine Learning Mastery blog
  2. The scipy.stats documentation
