Distributions: Understanding the statistical tools of Python

Published in EduTorq Community · 4 min read · Jul 31, 2020

Even though the term ‘data science’ gained currency only around 2008, that does not mean the field was nonexistent before then. Statistics has been the breeding ground for the emerging field of data science. The fundamental difference in approach between the two is that statisticians look for data to solve a problem that is already known to them, while data scientists have to formulate the right questions given the bulk of data available to them. Statistics is considered the cornerstone of data science processes, be it data analysis and visualization or the design of machine learning models.

In this article, we will learn about the concept of distributions, look at the binomial distribution in detail, and see the Python tools available for working with it.

Distributions — The concept:

We are familiar with the probability of simple and complex events. Take a simple event such as rolling a die, or a complex one such as the success rates of the COVID-19 drugs on the market. The table of all possible outcomes, together with the number of times each outcome occurs in some sample, is called the distribution table (or frequency distribution) of the event. The relative frequency of an outcome in the sample gives an estimate of the probability of that outcome.

For instance, take the simple experiment of tossing 4 coins. Let X be a random variable (random because it takes numerical values subject to a random phenomenon) whose value is the number of heads in this statistical experiment. The 16 possible outcomes of this experiment are:

TTTT, TTTH, TTHT, THTT, HTTT, TTHH, THHT, HHTT, HTHT, HTTH, THTH, HHHT, HHTH, HTHH, THHH, HHHH

Suppose we repeat the experiment 100 times and record how often each value of X (= 0, 1, 2, 3, 4) occurs. The resulting table of counts is the frequency distribution of X.

Results of a statistical experiment repeated 100 times to check the number of heads each time four coins are tossed at once
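The experiment described above can be reproduced in a few lines. This is only a sketch: the seed and the variable names are illustrative choices, not something from the original experiment, and the counts will differ from any particular recorded run.

```python
import numpy as np

# Seeded generator so the run is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(seed=0)

# Each row is one repetition of the experiment: four fair coins,
# with 1 standing for heads and 0 for tails.
tosses = rng.integers(0, 2, size=(100, 4))

# X = number of heads in each repetition.
heads = tosses.sum(axis=1)

# How many times each value X = 0, 1, 2, 3, 4 occurred in the 100 repetitions.
counts = np.bincount(heads, minlength=5)

for x, count in enumerate(counts):
    print(f"X = {x}: observed {count} times (relative frequency {count / 100:.2f})")
```

Dividing each count by 100 gives the relative frequency, which is the sample's estimate of the probability of that number of heads.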

There are different types of distributions depending upon the shape and the nature of data:

★ Binomial distribution

★ Chi-square distribution

★ Normal distribution

★ Poisson distribution

★ Bernoulli distribution

★ Uniform distribution

★ Exponential distribution

In this blog we will discuss the binomial distribution and how Python makes the calculations easy for us.

Binomial distribution

In this type of distribution, each trial has only two possible outcomes, success and failure, and we want to determine the number of successes in N trials. The random variable, in this case, is discrete and is denoted by X ~ b(N, p), where p is the probability of success in a single trial. The trials are independent of each other in a binomial statistical experiment. The binomial distribution is therefore bi-parametric: its two parameters are N and p. The binomial probability function b(n|N,p) gives the probability of n successes in N trials and is equal to

b(n|N,p) = C(N, n) · p^n · q^(N − n), where C(N, n) = N! / (n! (N − n)!)

q = Probability of failure in a single trial (equal to 1 − p)

The mean/expected value/average = Np, variance = Npq = Np(1 − p)
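To see these quantities with concrete numbers, here is a minimal sketch using only the standard library. It takes q = 1 − p, and the helper name binomial_pmf is made up for this illustration. It uses the four-coin experiment (N = 4, p = 0.5) and checks the mean and variance against Np and Npq.

```python
from math import comb

def binomial_pmf(n, N, p):
    """Probability of exactly n successes in N independent trials."""
    q = 1 - p  # probability of failure in a single trial
    return comb(N, n) * p**n * q ** (N - n)

N, p = 4, 0.5
pmf = [binomial_pmf(n, N, p) for n in range(N + 1)]

# Mean and variance computed directly from the probabilities.
mean = sum(n * prob for n, prob in enumerate(pmf))
variance = sum((n - mean) ** 2 * prob for n, prob in enumerate(pmf))

print(pmf)                            # probabilities for n = 0, 1, 2, 3, 4
print(mean, N * p)                    # the two values should agree
print(variance, N * p * (1 - p))      # the two values should agree
```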

When p != q, the binomial distribution is asymmetric: its peak sits near Np, to the left of centre for p < 0.5 and to the right of centre for p > 0.5.

When p = q = 0.5, the binomial distribution is symmetric about its mean N/2.
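The two shapes can be compared without any plotting library. The sketch below prints a rough text histogram for each case. The values N = 10, p = 0.5 and p = 0.2 are arbitrary illustrations, and pmf_table and show are helper names invented here.

```python
from math import comb

def pmf_table(N, p):
    """Binomial probabilities for n = 0..N successes."""
    return [comb(N, n) * p**n * (1 - p) ** (N - n) for n in range(N + 1)]

def show(title, probs, width=60):
    """Print a crude text histogram: one row of '#' per value of n."""
    print(title)
    for n, prob in enumerate(probs):
        print(f"{n:2d} | {'#' * round(prob * width)}")

symmetric = pmf_table(10, 0.5)  # p = q
skewed = pmf_table(10, 0.2)     # p != q

show("p = 0.5", symmetric)
show("p = 0.2", skewed)
```

With p = 0.5 the bars mirror each other around n = 5, while with p = 0.2 the tallest bar moves towards the low end and a longer tail stretches to the right.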

Python lets us skip these manual calculations and simulate real-world scenarios more simply.

Python on the scene

The NumPy library lets us work with the binomial distribution by drawing random samples from it, that is, by simulating it.

The function signature, with its default values, is:

numpy.random.binomial(n, p, size=None)

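The parameters are as follows:

n: Number of trials; should be an integer >= 0 (floats are accepted but truncated to integers)

p: Probability of success in one trial; must lie in the interval [0, 1]

size: Determines the shape of the output. If size is None (the default) and n and p are both scalars, a single value is returned; if n or p is an array, one sample is drawn per element of their broadcast shape. If size is an integer or a tuple such as (m, k), then that many samples (m * k for the tuple) are drawn.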
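A short sketch of how size interacts with scalar and array arguments. The values n=10 and p=0.5 are arbitrary, and only the shapes matter here, since the drawn numbers change on every run.

```python
import numpy as np

# size=None with scalar n and p: a single draw is returned.
single = np.random.binomial(n=10, p=0.5)

# size as an int: a 1-D array with that many draws.
six = np.random.binomial(n=10, p=0.5, size=6)

# size as a tuple: an array of that shape (2 * 3 = 6 draws in total).
grid = np.random.binomial(n=10, p=0.5, size=(2, 3))

# size=None with an array-valued p: one draw per element of p.
per_p = np.random.binomial(n=10, p=np.array([0.1, 0.5, 0.9]))

print(np.shape(single), six.shape, grid.shape, per_p.shape)
```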

Example:

from numpy import random
random.binomial(n=100, p=0.5, size=6)

Output: [59 55 46 55 54 49]

Example :

Suppose we want to estimate the probability of getting 15 or more heads when flipping a fair coin 20 times. Use np.random.binomial(n, p, size) to run 10000 simulations of flipping a fair coin 20 times, then see what proportion of the simulations give 15 or more heads.

import numpy as np
x = np.random.binomial(20, 0.5, 10000)
print((x >= 15).mean())

Output (varies slightly from run to run): 0.0228
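As a sanity check on the simulation, the same probability can be computed exactly by summing the binomial probabilities for 15 to 20 heads. This is a small standard-library sketch, and the variable names are illustrative.

```python
from math import comb

N, p = 20, 0.5

# P(X >= 15) = sum of b(k | N, p) for k = 15, 16, ..., 20
exact = sum(comb(N, k) * p**k * (1 - p) ** (N - k) for k in range(15, N + 1))

print(round(exact, 4))  # about 0.0207
```

A simulated proportion from 10000 runs should land close to this value, with the gap shrinking as the number of simulations grows.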

Example :

A company drills 9 wild-cat oil exploration wells, each with an estimated probability of success of 0.1. All nine wells fail. What is the probability of that happening?

Let’s do 20,000 trials of the model, and count the number that produce zero successes.

import numpy as np
print(sum(np.random.binomial(9, 0.1, 20000) == 0) / 20000.)

Output (varies slightly from run to run): 0.38885, or about 39%.
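The same experiment can be made repeatable with NumPy's newer Generator interface, and compared with the exact answer, which for zero successes is simply q^N = 0.9^9. This is a sketch: the seed value is arbitrary, and using default_rng instead of np.random.binomial is a choice made here, not part of the example above.

```python
import numpy as np

# A seeded generator gives the same draws on every run (seed chosen arbitrarily).
rng = np.random.default_rng(seed=42)

# 20,000 simulated drilling campaigns of 9 wells each, success probability 0.1.
wells = rng.binomial(n=9, p=0.1, size=20_000)

# Proportion of campaigns in which all nine wells failed.
estimate = np.mean(wells == 0)

# Exact probability of zero successes: every one of the 9 wells fails.
exact = (1 - 0.1) ** 9

print(f"simulated: {estimate:.4f}, exact: {exact:.4f}")
```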

These statistical models play an important role in the design of more complex data science models.

We will learn more about statistical tools provided to us by Python in the future.

Stay tuned. Happy learning:)


Written by Hafsa Farooq