Introduction to Statistics for Data Science

Advanced Level — The Fundamentals of Inferential Statistics with Probability Distributions

We've covered the basics of descriptive statistics in the first two posts of this series. It is time to move on to inferential statistics: methods that rely on probability theory and probability distributions to estimate a population’s values from sample data.

We’ve seen that descriptive statistics give us a concise summary of our sample data. For example, we were able to calculate the mean and standard deviation of the heights of players from the English Premier League. Such information can provide valuable knowledge about a group of players.

On the other hand, inferential statistics uses a random sample of data taken from a population to describe and make inferences about that population. In other words, inferential statistics aims at drawing conclusions (or “inferences”) about populations based on samples taken from them.

In short, descriptive statistics describes data (for example, with a chart or graph), while inferential statistics allows you to make predictions (“inferences”) from that data.

These are the concepts we will cover in this post:

— Probability Distributions

  • Normal
  • Binomial
  • Poisson
  • Geometric
  • Exponential

And later, in other posts, we will be looking at:

  • Student’s T

— Point estimates

— Confidence Intervals

— Significance tests

  • One-sample T-test
  • Chi-squared goodness of fit test

What is a Probability Distribution?

“A probability distribution is a mathematical function that, stated in simple terms, can be thought of as providing the probability of occurrence of different possible outcomes in an experiment”. — Wikipedia

Another way to think about it is to see a distribution as a function that shows the possible values for a variable and how often they occur.

It is a common mistake to believe that the distribution is the graph when in fact it’s the “rule” that determines how values are positioned in relation to each other.
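To make the “rule versus graph” idea concrete, here is a minimal sketch using a fair six-sided die. The die is only an illustrative assumption, not an example from this series. The dictionary is the rule; the simulated counts are one possible picture of data generated by it.

```python
import random
from collections import Counter
from fractions import Fraction

# The "rule": every face of a fair die has probability 1/6 (illustrative assumption)
die_distribution = {face: Fraction(1, 6) for face in range(1, 7)}
total_probability = sum(die_distribution.values())  # probabilities must add up to 1

# Data generated by that rule; a histogram of it would be the "graph"
random.seed(0)  # fixed seed so the simulation is reproducible
rolls = [random.randint(1, 6) for _ in range(6000)]
observed_counts = Counter(rolls)

for face in sorted(die_distribution):
    expected = float(die_distribution[face]) * len(rolls)
    print(face, "expected:", expected, "observed:", observed_counts[face])
```

The observed counts wobble around the expected 1,000 per face, but the rule itself never changes.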

Below is a map of the relationships between many of the distributions out there, a number of which derive from the Bernoulli distribution. Each distribution is illustrated by an example of its probability density function (PDF), which we’ll see later.


We will start with the most widely used distribution, the Normal distribution, for the following reasons:

  • It approximates a wide variety of random variables;
  • The distribution of sample means, for large enough sample sizes, can be approximated by a normal distribution;
  • The statistics computed from it have simple, elegant forms;
  • It is heavily used in regression analysis;
  • Decisions based on normal distribution insights have a good track record.
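The second point is the Central Limit Theorem at work. The sketch below is a small simulation under assumed settings (an exponential population with rate 1, samples of size 50, 5,000 repetitions), all chosen only for illustration. Even though the population is strongly skewed, the sample means behave roughly like a normal distribution.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Mean of n draws from a skewed population (exponential with rate 1)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# 5,000 sample means, each computed from a sample of size 50
means = [sample_mean(50) for _ in range(5000)]

center = statistics.fmean(means)  # should sit near the population mean, 1.0
spread = statistics.stdev(means)  # should sit near 1 / sqrt(50), about 0.141

# In a normal distribution roughly 68% of values lie within one standard deviation
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(center, spread, within_one_sd)
```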

Normal Distribution

Also known as the Gaussian distribution or the bell curve, it is a continuous probability distribution, and it’s the most common distribution you’ll find. A distribution of a dataset shows the frequency at which possible values occur. It is written with the following notation:

X ~ N(μ, σ²)

with N standing for normal, ~ for “is distributed as”, μ being the mean and σ² the variance.

The Normal distribution is symmetrical, and its mean, median and mode are equal, so it has no skewness.

For a practical example, let’s use the average height of men in Portugal. On average a Portuguese male measures 174 cm, with a standard deviation of 8.2 cm. Taking this distribution into consideration, we’ll use Python’s scipy.stats module to generate 10,000 data points. We will use the rvs() method, where size is the number of points we wish to generate, loc is the mean and scale is the standard deviation.

import pandas as pd
from scipy import stats

normal_data = stats.norm.rvs(size=10000, loc=174, scale=8.2, random_state=0)

pd.Series(normal_data).plot(kind="hist", bins=50)  # plotting requires matplotlib

We can clearly see that we now have a normal distribution for the height of men in Portugal, centred on roughly 174 cm, since the highest frequencies cluster around this value. Now that we’ve obtained our distribution, we can use some functions to get more insight into the data.
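As a quick sanity check before using those functions, the same sample can be summarized numerically rather than read off a histogram. This is only a sketch: the variable names are illustrative, and the exact values depend on the random seed.

```python
import numpy as np
from scipy import stats

# Same draw as in the text: 10,000 heights with mean 174 and standard deviation 8.2
normal_data = stats.norm.rvs(size=10000, loc=174, scale=8.2, random_state=0)

sample_mean = np.mean(normal_data)      # expected to land close to 174
sample_median = np.median(normal_data)  # expected to land close to the mean
sample_std = np.std(normal_data)        # expected to land close to 8.2
sample_skew = stats.skew(normal_data)   # expected to land close to 0 (symmetry)

print(sample_mean, sample_median, sample_std, sample_skew)
```

A mean and median that nearly coincide, together with a skewness near zero, is what the symmetry of the Normal distribution predicts.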

Cumulative distribution function: cdf()

This function gives us the probability that a random observation will have a value lower than or equal to the one provided. For example, imagine we select a random male from the population and get someone who is 186 cm tall. What percentage of Portuguese men will be shorter than this individual?

stats.norm.cdf(x=186,      # Cutoff value (quantile) to check
               loc=174,    # Mean
               scale=8.2)  # Standard deviation

Therefore, we can state that around 93% of Portuguese men will be shorter.
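To see where that probability comes from, here is a hand-rolled version of the normal CDF using only the standard library. It is a sketch of the textbook formula Φ(z) = ½·(1 + erf(z/√2)), not SciPy’s internal implementation, and normal_cdf is a helper name made up for this example.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the error function."""
    z = (x - mean) / sd  # how many standard deviations x lies from the mean
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_shorter = normal_cdf(186, 174, 8.2)
print(round(p_shorter, 3))  # roughly 0.928
```

A height of 186 cm sits about 1.46 standard deviations above the mean, which is why the result lands near 93%.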

Percent point function: ppf()

This function is the inverse of cdf(): instead of giving a quantile and receiving a probability, we input the probability and receive the quantile.

We want to know the height below which 83% of men fall!

stats.norm.ppf(q=0.83,     # Cumulative probability to look up
               loc=174,    # Mean
               scale=8.2)  # Standard deviation

We can say that 83% of men measure less than roughly 182 cm.
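Because ppf() and cdf() are inverses, feeding the output of one into the other should return the starting value. A minimal round-trip check, with illustrative variable names:

```python
from scipy import stats

q = 0.83
cutoff_height = stats.norm.ppf(q, loc=174, scale=8.2)            # probability -> height
recovered_q = stats.norm.cdf(cutoff_height, loc=174, scale=8.2)  # height -> probability

print(cutoff_height, recovered_q)
```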

Probability density function: pdf()

This function gives us the probability density at a certain value, which we can read as the relative likelihood of a random variable landing close to that value. For example, how likely is it that a man chosen at random from the population is around 154 cm tall?

stats.norm.pdf(x=154,      # Value to check
               loc=174,    # Mean
               scale=8.2)  # Standard deviation

What about the likelihood of choosing someone, like me, who is 186 cm?

stats.norm.pdf(x=186,      # Value to check
               loc=174,    # Mean
               scale=8.2)  # Standard deviation

Historically speaking, it is fun to notice that, compared with 100 years ago, nowadays it is more common to find someone with a height of 186 cm than of 154 cm.

Therefore, we can conclude Portuguese people have grown a lot in the last decades! 💪
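One caveat: for a continuous distribution, pdf() returns a density, not a probability. The sketch below compares the two densities and then turns one into an actual probability by integrating over a 1 cm window with cdf(). The window width is an assumption that mimics rounding heights to the nearest centimetre.

```python
from scipy import stats

heights = stats.norm(loc=174, scale=8.2)  # "frozen" distribution with fixed parameters

density_154 = heights.pdf(154)
density_186 = heights.pdf(186)
ratio = density_186 / density_154  # how much more likely 186 cm is than 154 cm

# Probability of measuring 186 cm once heights are rounded to the nearest cm
p_186_rounded = heights.cdf(186.5) - heights.cdf(185.5)

print(density_154, density_186, ratio, p_186_rounded)
```

Over such a narrow window the probability is numerically close to the density, which is why reading pdf() as a “likelihood” works as a rule of thumb.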

The Standard Normal Distribution

The Standard Normal Distribution is a particular case of the Normal distribution. It has a mean of 0 and a standard deviation of 1.

Every Normal distribution can be “standardized” using the following formula:

z = (x − μ) / σ

where x is the original value, μ the mean and σ the standard deviation.

But you may be asking yourself, “why would I want to standardize an already Normal distribution?” Well, with standardization you’re able to:

  • Compare different normally distributed datasets;
  • Detect normality;
  • Detect outliers;
  • Create confidence intervals;
  • Test hypotheses;
  • Perform regression analysis.
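Here is a minimal sketch of standardization in practice, assuming the usual z-score z = (x − μ) / σ and reusing the height example. The |z| > 3 cutoff for outliers is a common rule of thumb, not something prescribed by this post.

```python
import numpy as np
from scipy import stats

mu, sigma = 174, 8.2
heights = stats.norm.rvs(size=10000, loc=mu, scale=sigma, random_state=0)

# Standardize every observation: z = (x - mu) / sigma
z_scores = (heights - mu) / sigma

# The same question asked on the original scale and on the standard scale
z_186 = (186 - mu) / sigma
p_original = stats.norm.cdf(186, loc=mu, scale=sigma)
p_standard = stats.norm.cdf(z_186)  # defaults to loc=0, scale=1

# Rule-of-thumb outlier check (assumption): flag values with |z| > 3
outlier_share = np.mean(np.abs(z_scores) > 3)

print(z_scores.mean(), z_scores.std(), p_original, p_standard, outlier_share)
```

After standardization the sample has a mean near 0 and a standard deviation near 1, and both probability calculations agree.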

Binomial Distribution

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N. — Wikipedia

Take the example of flipping a coin: the binomial distribution lets you describe this kind of experiment. This is the notation for this type of distribution:

X ~ B(n, p)

where n is the number of trials and p the success probability in each trial.

Continuing with the coin-flipping example, let’s toss a coin 10 times and assume the probability of getting heads is 50% on each toss. Therefore, n = 10 and p = 0.5, and our binomial distribution is B(10, 0.5).

Similar to our last example, let’s generate 10,000 data points from this distribution, meaning we’ll perform 10,000 experiments in which we flip a coin 10 times. Since we are dealing with a binomial distribution, we’ll use SciPy’s stats.binom and its rvs() method.

binomial_data = stats.binom.rvs(size=10000, n=10, p=0.5, random_state=0)
pd.Series(binomial_data).plot(kind="hist", bins = 50)

Unlike the normal distribution, which is continuous, this is a discrete distribution, meaning the random variable can only assume discrete integer values. It looks like the normal distribution because of its symmetry; however, this changes when we update the value of p.

binomial_data = stats.binom.rvs(size=10000, n=10, p=0.8, random_state=0)
pd.Series(binomial_data).plot(kind="hist", bins = 50)

These results are skewed: since we’re assuming that heads comes up 80% of the time, our distribution has “shifted” to the right.
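The shift can also be read from the theoretical moments instead of the histogram. This sketch asks SciPy for the mean, variance and skewness of both coins; the summary dictionary is just an illustrative container.

```python
from scipy import stats

summary = {}
for p in (0.5, 0.8):
    # moments="mvs" requests mean, variance and skewness
    mean, variance, skewness = stats.binom.stats(10, p, moments="mvs")
    summary[p] = (float(mean), float(variance), float(skewness))
    print(p, summary[p])
```

The fair coin has a mean of n·p = 5 and zero skewness. The biased coin has a mean of 8 and a negative skewness, meaning the bulk of the mass sits on the right with a tail stretching to the left.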

Similar to what we’ve done before, let’s take a look at some functions.

Cumulative distribution function: cdf()

As seen before, this function will tell us the probability of a random variable assuming a value lower or equal to the one provided.

For example, what’s the probability of getting 7 heads or fewer in 10 trials with our biased coin?

stats.binom.cdf(k=7,    # Probability of k = 7 heads or fewer
                n=10,   # In 10 trials
                p=0.8)  # And success probability 0.8

On the other hand, if you want the probability of getting at least 7 heads in 10 trials, you can calculate it as the probability of not getting 6 or fewer heads in those trials.

We express the “NOT” by subtracting the probability from 1, like this:

1 - stats.binom.cdf(k=6,    # Probability of k = 6 heads or fewer
                    n=10,   # In 10 trials
                    p=0.8)  # And success probability 0.8
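The same “at least 7 heads” probability can be reached in three equivalent ways, which makes a handy cross-check. A short sketch: sf() is SciPy’s survival function, defined as 1 − cdf, and the variable names are illustrative.

```python
from scipy import stats

n, p = 10, 0.8

via_complement = 1 - stats.binom.cdf(6, n, p)  # 1 - P(X <= 6)
via_sf = stats.binom.sf(6, n, p)               # survival function: P(X > 6)
via_pmf_sum = sum(stats.binom.pmf(k, n, p) for k in range(7, n + 1))  # P(7) + ... + P(10)

print(via_complement, via_sf, via_pmf_sum)  # all three are roughly 0.879
```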

Probability mass function: pmf()

For the normal distribution we used the probability density function. The binomial distribution, however, is a discrete probability distribution, so to check the proportion of observations at a certain point we need to use the probability mass function, pmf().

Let’s check the probability of getting exactly 5 heads in 10 tries, on our biased coin.

stats.binom.pmf(k=5,    # Probability of exactly k = 5 heads
                n=10,   # With 10 flips
                p=0.8)  # And success probability 0.8
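For intuition, the binomial pmf can be written by hand from its textbook formula, P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The helper below is a standard-library sketch with a made-up name, not SciPy’s implementation.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

p_five_heads = binomial_pmf(5, 10, 0.8)
total = sum(binomial_pmf(k, 10, 0.8) for k in range(11))  # all outcomes together

print(round(p_five_heads, 4), total)  # roughly 0.0264, and a total of 1 up to rounding
```

With a coin this biased towards heads, exactly 5 heads in 10 flips is a rare outcome.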

Poisson Distribution

The Poisson distribution is the distribution of a count: the number of times something happened. Unlike what we’ve seen so far, it is not parameterized by a probability p or a number of trials n, but by an average rate λ.

The Poisson distribution is very useful when you are trying to count events over a period of time, given a constant average rate at which they occur, such as the number of patients a hospital will receive in its Emergency Department in an hour. Therefore, we can think of this type of distribution as giving the probability of an event occurring a certain number of times within a certain timeframe.
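A short sketch of the emergency department example with scipy.stats.poisson, where the rate parameter λ is called mu. The rate of 12 patients per hour is a made-up figure used only for illustration.

```python
from scipy import stats

rate = 12  # hypothetical average number of patients per hour (lambda)

# Simulate 10,000 one-hour periods
arrivals = stats.poisson.rvs(mu=rate, size=10000, random_state=0)

p_exactly_10 = stats.poisson.pmf(k=10, mu=rate)  # exactly 10 patients in an hour
p_at_most_15 = stats.poisson.cdf(k=15, mu=rate)  # 15 patients or fewer
p_more_than_20 = stats.poisson.sf(20, mu=rate)   # more than 20 patients

print(arrivals.mean(), arrivals.var())  # both should land near the rate
print(p_exactly_10, p_at_most_15, p_more_than_20)
```

A mean and variance that are both close to λ are a signature of Poisson-distributed counts.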

Geometric Distribution

The Geometric distribution models the number of failures before the first success. The Negative Binomial distribution generalizes it: it models the number of failures until r successes have occurred, not just 1. Both are parameterized by the success probability (p) rather than by a number of trials (n), because the number of failed trials is the outcome itself.

If for the binomial distribution you ask the question “How many successes?”, for the geometric distribution you ask “How many failures until a success?”.
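The sketch below ties the two distributions together with a fair coin (an assumed example). One detail to watch: scipy.stats.geom counts the number of trials up to and including the first success, while scipy.stats.nbinom counts failures, so the two are offset by one.

```python
from scipy import stats

p = 0.5       # success probability on each trial (fair coin, for illustration)
failures = 3  # tails before the first head

by_hand = (1 - p) ** failures * p              # three failures, then one success
via_geom = stats.geom.pmf(failures + 1, p)     # geom counts trials, so 3 failures = 4 trials
via_nbinom = stats.nbinom.pmf(failures, 1, p)  # negative binomial with r = 1 success

# Generalization: 3 failures before the 3rd success
p_three_successes = stats.nbinom.pmf(failures, 3, p)

print(by_hand, via_geom, via_nbinom, p_three_successes)
```

With r = 1 the negative binomial collapses to the geometric distribution, which is why the first three numbers agree.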

Exponential Distribution

The exponential distribution is great for modeling scenarios like the time you need to wait before your bus arrives, knowing that a bus comes every 15 minutes on average. The exponential distribution should come to mind when thinking of “time until event”, or maybe “time until failure”.

Imagine a case where the Internet service in your neighborhood goes offline. The support line will probably be flooded with calls from angry customers. Take the perspective of the call center worker: how long until the next customer calls?

We can see this as a Geometric distribution, where every second that passes without a customer call is considered a failure. The number of failures is the number of seconds in which nobody called, which is basically the total amount of time until the next call. Since we’re dealing with time, if we take this geometric distribution to the limit, towards infinitesimal time slices, we get an Exponential distribution! This type of distribution is continuous and, like the Poisson distribution, it is parameterized by a rate λ.
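That limiting argument can be checked numerically. In the sketch below the call rate (2 calls per minute) and the waiting time (1 minute) are assumed values for illustration. Time is chopped into ever smaller slices, each slice is treated as a geometric trial with call probability λ·Δt, and the result is compared with the exponential formula P(T > t) = e^(−λt).

```python
import math

rate = 2.0  # hypothetical: 2 calls per minute on average (lambda)
t = 1.0     # probability of waiting longer than 1 minute

exact = math.exp(-rate * t)  # exponential: P(no call within t)
# SciPy equivalent: stats.expon.sf(t, scale=1 / rate)

errors = []
for slices_per_minute in (10, 100, 10000):
    dt = 1 / slices_per_minute           # length of one time slice
    p_call = rate * dt                   # chance of a call within one slice
    n_slices = int(t * slices_per_minute)
    approx = (1 - p_call) ** n_slices    # geometric: every slice is a "failure"
    errors.append(abs(approx - exact))
    print(slices_per_minute, approx, exact)
```

As the slices shrink, the geometric answer closes in on the exponential one.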
