Interview Guide to Probability Distributions

Top probability distributions fundamental to Data Science and Artificial Intelligence related jobs.

Published in Acing AI

6 min read, Nov 7, 2018

Mathematical and statistical knowledge is the backbone of data science. What distinguishes data scientists from software engineers is their prowess in statistics and probability.

What Probability distributions are to Data Science, Data Structures are to software engineering.

They are the fundamental building blocks that guide your decisions and model selection for experiments. This article covers the fundamental probability distributions you need to know for a Data Science job and, by extension, for a Data Science interview.

Some frequently asked interview questions based on probability distributions:

1. Given a random generator that produces a number from 1 to 5 uniformly, write a function that produces a number from 1 to 7 uniformly.
2. Three friends in Seattle told you it’s rainy. Each has a probability of 1/3 of lying. What’s the probability that Seattle is rainy?
3. There are 6 marbles in a bag, 1 of which is white. You reach in the bag 100 times. After drawing a marble, it is placed back in the bag. What is the probability of drawing that white marble at least once?
4. How would you write a function to make a biased coin from a fair coin?
5. Estimate the probability of a disease in one city, given that the probability is very low nationwide. We randomly asked 1000 people in this city, and all gave a negative response (no disease). What is the probability of the disease in the city?
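The first question is commonly approached with rejection sampling. The following is a minimal sketch of that idea, not the only valid answer; `rand5` is a hypothetical stand-in for the generator the question hands you, and the optional `rng` argument is an assumption added so that runs can be seeded.

```python
import random

def rand5(rng=random):
    """Stand-in for the given generator: a uniform integer from 1 to 5."""
    return rng.randint(1, 5)

def rand7(rng=random):
    """Uniform integer from 1 to 7, built only from calls to rand5."""
    while True:
        # Two calls give a uniform value in 0..24 (25 equally likely outcomes).
        value = 5 * (rand5(rng) - 1) + (rand5(rng) - 1)
        # Keep only 0..20: that is 21 = 3 * 7 outcomes, so each result 1..7
        # is hit by exactly three of them. Values 21..24 are thrown away.
        if value < 21:
            return value % 7 + 1
```

Rejecting the four leftover values is what keeps the result exactly uniform; mapping all 25 outcomes onto 7 numbers would make some numbers slightly more likely than others.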

Bernoulli and Uniform

When we toss a coin, we think of the Bernoulli distribution. It represents a single trial with two outcomes, where 1 and 0 stand for “heads” and “tails” (or vice versa), and p is the probability of the outcome labelled 1. The outcome of the experiment is boolean in nature. For example, the probability of getting heads when flipping a fair coin is p = 0.5. The probability of tails is 1 - p (1 minus the probability of heads, which also equals 0.5 for a fair coin).

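Probability mass function (PMF) of the Bernoulli distribution (it is discrete, so it has a mass function rather than a density): P(X = x) = p^x (1 - p)^(1 - x) for x in {0, 1}, that is, P(X = 1) = p and P(X = 0) = 1 - p.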

Source: Wolfram

To learn more about the Bernoulli distribution, such as its mean, variance and standard deviation, please visit Wolfram MathWorld.
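As a quick illustration, a Bernoulli variable with success probability p has mean p and variance p(1 - p). The sketch below uses hypothetical helper names and only Python's standard library; the `rng` argument is an assumption added so that a seeded generator can be passed in.

```python
import random

def bernoulli(p, rng=random):
    """Draw one Bernoulli trial: 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

def bernoulli_pmf(x, p):
    """P(X = x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

def bernoulli_mean_var(p):
    """Theoretical mean and variance of a Bernoulli(p) variable."""
    return p, p * (1 - p)
```

Averaging many calls to `bernoulli(0.3)` should land close to 0.3, the theoretical mean returned by `bernoulli_mean_var(0.3)`.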

Imagine rolling a fair die. The outcomes 1 to 6 are equally likely. This is the uniform distribution, characterized by its flat PMF or PDF. It can be defined for any number of outcomes n, or as a continuous distribution over an interval. A uniform distribution, also called a rectangular distribution, is a probability distribution in which every outcome has the same constant probability (or probability density).
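To make "flat" concrete, here is a small sketch with hypothetical helper names: the discrete case assigns 1/n to each of n outcomes, the continuous case has density 1/(b - a) on the interval [a, b], and simulated die rolls should show roughly equal frequencies.

```python
import random
from collections import Counter

def discrete_uniform_pmf(n):
    """Probability of each of n equally likely outcomes."""
    return 1 / n

def continuous_uniform_pdf(x, a, b):
    """Density of a uniform distribution on the interval [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0

def roll_frequencies(num_rolls, rng=random):
    """Observed relative frequency of each face of a simulated fair die."""
    counts = Counter(rng.randint(1, 6) for _ in range(num_rolls))
    return {face: counts[face] / num_rolls for face in range(1, 7)}
```

With a large `num_rolls`, every value returned by `roll_frequencies` should sit near `discrete_uniform_pmf(6)`.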

Poisson

How would you model the count of customers clicking a particular link on your website each minute? The number of clicks per minute, and likewise the total number of clicks in a day, can be modelled by a Poisson distribution. The Poisson distribution applies when events occur independently at random points in time or space, at a constant average rate, and our interest lies only in the total number of occurrences. It is also the limiting form of the binomial distribution (covered below) when the number of trials becomes large and the probability of success becomes small, with their product held fixed.

Where the rate of occurrence of some event, r (usually written lambda, λ), is small, the likely counts lie near zero: when the rate r is small, zero is a very likely number to get. As the rate becomes higher (as the thing we are watching becomes more common), the center of the curve moves toward the right, and by around r = 7, zero occurrences becomes very unlikely (its probability is e^(-7), below 0.1%).
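The effect of the rate on the chance of seeing zero events can be checked directly from the Poisson PMF, P(X = k) = e^(-λ) λ^k / k!. The sketch below uses a hypothetical function name and only the standard library.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def prob_zero_events(lam):
    """Probability of observing no events at all, which is e^(-lam)."""
    return poisson_pmf(0, lam)
```

For example, `prob_zero_events(0.5)` is about 0.61, while `prob_zero_events(7)` is about 0.0009, which matches the shift of the curve away from zero as the rate grows.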

When customers arrive at a movie theatre to buy tickets and are served first come, first served (FIFO), the number of arrivals in a given time interval is similar to a Poisson distribution.

For more detailed reading: The Poisson Distribution

Binomial

The sum of the outcomes of independent Bernoulli trials with the same p follows a binomial distribution. We use it when we toss a coin more than once and want to describe the result: when tossing the coin n times, the count of heads follows the binomial distribution. Its parameters are n, the number of trials, and p, the probability of a “success” (say, heads). Each flip is a Bernoulli trial. Note that each flip must be independent of the other flips.

In question 3 above (the marble question), the number of times the white marble is drawn also follows a binomial distribution, with n = 100 and p = 1/6.
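Under that binomial model, the marble question reduces to one minus the probability of zero successes. A minimal sketch with hypothetical helper names (`math.comb` requires Python 3.8 or newer):

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least_one(n, p):
    """P(at least one success) = 1 - P(no successes) = 1 - (1 - p)^n."""
    return 1 - (1 - p)**n

# Marble question: 100 draws with replacement, 1 white marble out of 6.
answer = prob_at_least_one(100, 1 / 6)
```

Here `answer` equals 1 - (5/6)^100, which is extremely close to 1: missing the white marble 100 times in a row is very unlikely.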

Now imagine a variant of question 3 in which you don’t put the marble back after you draw it. In that case, the distribution is hypergeometric. If the number of marbles is large relative to the number of draws, the two distributions are similar, because the chance of success changes very little with each draw.

When people talk about picking marbles or balls from bags without replacement, the problem is hypergeometric in nature. More broadly, it should come to mind when picking out a significant subset of a population as a sample.
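The difference between drawing with and without replacement can be seen by comparing the two PMFs. In the sketch below, the function names and the bag sizes are illustrative assumptions; only the 6-marble bag with 1 white marble comes from the question.

```python
import math

def hypergeom_pmf(k, population, successes, draws):
    """P(k successes) when drawing `draws` items without replacement."""
    return (math.comb(successes, k)
            * math.comb(population - successes, draws - k)
            / math.comb(population, draws))

def binomial_pmf(k, n, p):
    """P(k successes) when drawing n times with replacement."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Small bag: 6 marbles, 1 white, 3 draws. The two models disagree clearly.
small_without = hypergeom_pmf(1, 6, 1, 3)
small_with = binomial_pmf(1, 3, 1 / 6)

# Large bag with the same proportion: 6000 marbles, 1000 white, 3 draws.
large_without = hypergeom_pmf(1, 6000, 1000, 3)
large_with = binomial_pmf(1, 3, 1 / 6)
```

In the small bag the probability of exactly one white marble is 0.5 without replacement versus roughly 0.35 with replacement; in the large bag the two values nearly coincide.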

To intuitively play around with Binomial distribution values: Demonstration

To read in detail on Binomial distribution: Wolfram Binomial Distribution

Exponential

There is a strong relationship between the Poisson distribution and the exponential distribution. For example, let’s say a Poisson distribution models the number of file requests on a server in a day. The time between consecutive file requests can then be modeled with an exponential distribution: if requests arrive at an average rate of λ per day, the gaps between them have a mean of 1/λ days. The exponential distribution is widely used in product reliability testing. It’s also an important distribution for building continuous-time Markov chains.

Poisson’s “How many events per time?” in an experiment relates to the exponential’s “How long until an event?”.

In continuous-time problems, whenever the question is about the “time until an event”, an exponential distribution is usually a good first model.
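The link between the two distributions can be checked by simulation: generate exponential gaps between events, then count how many events fall in each unit of time. The counts should behave like a Poisson variable, whose mean and variance both equal the rate. This is a sketch with a hypothetical function name; `random.expovariate` takes the rate (1 divided by the mean gap).

```python
import random
import statistics

def simulate_arrival_counts(rate, num_windows, rng=random):
    """Count events per unit-length window when gaps are Exponential(rate)."""
    counts = [0] * num_windows
    t = rng.expovariate(rate)          # time of the first event
    while t < num_windows:
        counts[int(t)] += 1            # the event falls in window [int(t), int(t) + 1)
        t += rng.expovariate(rate)     # add the waiting time until the next event
    return counts

counts = simulate_arrival_counts(rate=3.0, num_windows=20000, rng=random.Random(1))
mean_count = statistics.fmean(counts)
var_count = statistics.pvariance(counts)
```

With a rate of 3 events per window, both `mean_count` and `var_count` should come out close to 3.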

Normal and related distributions

Source: https://www.mathsisfun.com/data/standard-normal-distribution.html

Saving the best for last: this is perhaps the most important distribution of all. Its bell shape is instantly recognizable. Let’s take the heights of people as an example. If we plot that data, we will see that the bulk of people fall in the middle of the graph. If you take a bunch of independent values following the same distribution (any distribution with finite variance) and sum them, the distribution of their sum is approximately normal. The more things that are summed, the more closely their sum’s distribution matches the normal distribution. The fact that this is true regardless of the underlying distribution is amazing.

This result is the central limit theorem, a concept central to probability distributions and closely tied to the normal distribution. Expect an interview question that relates to it.
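A quick way to see the central limit theorem at work is to sum uniform random numbers and check how much of the result lies within one standard deviation of the mean. For a single uniform draw that share is about 58%; for a normal distribution it is about 68%. The helper names and sample sizes below are illustrative assumptions.

```python
import random
import statistics

def sample_sums(num_terms, num_sums, rng=random):
    """Generate num_sums sums, each of num_terms independent Uniform(0, 1) draws."""
    return [sum(rng.random() for _ in range(num_terms)) for _ in range(num_sums)]

def fraction_within_one_sd(values):
    """Share of values lying within one standard deviation of their mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - mean) <= sd for v in values) / len(values)

rng = random.Random(2)
single = fraction_within_one_sd(sample_sums(1, 20000, rng))    # flat, not bell-shaped
summed = fraction_within_one_sd(sample_sums(30, 20000, rng))   # approximately normal
```

Here `single` should land near 0.58 and `summed` near 0.68, even though every individual term is drawn from a flat distribution.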

Important Properties of a Normal distribution:

The mean, mode and median are all equal.
The curve is symmetric at the center (i.e. around the mean, μ).
Exactly half of the values are to the left of center and exactly half the values are to the right.
The total area under the curve is 1.
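These properties can be verified with `statistics.NormalDist` from Python's standard library (Python 3.8 or newer). The mean of 170 and standard deviation of 10 are made-up numbers for illustration, and `area_under_pdf` is a hypothetical helper that integrates the density numerically.

```python
from statistics import NormalDist

# Illustrative numbers only: heights in cm with mean 170 and standard deviation 10.
heights = NormalDist(mu=170, sigma=10)

def area_under_pdf(dist, width_in_sd=8, steps=10_000):
    """Trapezoid-rule estimate of the area under dist.pdf within +/- width_in_sd."""
    lo = dist.mean - width_in_sd * dist.stdev
    hi = dist.mean + width_in_sd * dist.stdev
    step = (hi - lo) / steps
    total = 0.5 * (dist.pdf(lo) + dist.pdf(hi))
    total += sum(dist.pdf(lo + i * step) for i in range(1, steps))
    return total * step

print(heights.mean, heights.median, heights.mode)  # mean, median and mode coincide
print(heights.cdf(heights.mean))                   # half of the mass lies to the left
print(heights.pdf(160), heights.pdf(180))          # symmetric about the mean
print(area_under_pdf(heights))                     # close to 1
```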

A log-normal (or Galton) distribution is the distribution of a random variable whose logarithm is normally distributed. Just as sums of many independent things tend to be normally distributed, products of many independent positive things tend to be log-normally distributed.
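The sums-versus-products idea can be sketched by multiplying many positive random factors together. The raw products are strongly right-skewed (mean well above median), while their logarithms, being sums, are roughly symmetric. The choice of Uniform(0.5, 1.5) factors and the function name are assumptions made purely for illustration.

```python
import math
import random
import statistics

def product_of_factors(num_factors, rng=random):
    """Product of num_factors independent Uniform(0.5, 1.5) draws."""
    product = 1.0
    for _ in range(num_factors):
        product *= rng.uniform(0.5, 1.5)
    return product

rng = random.Random(3)
products = [product_of_factors(30, rng) for _ in range(20000)]
logs = [math.log(x) for x in products]

# Right skew in the raw products: the mean sits well above the median.
raw_mean, raw_median = statistics.fmean(products), statistics.median(products)
# Near symmetry on the log scale: mean and median almost agree.
log_mean, log_median = statistics.fmean(logs), statistics.median(logs)
```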

Let’s say you have k independent random values taken from a standard normal distribution. The chi-square distribution with k degrees of freedom is the distribution of the sum of the squares of these values. It underlies the chi-squared test, whose statistic is a sum of squared differences that are assumed to be normally distributed.
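As a sanity check on that definition, a chi-square variable with k degrees of freedom has mean k and variance 2k. The sketch below builds samples directly as sums of squared standard normal draws; the function name and the choice k = 4 are illustrative assumptions.

```python
import random
import statistics

def chi_square_sample(k, rng=random):
    """One draw from chi-square with k degrees of freedom, built from its definition."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(k))

rng = random.Random(4)
samples = [chi_square_sample(4, rng) for _ in range(20000)]
sample_mean = statistics.fmean(samples)       # expected to be near k = 4
sample_var = statistics.pvariance(samples)    # expected to be near 2k = 8
```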

Conclusion:

The distributions above are the ones data scientists most commonly see and use day to day. The list should be treated as a starting point, not by any means a complete primer on probability distributions.

Probability distributions are at the heart of any Data Science or AI related problem. Hence, they are also at the heart of any Data Science interview.

Sources (in addition to the ones listed inline):

Statistics Distribution: NYU Stern

All probability distributions: Detailed map of all univariate distributions

Umass resources: Umass.edu resources

Wolfram world: Mathworld @ Wolfram