Overview of data distributions
How to choose the right distribution to model your data
There are over 20 different types of data distributions (applied to the continuous or the discrete space) commonly used in data science to model various types of phenomena. They also have many interconnections, which allow us to group them into families of distributions. A great blog post proposes the following visualization, where the solid lines represent an exact relationship (special case, transformation or sum) and the dashed lines indicate a limit relationship. The same post provides a detailed explanation of these relationships, and this paper provides a thorough analysis of the interactions between distributions.
The following section describes each type of distribution: the phenomena it typically models, example scenarios where it makes sense to choose it, its probability density/mass function, and its typical shape in a visualization.
The probability density function (PDF) applies to continuous random variables and can be thought of as a smooth version of a histogram: the probability that X falls in an interval is the integral of the density over that interval. The cumulative distribution function (CDF) can be expressed as F(x) = P(X ≤ x), the probability that X takes a value less than or equal to x. The probability mass function (PMF) applies to the discrete domain and gives the probability that a discrete random variable is exactly equal to some value.
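To make the difference between these three functions concrete, here is a minimal sketch using only the Python standard library. The standard normal distribution and the fair six-sided die are illustrative choices of mine, not examples from the post, and the helper names die_pmf and die_cdf are hypothetical.

```python
from statistics import NormalDist

# Continuous case: a standard normal (illustrative assumption).
std_normal = NormalDist(mu=0.0, sigma=1.0)

# The PDF value is the height of the density curve, not a probability.
density_at_zero = std_normal.pdf(0.0)

# The CDF is a probability: F(0) = P(X <= 0), which is 0.5 by symmetry.
prob_up_to_zero = std_normal.cdf(0.0)

# The probability of an interval is the area under the PDF,
# which equals a difference of CDF values: P(-1 < X <= 1) = F(1) - F(-1).
prob_within_one_sd = std_normal.cdf(1.0) - std_normal.cdf(-1.0)


# Discrete case: a fair six-sided die (illustrative assumption).
def die_pmf(k: int) -> float:
    """P(X = k): probability that the die shows exactly k."""
    return 1 / 6 if k in range(1, 7) else 0.0


def die_cdf(x: float) -> float:
    """P(X <= x): sum of the PMF over all faces not exceeding x."""
    return sum(die_pmf(k) for k in range(1, 7) if k <= x)


print(f"PDF at 0:          {density_at_zero:.4f}")
print(f"CDF at 0:          {prob_up_to_zero:.4f}")
print(f"P(-1 < X <= 1):    {prob_within_one_sd:.4f}")
print(f"P(die == 3):       {die_pmf(3):.4f}")
print(f"P(die <= 3):       {die_cdf(3):.4f}")
```

Note that the PMF returns a probability directly, while the PDF only yields a probability once it is integrated over an interval.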
The Bernoulli distribution is a discrete distribution describing a single trial with 2 outcomes (success/failure). It is the basis for defining other, more complex distributions that analyze more than one trial, such as the next 3 distributions.
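A single Bernoulli trial can be sketched in a few lines of standard-library Python. The success probability p = 0.3, the seed, and the helper names bernoulli_pmf and bernoulli_sample are assumptions made for illustration only.

```python
import random


def bernoulli_pmf(k: int, p: float) -> float:
    """P(X = k) for one trial with success probability p (k is 0 or 1)."""
    if k == 1:
        return p
    if k == 0:
        return 1.0 - p
    return 0.0  # any other outcome is impossible


def bernoulli_sample(p: float, rng: random.Random) -> int:
    """Draw one trial: 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0


p = 0.3                      # assumed success probability
rng = random.Random(42)      # fixed seed so the run is reproducible

# Repeating the single trial many times: the share of successes
# should settle near p.
draws = [bernoulli_sample(p, rng) for _ in range(10_000)]
empirical_rate = sum(draws) / len(draws)

print(f"P(success) = {bernoulli_pmf(1, p)}")
print(f"P(failure) = {bernoulli_pmf(0, p)}")
print(f"empirical success rate over {len(draws)} trials: {empirical_rate:.3f}")
```

Counting the successes across several independent draws like these is the step that leads from a single trial to the multi-trial distributions.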
The binomial distribution gives the probability of k successes within n trials. As with the Bernoulli distribution, the trials are independent and each has 2 outcomes. Examples include estimating the number of heads in a series of n coin tosses, or how many winning lottery tickets we can expect given a total number of tickets bought. This distribution has 2 parameters, n…