Probability Distributions in Data Science and Machine Learning | Part 2

Abhishek Barai
Published in Analytics Vidhya
7 min read · Nov 24, 2020

Note: This blog is a continuation of “Probability Distributions in Data Science and Machine Learning | Part 1”. If you haven’t read it yet, it is worth reading that part first.

Here I discuss various types of continuous probability distributions and their applications in machine learning.

Continuous Probability Distributions:

Some of the standard continuous probability distributions are:

  1. Normal/Gaussian
  2. Student’s t-distribution
  3. Exponential
  4. Log-normal
  5. Power law and Pareto distribution

Normal/Gaussian Distribution:

The normal distribution is the backbone of statistics and data science. Many machine learning models work well with data that follow a normal distribution, such as:

  1. Gaussian Naive Bayes classifier
  2. Logistic regression, linear regression, and other least-squares-based regression models
  3. Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)

The sigmoid function also tends to work well with normally distributed data. Some data follow a different, skewed distribution but can be brought closer to a normal distribution with a transformation such as a logarithm or a square root.
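To make the idea of transforming data toward normality concrete, here is a minimal sketch using only the Python standard library. The data are synthetic (right-skewed draws generated for illustration, not taken from this article), and the helper name `skewness` is an assumption of this sketch. It compares the skewness before and after a log transform; a value near 0 suggests a symmetric, more normal-looking shape.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


random.seed(0)  # fixed seed so the run is repeatable

# Synthetic right-skewed data (illustrative assumption).
raw = [random.lognormvariate(0.0, 0.75) for _ in range(20_000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)

print(f"skewness before log transform: {skew_raw:.2f}")
print(f"skewness after log transform:  {skew_logged:.2f}")
```

The log transform works here because the synthetic data were chosen to be log-normal; for other skewed data a square root or another transform may fit better.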

Mathematical Definition:

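A continuous random variable “x” is said to follow a normal distribution with parameters μ (mean) and σ (standard deviation) if its probability density function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$

Such a variable is also called a normal variate.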
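As a quick check of the normal density f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)), the sketch below writes the formula out by hand and compares it with `statistics.NormalDist` from the Python standard library. The function name `normal_pdf` and the parameter values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and s.d. sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Illustrative parameters (assumed, not from the article).
mu, sigma = 170.0, 10.0

by_hand = normal_pdf(175.0, mu, sigma)
by_library = NormalDist(mu, sigma).pdf(175.0)

print(f"hand-written formula: {by_hand:.6f}")
print(f"statistics.NormalDist: {by_library:.6f}")
```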

Standard Normal Variate:

If “x” is a normal variable with mean μ and standard deviation σ, then

$$z = \frac{x - \mu}{\sigma}$$

is called the standard normal variate.
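The standardization z = (x − μ) / σ can be sketched in a few lines. The numbers below are made up for illustration, and `z_score` is a hypothetical helper name. The point is that a probability computed on the original scale matches the one computed on the standard normal scale after converting x to z.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Convert x to a standard normal variate."""
    return (x - mu) / sigma


# Illustrative values (assumed): mean 100, standard deviation 15.
mu, sigma = 100.0, 15.0
z = z_score(130.0, mu, sigma)  # 2.0

# P(X <= 130) on the original scale and P(Z <= z) on the standard scale.
p_direct = NormalDist(mu, sigma).cdf(130.0)
p_standard = NormalDist().cdf(z)

print(z, p_direct, p_standard)
```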

Standard Normal Distribution:

The simplest case of the normal distribution, known as the standard normal distribution, has mean μ = 0 and standard deviation σ = 1, and is described by the probability density function

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$$

where −∞ < z < ∞.

Distribution Curve Characteristics:

  1. The total area under the normal curve is equal to 1.
  2. It is a continuous distribution.
  3. It is symmetrical about the mean. Each half of the distribution is a mirror image of the other half.
  4. It is asymptotic to the horizontal axis.
  5. It is unimodal.
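The characteristics listed above can be checked numerically for the standard normal curve. This is a rough sketch, not a proof: it integrates the density with the trapezoidal rule on [−8, 8] (an assumed cut-off, beyond which the tails are negligible), tests symmetry at a few points, and locates the single peak on the grid.

```python
import math


def std_normal_pdf(z):
    """Density of the standard normal distribution at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Trapezoidal rule on [-8, 8] with a step that is exact in binary (2**-10).
n = 16_384
step = 16.0 / n
grid = [-8.0 + i * step for i in range(n + 1)]
dens = [std_normal_pdf(z) for z in grid]
area = step * (sum(dens) - 0.5 * (dens[0] + dens[-1]))

# Symmetry about the mean: f(z) equals f(-z).
symmetric = all(
    math.isclose(std_normal_pdf(z), std_normal_pdf(-z)) for z in (0.5, 1.0, 2.5)
)

# Unimodal: the largest density on the grid sits at the mean.
peak_location = grid[dens.index(max(dens))]

# Asymptotic to the horizontal axis: tiny but still positive far out.
tail_density = std_normal_pdf(8.0)

print(f"area under the curve: {area:.6f}")
print(f"symmetric: {symmetric}, peak at z = {peak_location}")
print(f"density at z = 8: {tail_density:.2e}")
```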

Parameters:

For a standard normal distribution, the mean is 0 and the standard deviation is 1.

Distribution:

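Figure: Normal distribution with varying standard deviation

As the figure shows, changing the standard deviation changes the spread of the curve: a smaller σ gives a taller, narrower peak, and a larger σ gives a flatter, wider one. Note that this is a change of scale, not of kurtosis; every normal distribution has the same kurtosis (3) regardless of σ.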
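The effect of the standard deviation on the curve can also be seen in numbers. The sketch below uses `statistics.NormalDist` with three assumed values of σ. It prints the peak height, which shrinks as σ grows, and the probability of falling within one standard deviation of the mean, which stays the same because only the scale of the curve changes.

```python
from statistics import NormalDist

results = {}
for sigma in (0.5, 1.0, 2.0):  # illustrative values of sigma
    dist = NormalDist(mu=0.0, sigma=sigma)
    peak = dist.pdf(0.0)                                # height at the mean
    within_one_sd = dist.cdf(sigma) - dist.cdf(-sigma)  # P(|X - mu| < sigma)
    results[sigma] = (peak, within_one_sd)
    print(f"sigma={sigma}: peak={peak:.4f}, P(within 1 sd)={within_one_sd:.4f}")
```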

Student’s t-distribution:

The Student’s t-distribution is one of the biggest breakthroughs in statistics. It allows inference from small samples when the population variance is unknown, a setting that covers a large share of the statistical problems we face today. It helps estimate the parameters of a large population from a small sample, and it supplies the critical values and p-values used in hypothesis testing at a chosen level of significance.

Visually, a Student’s t-distribution looks much like a normal distribution but generally has fatter tails. Fatter tails allow for a higher dispersion of values, reflecting the extra uncertainty that comes with a small sample. The t-statistic is related to the Student’s t-distribution in the same way that the z-statistic is related to the standard normal distribution.

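The formula for the t-statistic is

$$t_{n-1,\,\alpha} = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

where x̅ and s are the sample mean and sample standard deviation, and n is the sample size.

In words: t with (n-1) degrees of freedom and a significance level of α equals the sample mean (x̅) minus the population mean (μ), divided by the standard error of the sample mean (s/√n).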
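Here is a small worked example of the t-statistic, t = (x̅ − μ) / (s / √n), using only the standard library. The sample values and the hypothesized population mean are invented for illustration, and `t_statistic` is a hypothetical helper name.

```python
import math
import statistics


def t_statistic(sample, mu):
    """Return the one-sample t-statistic and its degrees of freedom."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample s.d. with (n - 1) in the denominator
    standard_error = s / math.sqrt(n)
    return (x_bar - mu) / standard_error, n - 1


# Made-up measurements; hypothesized population mean of 5.0.
sample = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 5.2, 5.1]
t_value, dof = t_statistic(sample, mu=5.0)

print(f"t = {t_value:.3f} with {dof} degrees of freedom")
```

For this sample the mean is 5.1 and the sample standard deviation is 0.2, so t is about 1.41 with 7 degrees of freedom.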

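As we can see, it is very similar to the standard normal variate or z-statistic; the difference is that the sample standard deviation s stands in for the unknown population σ. As the sample size grows, the t-distribution approaches the normal distribution.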

The distribution can be described using a single parameter, called the degrees of freedom (ν).

Usually, for a sample of size n, we have (n-1) degrees of freedom. So for a sample of 20 observations, we have 19 degrees of freedom. Put another way, the number of degrees of freedom is the number of independent pieces of information available to estimate a population quantity.

Distribution:

Figure: Student’s t-distribution with varying degrees of freedom

As the figure shows, the t-distribution approaches the normal distribution as the degrees of freedom increase, and its tails get closer to the x-axis.
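The convergence toward the normal distribution can be sketched numerically. The code below writes out the standard t density, f(t) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2), using `math.lgamma` for numerical stability, and compares its value in the tail (at t = 3, an arbitrary choice) with the standard normal density. The function name `t_pdf` is an assumption of this sketch.

```python
import math
from statistics import NormalDist


def t_pdf(t, dof):
    """Density of Student's t-distribution with `dof` degrees of freedom."""
    log_norm = (
        math.lgamma((dof + 1) / 2)
        - math.lgamma(dof / 2)
        - 0.5 * math.log(dof * math.pi)
    )
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(t * t / dof))


normal_tail = NormalDist().pdf(3.0)

# Gap between the t density and the normal density at t = 3.
gaps = {}
for dof in (1, 5, 30, 1000):
    gaps[dof] = t_pdf(3.0, dof) - normal_tail
    print(f"dof={dof:>4}: t density={t_pdf(3.0, dof):.5f}, gap={gaps[dof]:.5f}")
```

The gap stays positive (fatter tails) but shrinks as the degrees of freedom grow.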

Exponential Distribution:

The exponential distribution, sometimes described as the inverse of the Poisson distribution, is used to model the time elapsed between two events. For example, the amount of time from now until the next earthquake occurs follows an exponential distribution. More generally, if one earthquake occurs at some time t and the next one at a later time, the length of the gap between the two follows an exponential distribution.

An exponentially distributed random variable takes small values often and large values rarely. Shopping in a grocery supermarket is an example: most customers spend a small amount, while only a few spend a large amount. This is a general tendency.

Q. How is it an inverse case of Poisson Distribution?

Let’s take the below two cases.

  1. Number of cars passing a tollgate in one hour
  2. Number of hours between car arrivals

Case 1 asks for the number of cars per hour, so it deals with a count of events. Case 2 asks for the time interval between consecutive car arrivals. If case 1 follows a Poisson distribution, then case 2 follows an exponential distribution.

Example 2: the number of hipsters arriving at a bar in one minute, and the number of minutes between new arrivals at the same bar. The first follows a Poisson distribution, whereas the second follows an exponential distribution.
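The tollgate example can be simulated to see both views of the same process. This is a sketch under stated assumptions: an arrival rate of 6 cars per hour (a made-up number), independent arrivals, and a constant rate. Gaps between cars are drawn with `random.expovariate`, and the cars are then counted hour by hour. The mean gap should be close to 1/6 hour, while the hourly counts should have mean and variance both close to 6, as expected for a Poisson distribution.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

rate = 6.0     # assumed average number of cars per hour
hours = 5_000  # length of the simulated period

# Draw exponential gaps and accumulate them into arrival times.
gaps, arrivals, clock = [], [], 0.0
while True:
    gap = random.expovariate(rate)
    clock += gap
    if clock >= hours:
        break
    gaps.append(gap)
    arrivals.append(clock)

# Count how many cars arrive in each one-hour window.
counts = [0] * hours
for arrival_time in arrivals:
    counts[int(arrival_time)] += 1

mean_gap = statistics.fmean(gaps)
mean_count = statistics.fmean(counts)
var_count = statistics.pvariance(counts)

print(f"mean gap between cars: {mean_gap:.4f} hours (1/rate = {1 / rate:.4f})")
print(f"cars per hour: mean = {mean_count:.2f}, variance = {var_count:.2f}")
```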

Assumptions:

  1. Events must occur at a constant rate
  2. Events must be independent of each other

Formulation:

A random variable “x” is said to follow an exponential distribution if its probability density function is

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$

where λ > 0 is called the exponential (rate) parameter.

Suppose we measure the life of a mobile phone. Then λ is the failure rate of the phone at time t, given that it has survived up to time t. For the exponential distribution this rate is constant over time.
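The mobile phone example can be sketched with the exponential density f(x) = λe^(−λx) and its survival function P(X > x) = e^(−λx). The failure rate of 0.5 per year is an assumed value, and the function names are hypothetical. The sketch shows the constant-rate property: the chance of surviving one more year is the same for a new phone and for one that has already survived three years.

```python
import math


def exp_pdf(x, lam):
    """Density of the exponential distribution with rate lam."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def exp_survival(x, lam):
    """P(X > x): probability that the lifetime exceeds x."""
    return math.exp(-lam * x) if x >= 0 else 1.0


lam = 0.5  # assumed failure rate: 0.5 failures per year

# Chance that a brand-new phone survives more than 1 year.
p_fresh = exp_survival(1.0, lam)

# Chance that a 3-year-old phone survives 1 more year: P(X > 4 | X > 3).
p_aged = exp_survival(4.0, lam) / exp_survival(3.0, lam)

print(f"new phone survives 1 more year:    {p_fresh:.4f}")
print(f"3-year-old survives 1 more year:   {p_aged:.4f}")
```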

Parameters:

For an exponential distribution with rate λ, the mean is 1/λ and the variance is 1/λ².

Log-normal Distribution:

A log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. If a random variable X is log-normally distributed, then f(X) = ln(X) follows a normal distribution. Conversely, if f(X) is normally distributed, then X = e^(f(X)) is log-normally distributed.

A random variable that is log-normally distributed only takes positive real values.
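The relationship between the two distributions can be sketched by exponentiating normal draws. The parameters μ = 1.0 and σ = 0.5 are arbitrary choices for illustration. The sketch checks that every log-normal value is positive, that taking logs recovers the original normal parameters, and that the sample mean is close to the known log-normal mean e^(μ + σ²/2).

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma = 1.0, 0.5  # parameters of the underlying normal (assumed values)

normal_draws = [random.gauss(mu, sigma) for _ in range(50_000)]
lognormal_draws = [math.exp(y) for y in normal_draws]

# A log-normal variable only takes positive values.
all_positive = all(x > 0 for x in lognormal_draws)

# Taking logs brings us back to the normal parameters.
logs = [math.log(x) for x in lognormal_draws]
mean_of_logs = statistics.fmean(logs)
sd_of_logs = statistics.pstdev(logs)

# The mean of the log-normal itself is exp(mu + sigma^2 / 2).
sample_mean = statistics.fmean(lognormal_draws)
theory_mean = math.exp(mu + sigma ** 2 / 2)

print(f"all positive: {all_positive}")
print(f"mean of logs = {mean_of_logs:.3f}, s.d. of logs = {sd_of_logs:.3f}")
print(f"sample mean = {sample_mean:.3f}, theoretical mean = {theory_mean:.3f}")
```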

In real life, many natural phenomena follow a log-normal distribution, such as:

  1. The length of comments posted in Internet discussion forums follows a log-normal distribution.
  2. Users’ dwell time on online articles (jokes, news) follows a log-normal distribution.
  3. In economics, there is evidence that the income of 97%–99% of the population is distributed log-normally.

Power-law and Pareto Distribution:

In statistics, a power law is a relationship in which a relative change in one quantity results in a proportional relative change in another quantity, regardless of the initial size of those quantities. For example, when the side of a square is doubled, its area is multiplied by four.

A power-law relationship has the form

$$y = k x^{a}$$

where x and y are the variables of interest, “a” is the law’s exponent, and “k” is a constant.
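The power-law form y = k·x^a can be checked with the square example: area is a power law in the side length with k = 1 and a = 2. The second part of this sketch uses arbitrary values of k and a to show the scale-free property: doubling x multiplies y by 2^a no matter where x starts. The function name `power_law` is an assumption.

```python
def power_law(x, k, a):
    """Evaluate y = k * x**a."""
    return k * x ** a


# Area of a square as a power law in its side length (k = 1, a = 2).
side = 3.0
area = power_law(side, k=1.0, a=2.0)
area_doubled = power_law(2 * side, k=1.0, a=2.0)
ratio = area_doubled / area  # 4.0

# Doubling x scales y by the same factor 2**a at every starting point.
ratios = [
    power_law(2 * x, k=5.0, a=1.5) / power_law(x, k=5.0, a=1.5)
    for x in (1.0, 10.0, 250.0)
]

print(f"doubling the side multiplies the area by {ratio}")
print(f"scaling factors for a = 1.5: {ratios}")
```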

A power law can describe a phenomenon in which a small number of items is clustered at the top (or bottom) of a distribution, taking up 95% of the resources. In other words, small occurrences are common, while large occurrences are rare.

A specific type of distribution that follows a power law is the Pareto distribution. The Pareto principle states that 80% of the effects come from 20% of the causes. For example, about 80% of the world’s wealth is held by 20% of the people. Similarly, during text preprocessing we often see that about 20% of the unique words account for roughly 80% of the word occurrences in a corpus.
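The 80/20 rule corresponds to one particular Pareto distribution. The sketch below relies on a standard closed-form result that is not stated in this article: for a Pareto distribution with shape parameter α > 1, the top fraction p of the population holds a share p^(1 − 1/α) of the total. The function name `top_share` is hypothetical. With α = log(5)/log(4) ≈ 1.16, the top 20% hold 80%.

```python
import math


def top_share(p, alpha):
    """Share of the total held by the top fraction p (requires alpha > 1)."""
    return p ** (1.0 - 1.0 / alpha)


# Shape parameter that reproduces the 80/20 rule.
alpha_80_20 = math.log(5) / math.log(4)
share = top_share(0.20, alpha_80_20)

# A larger shape parameter gives a thinner tail and less concentration.
share_thin_tail = top_share(0.20, 3.0)

print(f"alpha = {alpha_80_20:.3f}: top 20% hold {share:.1%}")
print(f"alpha = 3.000: top 20% hold {share_thin_tail:.1%}")
```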

Pareto Distribution:

The Pareto distribution is highly skewed and has a slowly decaying tail. It has two parameters: a shape parameter α (the tail index) and a scale parameter x_m. When the distribution is used to model wealth distribution, the parameter α is called the Pareto index.

The probability density function of the Pareto distribution is

$$f(x) = \begin{cases} \dfrac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}, & x \ge x_m \\ 0, & x < x_m \end{cases}$$

When plotted on linear axes, the distribution assumes the familiar J-shaped curve, which approaches each of the orthogonal axes asymptotically. All segments of the curve are self-similar (subject to appropriate scaling factors). When plotted in a log-log plot, the distribution is represented by a straight line.
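The straight line on a log-log plot follows directly from the Pareto density f(x) = α·x_m^α / x^(α+1) for x ≥ x_m: taking logs gives log f(x) = log(α·x_m^α) − (α + 1)·log x, a line with slope −(α + 1). The sketch below checks that slope numerically for assumed values α = 2 and x_m = 1; `pareto_pdf` is a hypothetical helper name.

```python
import math


def pareto_pdf(x, alpha, x_m):
    """Density of the Pareto distribution with shape alpha and scale x_m."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)


alpha, x_m = 2.0, 1.0  # illustrative parameter values
xs = [1.0, 10.0, 100.0, 1000.0]

# Points on the log-log plot: (log10 x, log10 f(x)).
log_points = [(math.log10(x), math.log10(pareto_pdf(x, alpha, x_m))) for x in xs]

# Slope between consecutive points; constant for a straight line.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(log_points, log_points[1:])
]

print(f"slopes on the log-log plot: {slopes}")
print(f"expected slope -(alpha + 1): {-(alpha + 1)}")
```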

Examples:

  1. Model the lifetime of a manufactured item with a certain warranty period.
  2. The size of meteorites.
  3. The standardized price returns on individual stocks.
