Probability Distributions in Machine Learning

Mansi Arora
8 min read · Sep 8, 2019


Distributions are an integral part of machine learning because they help us analyze data. Probability provides the theoretical foundation, whereas distributions help us visualize the data.

Prerequisites:

There are two types of variables:

  1. Discrete variable: a variable that takes distinct, countable values. For example, the number of students in a class or the number of test questions answered correctly.
  2. Continuous variable: a variable whose values are measured on a continuous scale. For example, the height of students in a class or temperature.

Topics covered:

  1. Gaussian /Normal Distribution
  2. Uniform Distribution
  3. Binomial Distribution
  4. Bernoulli Distribution
  5. Log Normal distribution
  6. Power Law Distribution

1. Normal Distribution (mean = μ, finite variance σ²)

It is also called the Gaussian distribution. In this distribution the data is symmetric about the mean, and values near the mean are more frequent than values far from it. Because of its shape, it is sometimes called the bell curve.

For a continuous random variable x, the probability density function (PDF) of the normal distribution is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Probability Density of Normal Distribution

where,

x = continuous random variable

σ = standard deviation and σ² = variance

μ = mean

Suppose μ = 0 and σ² = 1.

Then, ignoring the constant factors, the equation takes the form

y = exp(−x²/2), which has the same bell shape as y = exp(−x²).

Because μ = 0, the curve is centered at 0. As the variance increases, the curve becomes wider and its peak becomes lower, as shown by the red and orange lines in the figure.
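
To see the effect of the variance numerically, here is a minimal Python sketch of the normal density. The function name normal_pdf and the three variance values are illustrative choices and are not taken from the figure.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Peak height (the density at x = mu) for a few illustrative variances.
peaks = {var: normal_pdf(0.0, 0.0, math.sqrt(var)) for var in (0.5, 1.0, 2.0)}
for var, peak in peaks.items():
    print(f"variance={var}: peak height={peak:.4f}")
```

The peak height shrinks as the variance grows, which matches the flatter, wider curves described above.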

Properties of Normal distributions:

  1. As |x| increases, x² increases and y = exp(−x²) decreases, so the density falls off rapidly away from the mean.
  2. The curve is symmetric: its left side mirrors its right side, so it is never skewed.
  3. The curve follows the 68–95–99.7 rule: about 68% of the distribution lies within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.
68–95–99.7% Rule

4. Cumulative Distribution Function (CDF):

It is defined as the probability that X takes a value less than or equal to x, and it is calculated as the area under the PDF curve to the left of x.

In the figure, as the variance increases, the CDF rises more gradually around the mean (x = 0), as shown by the blue, red and yellow lines.
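
Both the CDF and the 68–95–99.7 rule can be checked numerically. The sketch below assumes Python 3.8+ (for statistics.NormalDist). It approximates P(X < x) as the area under the PDF with the trapezoid rule and then uses the CDF to recover the 68–95–99.7 figures. The helper names cdf_by_area and mass_within, the lower integration bound and the step count are illustrative assumptions.

```python
from statistics import NormalDist

def cdf_by_area(dist, x, lower=-10.0, steps=20000):
    """Approximate P(X < x) as the area under dist.pdf between `lower` and x.

    Uses the trapezoid rule. `lower` stands in for minus infinity, which is an
    assumption that only suits distributions with negligible mass below it.
    """
    width = (x - lower) / steps
    total = 0.5 * (dist.pdf(lower) + dist.pdf(x))
    for i in range(1, steps):
        total += dist.pdf(lower + i * width)
    return total * width

def mass_within(dist, k):
    """Probability of landing within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

std_normal = NormalDist(mu=0.0, sigma=1.0)
area = cdf_by_area(std_normal, 1.0)
print(f"area under PDF up to x=1: {area:.6f}")
print(f"built-in CDF at x=1:      {std_normal.cdf(1.0):.6f}")

for k in (1, 2, 3):
    print(f"within {k} sigma: {mass_within(std_normal, k):.4f}")

# A larger variance makes the CDF rise more slowly around the mean.
wide = NormalDist(mu=0.0, sigma=2.0)
print(f"CDF at x=1 with sigma=2:  {wide.cdf(1.0):.4f}")
```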

Standard Normal Variable (z):

z ~ N(μ = 0, σ² = 1)

where z is a random variable following a normal distribution with mean 0 and variance 1.

So, for x = {x1, x2, x3, …} following a normal distribution with mean μ and variance σ²,

x ~ N(μ, σ²)

the values are standardized by converting each xi in x using

xi’ = (xi − μ)/σ

so that now x’ ~ N(μ = 0, σ² = 1)
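
As a small worked example of standardization, the sketch below applies xi’ = (xi − μ)/σ to a made-up list of heights. The data values and the function name standardize are hypothetical, and σ is taken as the population standard deviation of the list.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical heights in cm: mean 165, standard deviation 10.
heights = [150.0, 160.0, 165.0, 170.0, 180.0]
z = standardize(heights)
print(z)
print(f"mean={mean(z):.3f}, std={pstdev(z):.3f}")
```

After the conversion the values have mean 0 and standard deviation 1, whatever the original units were.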

Q-Q plots:

It is a graphical technique used to determine whether a distribution is Gaussian (normal) or not.

Steps to make Q-Q plots:

  1. Sort all the xi’s in ascending order and compute their percentiles.
  2. Take Y ~ N(μ = 0, σ² = 1), a standard normal distribution, sort the yi’s in ascending order as well, and compute the same percentiles of the yi’s.
  3. Draw the Q-Q plot from the percentile values of x and y, with y (the theoretical quantiles) on the x-axis and x on the y-axis.
  4. If the points for all xi’s and yi’s lie approximately on a straight line, then the distribution on the y-axis (that of x) is Gaussian.
Q-Q plot construction
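
The steps above can be sketched in plain Python without a plotting library. This is only an illustration under some assumptions: it needs Python 3.8+ for statistics.NormalDist, it uses the plotting positions (i + 0.5)/n for the percentiles, and it summarizes "how straight the line is" with a Pearson correlation, which is a convenience added here rather than part of the Q-Q method itself. The names qq_points and pearson_r and both samples are hypothetical.

```python
import math
import random
from statistics import NormalDist

def qq_points(sample):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    observed = sorted(sample)
    n = len(observed)
    std_normal = NormalDist()
    # Positions (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, observed

def pearson_r(a, b):
    """Pearson correlation, used here as a rough measure of straightness."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(0)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(10.0, 2.0) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = pearson_r(*qq_points(normal_sample))
r_skewed = pearson_r(*qq_points(skewed_sample))
print(f"normal sample: r = {r_normal:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

Plotting the two lists returned by qq_points against each other gives the Q-Q plot. The Gaussian sample should hug a straight line much more closely than the skewed one.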

Limitations of Q-Q plot:

If the sample size is small, it is hard to tell whether the distribution is Gaussian or not.

2. Uniform Distribution

2.1 Discrete Uniform Distribution:

The probability mass function (PMF, which is defined for a discrete random variable) describes the case where a finite number of values are equally likely to be observed: every value has probability 1/n, where n is the number of possible values.

Notation: U{a, b} or unif{a, b}

where,

b ≥ a and n = b − a + 1

So, for a = 2 and b = 6, n = 5 and each value has probability 1/5.
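
A quick check of the a = 2, b = 6 case: with n = b − a + 1 = 5 possible values, each value should get probability 1/5. The function name discrete_uniform_pmf is a hypothetical helper, and exact fractions are used so the probabilities sum to exactly 1.

```python
from fractions import Fraction

def discrete_uniform_pmf(k, a, b):
    """P(X = k) for X uniform on the integers a, a+1, ..., b."""
    n = b - a + 1
    return Fraction(1, n) if a <= k <= b else Fraction(0)

a, b = 2, 6
pmf = {k: discrete_uniform_pmf(k, a, b) for k in range(a, b + 1)}
print(pmf)
print("total probability:", sum(pmf.values()))
```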

2.2 Uniform Continuous Distribution:

The probability density function (PDF, which is defined for a continuous random variable) describes a symmetric probability distribution in which all values in the interval [a, b] are equally likely. The density is constant at 1/(b − a) inside the interval and 0 outside it.

Notation: U(a, b)

where a and b are the minimum and maximum values.
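
For a continuous variable, 1/(b − a) is a density rather than a probability, so probabilities come from the length of an interval. A minimal sketch, with hypothetical helper names uniform_pdf and uniform_prob and illustrative values a = 2, b = 6:

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): constant inside [a, b], zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X ~ U(a, b): overlap length divided by (b - a)."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)

a, b = 2.0, 6.0
print(uniform_pdf(3.0, a, b))        # density inside the interval
print(uniform_prob(3.0, 4.0, a, b))  # probability of a unit-length sub-interval
print(uniform_prob(a, b, a, b))      # the whole interval
```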

3. Binomial Distribution and Bernoulli Distribution:

It is the discrete probability distribution with parameters n and p, where n is the number of independent trials and p is the success probability of each trial. Each trial has a Boolean outcome: yes (success) with probability p, or no (failure) with probability 1 − p.

Notation: X~B(n,p)

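Here, the probability mass function for a random variable X is defined for

n ∈ ℕ (number of trials) and p ∈ [0, 1]

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

where k successes occur with probability p^k, n − k failures occur with probability (1 − p)^(n − k), and the binomial coefficient counts the ways to choose which k of the n trials succeed:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

Binomial coefficient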
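
The binomial PMF is short enough to write directly. This sketch assumes Python 3.8+ for math.comb, and the values n = 10, p = 0.5 are only an example (ten fair coin flips).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): ways to place k successes times their probability."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(f"P(X = 5) = {probs[5]:.6f}")
print(f"sum over all k = {sum(probs):.6f}")
```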

4. Special case: when n = 1, the distribution is called the Bernoulli distribution. The random variable takes the value 1 (success) with probability p and the value 0 (failure) with probability q = 1 − p.

Probability mass function:

$$P(X = k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}$$
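
To make the "special case" concrete, the sketch below writes the Bernoulli PMF as p^k (1 − p)^(1 − k) and checks that it matches a binomial PMF with n = 1. The function names and the value p = 0.3 are illustrative, and math.comb needs Python 3.8+.

```python
from math import comb

def bernoulli_pmf(k, p):
    """P(X = k) for a single trial: p for k = 1 and 1 - p for k = 0."""
    if k not in (0, 1):
        raise ValueError("a Bernoulli variable only takes the values 0 and 1")
    return p ** k * (1 - p) ** (1 - k)

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p = 0.3
for k in (0, 1):
    print(k, bernoulli_pmf(k, p), binomial_pmf(k, 1, p))

# Mean and variance computed straight from the PMF: p and p * (1 - p).
mean = sum(k * bernoulli_pmf(k, p) for k in (0, 1))
variance = sum((k - mean) ** 2 * bernoulli_pmf(k, p) for k in (0, 1))
print(f"mean={mean:.2f}, variance={variance:.2f}")
```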

5. Log Normal Distribution:

It is the continuous probability distribution of a random variable whose logarithm is normally distributed. A log-normally distributed random variable takes only positive values.

ln(X) ~ N(μ,σ²)

The probability density function is similar to that of the Gaussian distribution:

$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad x > 0$$

Here, as the variance increases, the curve becomes wider.
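
The defining property, ln(X) ~ N(μ, σ²), is easy to check by simulation. The sketch below draws samples with random.lognormvariate from the standard library and looks at their logarithms. The seed, the sample size and the parameters μ = 0, σ = 0.5 are arbitrary choices for illustration.

```python
import math
import random
from statistics import mean, median, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable
mu, sigma = 0.0, 0.5
samples = [rng.lognormvariate(mu, sigma) for _ in range(10000)]

# Taking logs should give back (approximately) a normal sample with mean mu, std sigma.
logs = [math.log(s) for s in samples]
print(f"smallest sample: {min(samples):.4f} (always positive)")
print(f"mean of logs: {mean(logs):.3f}, std of logs: {stdev(logs):.3f}")

# The long right tail pulls the mean above the median.
print(f"sample mean: {mean(samples):.3f}, sample median: {median(samples):.3f}")
```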

Importance of log normal Distribution:

These types of distributions occur frequently in e-commerce, human behavior, science, technology and many other fields.

  1. The length of comments posted in Internet discussion forums follows a log-normal distribution.
  2. In computer networks and Internet traffic analysis, lognormal is shown as a good statistical model to represent the amount of traffic per unit time.

6. Power law Distribution:

A power law is a relationship in which a relative change in one quantity results in a proportional relative change in another quantity, i.e. one quantity varies as a power of the other.

In the graph below, the tail of the curve is very long, whereas most of the points are concentrated on the left. It follows the 80–20 rule.

The 80–20 rule is also called the Pareto rule, and a distribution that follows it is called a Pareto distribution or power law distribution. For example, 80% of the wealth of a society is held by 20% of its population.

Power Law Curve

PDF:

$$f(x) = \frac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}, \quad x \ge x_m$$

where x_m is the minimum possible value of x and α (alpha), a positive parameter, is called the tail index or shape parameter. When the distribution is used to model the distribution of wealth, this index is called the Pareto index.

As α tends to infinity, the distribution approaches a Dirac delta function, which is zero everywhere except at one point (here, at x = x_m = 1).

PDF of Pareto Distribution
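
The link between the Pareto PDF and the 80–20 rule can be made concrete. The sketch below uses the standard Pareto density with minimum value x_m and the known result that, for α > 1, the top fraction q of the population holds a share q^(1 − 1/α) of the total. The function names pareto_pdf and top_share are hypothetical, and α = log 5 / log 4 ≈ 1.16 is the value that gives exactly 80–20.

```python
import math

def pareto_pdf(x, alpha, x_m=1.0):
    """Pareto density: alpha * x_m^alpha / x^(alpha + 1) for x >= x_m, else 0."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)

def top_share(q, alpha):
    """Share of the total held by the top fraction q (needs alpha > 1 for a finite mean)."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1")
    return q ** (1.0 - 1.0 / alpha)

alpha_80_20 = math.log(5) / math.log(4)
print(f"alpha for the 80-20 rule: {alpha_80_20:.4f}")
print(f"share held by the top 20%: {top_share(0.2, alpha_80_20):.4f}")

# A larger alpha piles the density up at x_m, in line with the Dirac delta limit.
for alpha in (1.0, 2.0, 10.0):
    print(f"alpha={alpha}: pdf at x_m = {pareto_pdf(1.0, alpha):.1f}")
```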

Relation between log-normal distribution and Pareto distribution:

The Pareto distribution and log-normal distribution are alternative distributions for describing the same types of quantities. One of the connections between the two is that they are both the distributions of the exponential of random variables distributed according to other common distributions, respectively the exponential distribution and normal distribution.

How to determine whether a distribution is a Pareto distribution or not?

This is done by plotting log(x) on the x-axis against log(y) on the y-axis.

If the curve follows a straight line, then the distribution is a power law distribution.
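
Why a straight line? Taking logs of a power law y = C·x^(−k) gives log y = log C − k·log x, which is linear in log x. The sketch below checks this on the exact Pareto density (not on noisy data), where the slope should be −(α + 1). The helper fit_line is a plain least-squares fit written for this example, and α = 2 and the grid of x values are arbitrary.

```python
import math

def pareto_pdf(x, alpha, x_m=1.0):
    """Pareto density for x >= x_m, else 0."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

alpha = 2.0
xs = [1.0 + 0.5 * i for i in range(1, 40)]
log_x = [math.log(x) for x in xs]
log_y = [math.log(pareto_pdf(x, alpha)) for x in xs]

slope, intercept = fit_line(log_x, log_y)
print(f"slope = {slope:.4f} (expected {-(alpha + 1):.1f})")
print(f"intercept = {intercept:.4f} (expected log(alpha) = {math.log(alpha):.4f})")
```

With real data the points will scatter around the line, so the straightness is judged by eye or by the quality of the fit.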

How to convert a power law distribution to a Gaussian distribution:

Box Cox Transform:

A Box Cox transformation is a way to transform non-normal dependent variables into a normal shape.

At the core of the Box Cox transformation is an exponent, lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected. The “optimal value” is the one which results in the best approximation of a normal distribution curve. The transformation of Y has the form:

$$y(\lambda) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases}$$
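
For a single value and a given λ, the standard Box Cox formula, (y^λ − 1)/λ for λ ≠ 0 and ln(y) for λ = 0, can be sketched as follows. The function name box_cox is an illustrative choice, and the input must be positive.

```python
import math

def box_cox(y, lam):
    """Box Cox transform of one positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box Cox needs strictly positive data")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

print(box_cox(4.0, 0.5))     # square-root-like transform
print(box_cox(3.0, 1.0))     # lambda = 1 only shifts the data by -1
print(box_cox(math.e, 0.0))  # lambda = 0 is the natural log

# The lambda = 0 case is the limit of the general formula as lambda -> 0.
print(box_cox(5.0, 1e-8), math.log(5.0))
```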

For example:

Figure: Non-normally distributed cycle data

Figure: Box-Cox plot of the data

The Lambda value indicates the power to which all data should be raised. In order to do this, the Box-Cox power transformation searches from Lambda = -5 to Lambda = +5 until the best value is found.

The Box-Cox power transformation is not a guarantee for normality. This is because it does not actually check for normality; the method checks for the smallest standard deviation. The assumption is that among all transformations with Lambda values between -5 and +5, transformed data has the highest likelihood (but not a guarantee) to be normally distributed when standard deviation is the smallest. Therefore, it is absolutely necessary to always check the transformed data for normality using a probability plot.

Additionally, the Box-Cox power transformation only works if all the data is strictly positive. This, however, can usually be achieved easily by adding a constant c to all data such that it all becomes positive before it is transformed.
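
A minimal sketch of the search over Lambda from -5 to +5 follows. Note one deliberate substitution: instead of the smallest-standard-deviation criterion described above, this sketch scores each Lambda with the Box-Cox log-likelihood, which is the criterion commonly used in statistical software. The function names, the grid step of 0.1, the seed and the log-normal test data are all assumptions made for illustration.

```python
import math
import random

def box_cox(y, lam):
    """Box-Cox transform of one positive value."""
    if y <= 0:
        raise ValueError("Box-Cox needs strictly positive data")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def log_likelihood(data, lam):
    """Box-Cox log-likelihood of lambda (up to an additive constant)."""
    n = len(data)
    transformed = [box_cox(y, lam) for y in data]
    mean_t = sum(transformed) / n
    var_t = sum((t - mean_t) ** 2 for t in transformed) / n
    return -0.5 * n * math.log(var_t) + (lam - 1.0) * sum(math.log(y) for y in data)

def best_lambda(data, lo=-5.0, hi=5.0, step=0.1):
    """Grid search over [lo, hi] for the lambda with the highest log-likelihood."""
    count = int(round((hi - lo) / step))
    grid = [round(lo + i * step, 10) for i in range(count + 1)]
    return max(grid, key=lambda lam: log_likelihood(data, lam))

# Log-normal test data: its logarithm is normal, so lambda should land near 0.
rng = random.Random(1)
data = [rng.lognormvariate(0.0, 0.5) for _ in range(2000)]
lam = best_lambda(data)
print(f"selected lambda: {lam}")
```

If the data contains zeros or negative values, a constant c would first be added to every value, as described above, before calling best_lambda. As the text stresses, the transformed data should still be checked for normality afterwards, for example with a Q-Q plot.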
