PMF, PDF and CDF in Machine Learning

Random variables and the various distribution functions which form the foundations of Machine Learning

Murli Jadhav
Analytics Vidhya
4 min read · Sep 10, 2019


Table of contents

  • Introduction
  • Random Variable and its types
  • PDF (probability density function)
  • PMF (Probability Mass function)
  • CDF (Cumulative distribution function)
  • Example
  • Further Reading

Introduction

PDF and CDF are commonly used in exploratory data analysis to find the probabilistic relationship between variables.

Before going through the example on this page, first go through the fundamental concepts: random variable, PMF, PDF and CDF.

Random variable

A random variable is a variable whose value is not known in advance, i.e., its value depends on the outcome of a random experiment.

For example, when rolling a die, the value of the variable depends on the outcome of the roll.

Random variables are used throughout statistics, for example in regression analysis to determine the statistical relationship between variables. There are 2 types of random variable:

1 — Continuous random variable

2 — Discrete random variable

Continuous random variable:- A variable that can take any value within a range/interval, and therefore has infinitely many possible values, is called a continuous random variable. Put another way, variables whose values are obtained by measuring are continuous random variables. For e.g., the height of a person, the amount of rainfall

Discrete random variable:- A variable that takes a countable number of distinct values. Put another way, variables whose values are obtained by counting are discrete random variables. For e.g., the number of students present in a class
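The difference between the two types can be seen by sampling from each. The sketch below uses only the Python standard library; the function names and the height parameters (mean 170 cm, spread 10 cm) are assumptions made for illustration, not measured data.

```python
import random

def roll_die(rng):
    """Discrete random variable: one of the countable values 1..6."""
    return rng.randint(1, 6)

def measure_height_cm(rng, mean=170.0, spread=10.0):
    """Continuous random variable: any real value in a range.
    mean and spread are made-up illustration values, not real data."""
    return rng.gauss(mean, spread)

rng = random.Random(0)  # fixed seed so the run is repeatable
rolls = [roll_die(rng) for _ in range(5)]
heights = [measure_height_cm(rng) for _ in range(5)]
print("die rolls:", rolls)
print("heights (cm):", [round(h, 2) for h in heights])
```

The die rolls are always whole numbers from 1 to 6, while the heights are floating-point values that essentially never repeat exactly.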

PDF (Probability Density Function):-

Fig:- Formula for PDF

PDF is a statistical term that describes the probability distribution of a continuous random variable.

The PDF takes the shape of whatever distribution the variable follows. Many real-world features are approximately Gaussian distributed, and in that case the PDF is the familiar bell-shaped Gaussian curve. On a PDF graph, the probability of any single exact outcome is always zero. This is because probability is the area under the curve over an interval, and a single point corresponds to a line of zero width, which covers no area under the curve.
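In standard textbook notation, a function f is the PDF of a continuous random variable X when it is non-negative, has total area 1, and gives the probability of an interval as the area over that interval. The symbols f, a and b are introduced here for illustration:

```latex
f(x) \ge 0, \qquad
\int_{-\infty}^{\infty} f(x)\,dx = 1, \qquad
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx
```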
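The point about single outcomes can be checked numerically. The following is a minimal sketch using `statistics.NormalDist` from the standard library (Python 3.8+); the choice of a standard Gaussian (mean 0, standard deviation 1) and the helper name `interval_probability` are assumptions for illustration.

```python
from statistics import NormalDist

# Hypothetical Gaussian feature: mean 0, standard deviation 1.
dist = NormalDist(mu=0.0, sigma=1.0)

def interval_probability(d, a, b):
    """P(a <= X <= b): the area under the PDF between a and b."""
    return d.cdf(b) - d.cdf(a)

# The density at a point is a height of the curve, not a probability.
print("density at 0:", dist.pdf(0.0))

# Shrinking the interval around 0 drives the probability toward zero.
for half_width in (1.0, 0.1, 0.001, 0.0):
    p = interval_probability(dist, -half_width, half_width)
    print(f"P(-{half_width} <= X <= {half_width}) = {p:.6f}")
```

As the interval narrows, the printed probability shrinks, and for the zero-width interval it is exactly zero even though the density at that point is not.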

You can find deep insights on PDF and CDF here

PMF (Probability Mass Function):-

Fig:- Formula for PMF

PMF is a statistical term that describes the probability distribution of a discrete random variable.

People often get confused between PDF and PMF. The PDF applies to continuous random variables, while the PMF applies to discrete random variables. For e.g., rolling a die (the outcome can only be one of the numbers 1 to 6, which is countable).
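In standard textbook notation, the PMF assigns a probability to each individual value, and those probabilities sum to 1. The symbol p is introduced here for illustration:

```latex
p(x) = P(X = x), \qquad
p(x) \ge 0, \qquad
\sum_{x} p(x) = 1
```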
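For the die example, the PMF can be written down exactly and also estimated from simulated rolls. This is a rough sketch using only the standard library; the names `fair_die_pmf` and `empirical_pmf` are made up for illustration, and the die is assumed to be fair.

```python
from collections import Counter
from fractions import Fraction
import random

# Theoretical PMF of a fair six-sided die: each face has probability 1/6.
fair_die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def empirical_pmf(samples):
    """Estimate a PMF as the relative frequency of each observed value."""
    counts = Counter(samples)
    total = len(samples)
    return {value: counts[value] / total for value in sorted(counts)}

rng = random.Random(42)  # fixed seed so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(10_000)]
estimated = empirical_pmf(rolls)

# Print face, exact probability, estimated probability.
for face in range(1, 7):
    print(face, float(fair_die_pmf[face]), round(estimated.get(face, 0.0), 3))
```

Unlike a PDF, every value of a PMF is itself a probability, so the estimated frequencies can be compared directly with 1/6.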

CDF (Cumulative Distribution Function):-

Fig:- Formula for CDF

The PMF is one way to describe a distribution, but it is only applicable to discrete random variables and not to continuous random variables. The cumulative distribution function can describe the distribution of any random variable, whether it is continuous or discrete.

For example, if X is the height of a person selected at random, then F(x) is the chance that the person will be no taller than x. If F(180 cm) = 0.8, then there is an 80% chance that a person selected at random will be shorter than 180 cm (equivalently, a 20% chance that they will be taller than 180 cm).
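In standard textbook notation, the CDF accumulates probability up to a value x: it is a running sum of the PMF in the discrete case and a running integral of the PDF in the continuous case. The symbols F, p and f are introduced here for illustration:

```latex
F(x) = P(X \le x) =
\begin{cases}
\displaystyle \sum_{t \le x} p(t) & \text{if } X \text{ is discrete} \\[2ex]
\displaystyle \int_{-\infty}^{x} f(t)\,dt & \text{if } X \text{ is continuous}
\end{cases}
```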
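Both cases can be sketched in a few lines. The die is assumed to be fair, and the height model (a Gaussian with mean 170 cm and standard deviation 12 cm) is a hypothetical choice, picked only so that F(180) comes out close to the 0.8 used in the example above.

```python
from fractions import Fraction
from itertools import accumulate
from statistics import NormalDist

# Discrete case: the CDF of a fair die is the running sum of its PMF.
faces = list(range(1, 7))
pmf = [Fraction(1, 6)] * 6
die_cdf = dict(zip(faces, accumulate(pmf)))
print("P(die <= 4) =", die_cdf[4])

# Continuous case: heights modeled as a Gaussian with assumed parameters
# (mean 170 cm, standard deviation 12 cm are illustration values only).
height = NormalDist(mu=170.0, sigma=12.0)
shorter = height.cdf(180.0)   # F(180): chance of being at most 180 cm
taller = 1.0 - shorter        # complement: chance of being above 180 cm
print(f"F(180) = {shorter:.3f}, P(X > 180) = {taller:.3f}")
```

In both cases the CDF never decreases and reaches 1 at the top of the range, which is what makes it usable for discrete and continuous variables alike.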

Python example for PDF and CDF on Iris Dataset:-

The iris data set contains the following data:-

Fig:- Flower image from iris dataset

The detailed explanation of iris data-set is here

PDF On Iris:-

Fig:- PDF of petal length for species == 'setosa'

CDF on Iris:-

Fig:- CDF of petal length for iris_setosa

Both PDF and CDF visualisation:-

Fig:- PDF and CDF
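One common way to produce curves like these is to bin the values into a histogram, normalise the counts into fractions, and take their running sum. The sketch below is one possible version of that idea in plain Python, not necessarily the code behind the figures. The function name `histogram_pdf_cdf` is made up, and the data is a synthetic stand-in rather than the real Iris measurements.

```python
import random

def histogram_pdf_cdf(values, bins=10):
    """Return (bin right edges, per-bin fraction, cumulative fraction)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot build bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin instead of a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    total = len(values)
    pdf = [c / total for c in counts]   # fraction of points in each bin
    cdf, running = [], 0.0
    for p in pdf:
        running += p
        cdf.append(running)             # running sum of the fractions
    right_edges = [lo + width * (i + 1) for i in range(bins)]
    return right_edges, pdf, cdf

# Stand-in data: NOT the real Iris measurements, just synthetic values
# in a plausible range so the sketch runs without any dataset.
rng = random.Random(1)
petal_length = [round(rng.gauss(1.46, 0.17), 1) for _ in range(50)]

edges, pdf, cdf = histogram_pdf_cdf(petal_length)
for e, p, c in zip(edges, pdf, cdf):
    print(f"<= {e:.2f} cm  pdf={p:.2f}  cdf={c:.2f}")
```

Each `cdf` entry reads as "the fraction of flowers with petal length up to this edge". To work with the real data, the setosa petal length column would be passed in place of `petal_length`, and `edges` could be plotted against `pdf` and `cdf` with a plotting library.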

You will find the detailed explanation with Python code on GitHub here.
