Statistics Fundamentals for Machine Learning: Probability Distribution Function

Sachinsoni
8 min read · Oct 21, 2023


Algebraic Variable vs Random Variable

An algebraic variable is a symbol used in algebra to represent an unknown quantity. For example, in the equation 2x + 5 = 15, the variable x is an algebraic variable. Its value is fixed but unknown, and we can manipulate the equation to solve for it: subtracting 5 from both sides and dividing by 2 gives x = 5.

A random variable is a variable whose value is determined by the outcome of a random experiment; the set of all its possible values is its range. For example, consider rolling a fair six-sided die. The random variable X can represent the outcome of the roll, so X takes values in {1, 2, 3, 4, 5, 6}.

Types of Random Variable: There are two types:

(i) Discrete Random Variable: Takes countable values, such as natural numbers. For example, the outcome of rolling a fair die: X ∈ {1, 2, 3, 4, 5, 6}.

(ii) Continuous Random Variable: Takes values from a continuous range. A common example is the height of adult individuals. Heights vary continuously within a range, and there are infinitely many possible values between any two given heights.
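To make the distinction concrete, here is a quick Python sketch. The height parameters (mean 170 cm, standard deviation 8 cm) are illustrative assumptions, not taken from any real dataset:

```python
import random

random.seed(0)

# Discrete random variable: the outcome of one fair die roll,
# drawn from the finite set {1, ..., 6}.
discrete_sample = random.randint(1, 6)

# Continuous random variable: a height in cm drawn from a normal
# distribution (mean 170 and std dev 8 are made-up illustrative values).
continuous_sample = random.gauss(170, 8)
```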

Probability Distribution

A probability distribution is a list of all possible outcomes of a random variable along with their corresponding probability values.

Probability distribution of rolling a die

In many scenarios, the number of outcomes can be much larger, and a table would be tedious to write down. So instead we look for a mathematical function that relates each outcome to its probability; this is called a Probability Distribution Function.
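For the die example, such a function can be written directly in Python (using exact fractions so the probabilities sum to exactly 1):

```python
from fractions import Fraction

def die_pmf(x):
    """Probability distribution function for one fair six-sided die roll:
    each outcome in {1, ..., 6} has probability 1/6, everything else 0."""
    return Fraction(1, 6) if x in {1, 2, 3, 4, 5, 6} else Fraction(0)

# The probabilities over all outcomes sum to 1, as they must
# for any valid probability distribution.
total = sum(die_pmf(x) for x in range(1, 7))
```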

Types of Probability distribution: (On the basis of Random Variable)

Why are Probability Distributions important?

  1. Gives an idea about the shape and spread of the data.
  2. If the data follows a well-known distribution, we automatically know a lot about its behaviour.

Famous Probability Distribution Functions:

Types of Probability Distribution Function:

(1) Probability Mass Function

(2) Probability Density Function

  1. Probability Mass Function: Works on discrete random variables. For example, the probability mass function for rolling a fair die looks like the image below:
An example of a Probability Mass Function graph, where the X-axis represents the outcome and the Y-axis its corresponding probability

Cumulative Distribution Function (CDF) of a PMF:
In simpler terms, the CDF at a particular value x gives the probability that the random variable X takes on a value less than or equal to x. It sums up the probabilities of all possible outcomes up to and including x.

Let’s consider an example with a fair six-sided die:

PMF: P(X = 1) = 1/6 , P(X = 2) = 1/6, P(X = 3) = 1/6, P(X = 4) = 1/6,

P(X = 5) = 1/6, P(X = 6) = 1/6

To calculate the CDF for this PMF, we can determine the cumulative probabilities:

CDF: F(1) = P(X ≤ 1) = 1/6
F(2) = P(X ≤ 1) + P(X = 2) = 1/6 + 1/6 = 1/3
F(3) = P(X ≤ 2) + P(X = 3) = 1/3 + 1/6 = 1/2
F(4) = P(X ≤ 3) + P(X = 4) = 1/2 + 1/6 = 2/3
F(5) = P(X ≤ 4) + P(X = 5) = 2/3 + 1/6 = 5/6
F(6) = P(X ≤ 5) + P(X = 6) = 5/6 + 1/6 = 1

In this example, the CDF gives the cumulative probability of rolling a value less than or equal to a specific number on a fair six-sided die. For instance, F(3) is the probability of rolling a 1, 2, or 3, which is 1/2.
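The same running-sum idea can be sketched in Python: the CDF of the die is just the cumulative sum of its PMF values:

```python
from itertools import accumulate
from fractions import Fraction

# PMF of a fair die: P(X = k) = 1/6 for k = 1..6.
pmf = [Fraction(1, 6)] * 6

# The CDF is the running sum of the PMF: F(k) = P(X <= k).
cdf = list(accumulate(pmf))
# cdf is [1/6, 1/3, 1/2, 2/3, 5/6, 1]
```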

2. Probability Density Function: Works on continuous random variables.

A sample Probability Density Function, where the X-axis represents the value and the Y-axis represents probability density, which is not exactly a probability, because the X-axis values are continuous.

Important things about the above graph:-

  1. The Y-axis of the graph represents probability density, not probability: because the X-axis values are continuous, the probability of any single exact value of X, such as 7.91234, is essentially zero.
  2. Probability density does not give the probability of a single value; instead, it describes how concentrated the probability is around each point.
  3. The area under the curve between two numbers a and b gives the probability that X lies between a and b.
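As a concrete sketch, suppose (purely for illustration) that X follows a standard normal distribution. Then P(a ≤ X ≤ b), the area under the PDF between a and b, can be computed as a difference of two CDF values using the error function from the standard library:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, expressed via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def prob_between(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b): the area under the PDF between a and b,
    i.e. the difference of the CDF at the two endpoints."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
```

For a standard normal variable, about 68% of the probability mass lies within one standard deviation of the mean, which this function reproduces.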

How do we draw the graph of a Probability Density Function?
The main problem is how to calculate the probability density in the first place. To solve this, we use density estimation techniques.

Density Estimation:

It is a statistical technique used to estimate the PDF(Probability Density Function) of a random variable based on a set of observations or data. There are various methods for density estimation, including parametric and non-parametric approaches. Parametric methods assume that the data follows a specific probability distribution(such as a normal distribution) pattern, while non-parametric methods do not make any assumptions about the distribution and instead estimate it directly from the data.

Parametric Density Estimation :

Parametric density estimation is a method of estimating the probability density function (PDF) of a random variable by assuming that the underlying distribution belongs to a specific parametric family of probability distributions, such as the normal, exponential, or Poisson distributions.

Suppose you have a dataset containing the marks of a thousand students, and the histogram of the data closely resembles a normal distribution curve. You can then infer that your data follows a normal distribution. With this understanding, you calculate the mean and standard deviation of the dataset, and plug these statistical parameters into the equation of the normal distribution curve to derive the probability density function that characterizes your data.
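This two-step recipe — estimate the parameters, then plug them into the formula — can be sketched as follows. The marks below are a small made-up sample standing in for the thousand real ones:

```python
import statistics
from math import exp, pi, sqrt

# Hypothetical sample of student marks (a stand-in for the full dataset).
marks = [62, 71, 68, 74, 59, 66, 70, 65, 72, 63]

# Step 1: estimate the parameters of the assumed normal distribution.
mu = statistics.mean(marks)
sigma = statistics.stdev(marks)

# Step 2: plug them into the normal PDF formula.
def normal_pdf(x):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))
```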

For implementing Probability Density Function using parametric approach, see my GitHub.

Non-Parametric Density Estimation:

But sometimes the distribution is not clear, or it is not one of the well-known distributions.
Non-parametric density estimation is a statistical technique used to estimate the probability density function of a random variable without making any assumptions about the underlying distribution. It involves constructing an estimate of the probability density function directly from the available data, typically by creating a kernel density estimate.

Non-parametric density estimation has several advantages over parametric density estimation. The main one is that it does not require assuming a specific distribution, which allows for more flexible and accurate estimation when the underlying distribution is unknown or complex. However, non-parametric density estimation can be computationally intensive and may require more data to achieve accurate estimates compared to parametric methods.

Kernel Density Estimation (KDE): (A Non-Parametric approach)

The KDE technique involves using a kernel function to smooth out the data and create a continuous estimate of the underlying density function.

Here’s a step-by-step explanation of how KDE works:

  1. Data Preparation: First, the dataset is prepared by arranging the observed values in ascending order.
  2. Kernel Selection: A kernel function is chosen, typically a symmetric and smooth probability density function such as the Gaussian (normal) distribution. The kernel is a smooth, bell-shaped curve.
  3. Bandwidth Selection: The bandwidth parameter (for a Gaussian kernel, it plays the role of the standard deviation) is determined. This parameter controls the width of the kernel and affects the smoothness of the estimated PDF. A larger bandwidth results in a smoother estimate, while a smaller bandwidth captures finer details in the data.
  4. Kernel Placement: A kernel is placed at every data point, with each data point acting as the centre (mean) of its kernel; the bandwidth controls how wide or narrow each kernel is.
  5. Summation: The kernels are added up and combined to create the overall KDE curve. Points near a given location contribute more to the density there, and points far away contribute less.
  6. The resulting KDE curve represents the estimated probability density function (PDF). It shows where the data is likely to be more concentrated (peaks) and where it is less dense (valleys).
  7. Normalization: The KDE curve is scaled so that the total area under it equals 1, making it a valid probability density.
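The steps above can be condensed into a minimal from-scratch Gaussian KDE (the dataset here is made up purely for illustration):

```python
from math import exp, pi, sqrt

def gaussian_kernel(u):
    """Standard normal kernel: a smooth bell curve centred at 0."""
    return exp(-0.5 * u * u) / sqrt(2 * pi)

def kde(x, data, bandwidth):
    """Kernel density estimate at x: place one kernel on each data point,
    scale by the bandwidth, and average the contributions. Dividing by
    n * bandwidth keeps the total area under the curve equal to 1."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

# Small illustrative dataset: most points cluster near 5, one outlier at 8,
# so the estimated density should peak near 5.
data = [4.2, 4.8, 5.0, 5.1, 5.9, 8.0]
```

In practice you would use a library implementation such as `scipy.stats.gaussian_kde`, which also chooses a reasonable bandwidth automatically.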

For implementing KDE, see my GitHub.

Cumulative Distribution Function (CDF) of PDF:

Integrating the PDF (computing the area under the curve up to each point) gives the CDF, and differentiating the CDF (computing its slope at each point) recovers the PDF.
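This relationship can be checked numerically. The sketch below assumes a standard normal PDF, accumulates its area with the trapezoid rule to obtain the CDF, and then takes a slope of the CDF to recover the PDF value at 0:

```python
from math import exp, pi, sqrt

def pdf(x):
    """Standard normal PDF."""
    return exp(-0.5 * x * x) / sqrt(2 * pi)

# CDF by numerically accumulating the area under the PDF (trapezoid rule)
# on a grid from -6 to 6 with step 0.01.
xs = [i * 0.01 - 6.0 for i in range(1201)]
cdf = [0.0]
for i in range(1, len(xs)):
    step = (pdf(xs[i - 1]) + pdf(xs[i])) / 2 * 0.01
    cdf.append(cdf[-1] + step)

# Going back: the slope (finite difference) of the CDF at x = 0
# should recover the PDF value there.
mid = len(xs) // 2  # index of x = 0
slope = (cdf[mid + 1] - cdf[mid - 1]) / (xs[mid + 1] - xs[mid - 1])
```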

How to use PDF in Data Science ?

We conducted an analysis on the Iris dataset, examining four distinct parameters: sepal length, sepal width, petal length, and petal width. For each of these parameters, we generated KDE plots to visualize their respective distributions. The results of these KDE plots are as follows:

Upon analyzing the Iris dataset, we determined that the two most discriminative parameters for classifying the iris flowers into the categories of setosa, versicolor, and virginica are petal length and petal width. This choice is based on the following observations:

For Petal Length:
- Flowers with petal lengths less than 2.5 are identified as setosa.
- Those with petal lengths between 2.5 and 4.8 are classified as versicolor.
- Flowers with petal lengths greater than 6 are recognized as virginica.

For Petal Width:
- Setosa flowers have petal widths below 0.9.
- Versicolor flowers fall within the range of 0.9 to 1.2 for petal width.
- Petal widths exceeding 2 indicate virginica.

In contrast, when analyzing sepal length and sepal width, distinguishing between the flower categories is notably more challenging. The petal length and petal width parameters therefore serve as effective discriminators for separating the iris flowers into their respective categories. This approach leverages the Probability Density Function (PDF) to make informed classifications based on these two key parameters. In this way, the PDF supports practical data science work.
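The petal-length rule above can be sketched as a simple threshold classifier. The thresholds come from the observations listed earlier; for simplicity this sketch treats everything above the versicolor band as virginica, and the sample values in the test are hypothetical:

```python
def classify_by_petal_length(petal_length):
    """Rule-of-thumb classifier using the petal-length thresholds read
    off the KDE plots. Everything above the versicolor band is treated
    as virginica here, which is a simplifying assumption."""
    if petal_length < 2.5:
        return "setosa"
    elif petal_length <= 4.8:
        return "versicolor"
    else:
        return "virginica"
```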

Click here for GitHub link for the above discussion.

I hope this article was useful for you. If you found it helpful, please like and share it. Thank you for reading.
