Understanding Random Variables and Probability Distributions for Data Science/Machine Learning

Raushan Joshi
4 min read · Jan 23, 2023


Photo by Lucas Santos on Unsplash

Introduction

As you move further into Data Science and Machine Learning, you will encounter various mathematical terms and notations. Among the most common are random variables and probability distributions. In this article, I will explain these concepts as clearly as I can.

What is a random variable?

A random variable is a variable whose value is determined by the outcome of a random event or process. In other words, it’s a variable that can take on different values based on the outcome of a random experiment.

In mathematical terms: given a probability space (Ω, Σ, P), a random variable is a function that maps each outcome in the sample space (Ω) to a real number (R). In simple terms, a random variable assigns a real number to each outcome of an uncertain process, so that the uncertainty about the quantity it represents can be described numerically.

Therefore, in probability terms, we can say:

P(X = x) = P({ω ∈ Ω | X(ω) = x}), where X is the random variable, x is a value it takes, and ω ranges over the outcomes for which X(ω) = x.

Let’s understand this with the example of throwing a die twice. There are 36 possible outcomes: (1,1), (1,2), …, (6,6). Define X as the random variable representing the sum of the two throws. The sample space is the set of these 36 ordered pairs, while the set of possible values of X is {2,3,4,5,6,7,8,9,10,11,12}.

Thus, the probability that X = 4 is: P(X = 4) = #{(1,3), (2,2), (3,1)}/36 = 3/36 = 1/12
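The two-dice example can be checked by brute-force enumeration. The snippet below is a minimal sketch using only the Python standard library; the names `outcomes`, `X`, and `pmf` are illustrative choices, not part of any library.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs (first throw, second throw).
outcomes = list(product(range(1, 7), repeat=2))

# The random variable X maps each outcome to the sum of the two throws.
def X(outcome):
    return outcome[0] + outcome[1]

# Count how many outcomes map to each value of X.
counts = Counter(X(w) for w in outcomes)

# P(X = x) = (number of outcomes with X(w) = x) / (total number of outcomes).
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

print(pmf[4])  # 1/12
```

Using `Fraction` keeps the probabilities exact, so the result matches the hand calculation without rounding.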

There are two types of random variables: discrete and continuous.

A discrete random variable is one that can take on a countable number of distinct values. Examples of discrete random variables include the number of heads in a series of coin flips, the number of cars passing by a certain point in an hour, or the number of customers in a store at a given time.

A continuous random variable is one that can take on any value within a certain range, so its possible values form an uncountably infinite set. Examples of continuous random variables include the time it takes for a light bulb to burn out, the weight of a bag of sand, or the temperature of a room.
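To make the distinction concrete, here is a small sketch that draws one value of each kind with Python's built-in `random` module. The seed, the number of flips, and the mean lifetime of 1000 hours are arbitrary assumptions chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility

# Discrete: number of heads in 10 fair coin flips.
# The result is always one of the integers 0, 1, ..., 10.
heads = sum(rng.random() < 0.5 for _ in range(10))

# Continuous: a light-bulb lifetime drawn from an exponential distribution
# with rate 1/1000 (mean 1000 hours, an illustrative figure).
# The result can be any non-negative real number.
lifetime = rng.expovariate(1 / 1000)

print(heads, lifetime)
```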

Note: In Machine Learning, random variables are used to model the underlying processes generating the data and to make probabilistic predictions.

What is a probability distribution and how is it useful?

A probability distribution is a function that describes the likelihood of different outcomes of a random event or process. It is used to model the behavior of a random variable. In simple terms, it can be pictured as a graph of the probability (or, for a continuous variable, the probability density) against all possible values of the random variable.

Based on the type of random variable, we have:

A discrete probability distribution is used to describe a discrete random variable. Examples of discrete probability distributions include the binomial distribution, the Poisson distribution, and the geometric distribution.

A continuous probability distribution is used to describe a continuous random variable. Examples of continuous probability distributions include the normal distribution, the exponential distribution, and the uniform distribution.

Furthermore, the probabilities of all possible values of a discrete random variable sum to 1, and the probability density function of a continuous random variable integrates to 1 over the range of the variable.
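Both normalization properties can be checked numerically. The sketch below uses a binomial distribution for the discrete case and a standard normal density for the continuous case; the parameters `n = 10`, `p = 0.3`, the integration interval [-8, 8], and the midpoint rule are all illustrative assumptions.

```python
import math

# Discrete case: binomial PMF with n trials and success probability p.
n, p = 10, 0.3
binom_pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
total_discrete = sum(binom_pmf)

# Continuous case: normal density with mean mu and standard deviation sigma.
def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Midpoint-rule integration over [-8, 8]; the density outside this
# interval is negligible for the standard normal.
a, b, steps = -8.0, 8.0, 100_000
h = (b - a) / steps
total_continuous = sum(normal_pdf(a + (i + 0.5) * h) for i in range(steps)) * h

print(total_discrete, total_continuous)  # both very close to 1
```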

Out of various probability distributions, the normal probability distribution is the most widely used model in statistical analysis. One of its key properties is that it is symmetrical and bell-shaped. It is a continuous distribution too.

The normal distribution is also known as the Gaussian distribution and is written N(µ, σ²), where µ is the mean and σ² is the variance.
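For reference, the density behind the notation N(µ, σ²) can be written out as follows. The symbol f is just a label for the probability density function here.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right),
\qquad -\infty < x < \infty
```

The term (x − µ)² is what makes the curve symmetric around µ, and σ controls how wide the bell is.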

Note: Probability distributions are widely used in machine learning to model and analyze data, make predictions, and make decisions under uncertainty.

Expectation of a Random Variable and How to Calculate It

Given a random variable X, the expected value of X represents its long-run average value over a large number of observations or trials. In simple terms, it is the weighted average of its possible values, where the weights are the probabilities of each value.

In mathematical terms,
for a discrete random variable X, the expected value is:
E(X) = Σ x * P(X = x), summed over all possible values x.

And for a continuous random variable X with probability density function f(x):
E(X) = ∫ x * f(x) dx, integrated over the range of X.

The expectation of a random variable is a useful measure of the center of its distribution. Additionally, it’s important to note that the expectation exists only if the sum or integral that defines it converges absolutely.
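As a sketch of the continuous formula, the snippet below approximates E(X) for an exponential distribution, whose density is f(x) = λ·exp(−λx) for x ≥ 0 and whose exact mean is 1/λ. The rate `lam = 2.0`, the upper cutoff of 40, and the midpoint rule are assumptions made for illustration.

```python
import math

lam = 2.0  # rate parameter, an illustrative choice

# Density of the exponential distribution for x >= 0.
def f(x):
    return lam * math.exp(-lam * x)

# Approximate the integral of x * f(x) over [0, 40] with the midpoint rule.
# The contribution beyond 40 is negligible for this rate.
upper, steps = 40.0, 200_000
h = upper / steps
expectation = sum((i + 0.5) * h * f((i + 0.5) * h) for i in range(steps)) * h

print(expectation)  # close to 1 / lam = 0.5
```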

Let’s calculate the expected value of a random variable X that represents the outcome of a single throw of a die, with possible values {1,2,3,4,5,6}.
E(X) = (1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6) = 21/6 = 3.5
Also, E(X²) = (1*1/6 + 4*1/6 + 9*1/6 + 16*1/6 + 25*1/6 + 36*1/6) = 91/6 ≈ 15.17
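The same calculation can be reproduced in a few lines of Python, together with a quick simulation of the "long-run average" interpretation. This is a minimal sketch; the seed and the number of simulated throws are arbitrary assumptions.

```python
import random
from fractions import Fraction

values = range(1, 7)   # faces of a fair die
p = Fraction(1, 6)     # each face is equally likely

# Exact expectations as weighted sums.
e_x = sum(x * p for x in values)       # E(X)
e_x2 = sum(x * x * p for x in values)  # E(X^2)
print(e_x, e_x2)  # 7/2 91/6

# Long-run average: the mean of many simulated throws approaches E(X).
rng = random.Random(42)  # fixed seed for reproducibility
n_throws = 100_000
sample_mean = sum(rng.randint(1, 6) for _ in range(n_throws)) / n_throws
print(sample_mean)  # close to 3.5
```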

Note: The expectation of a random variable is used to understand the properties of a data distribution, which in turn helps in modeling a machine learning problem.

Thanks for reading. I hope the points above help you grasp these basic concepts, which you will deal with very often in your journey forward. Also, refer here for learning the Statistics concepts.
