Probability Theory: Random Variables & Probability Distribution Function

Ashish Arora
7 min read · Aug 3, 2023


As you know, based on sample data we try to draw conclusions about a population's parameters. To check the legitimacy of those conclusions, we rely on probability; and to connect statistics with probability, we rely on probability distributions, which in turn require the concept of random variables.

The goal of this post is to introduce you to random variables and build a solid foundation in probability distribution functions, as both are essential to constructing a probability distribution.

Before jumping into random variables, let's recall that the sample space is the big pool of all possible outcomes of an experiment, and an event is a subset of those outcomes.

Random Variables

A random variable is a function that assigns a numerical value to each outcome in the sample space of a random experiment.

Random variables are denoted by letters such as “X,” “Y,” or “Z.”

Let’s consider a concrete example to illustrate how a random variable maps the outcomes of a sample space to real numbers.

Example: Tossing Two Coins

Suppose we are interested in the experiment of tossing two fair coins. Each coin can land either heads (H) or tails (T). The sample space for this experiment consists of all possible combinations of the outcomes.

Sample Space (Ω): {HH, HT, TH, TT}

Now, let’s define a random variable X as follows:

  • X(HH) = 3
  • X(HT) = 1
  • X(TH) = 2
  • X(TT) = 0

In this case, the random variable X is mapping the outcomes of the sample space to real numbers: 3, 1, 2, and 0.

Once we have defined a random variable, we can apply probability theory to analyze the uncertainty associated with its outcomes.

P(X = 1) = 1/4, since only the outcome HT maps to 1 and each of the four outcomes is equally likely.

P(0 ≤ X ≤ 2) = 3/4, since three of the four outcomes (TT, HT, TH) map to a value in [0, 2].
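
As a quick sanity check, here is a minimal Python sketch (just an illustration using the mapping defined above) that reproduces these probabilities by counting outcomes:

    from fractions import Fraction

    # The random variable X defined above, as a mapping from outcomes to values.
    X = {"HH": 3, "HT": 1, "TH": 2, "TT": 0}

    def prob(event):
        # Probability of an event under equally likely outcomes.
        return Fraction(len(event), len(X))

    # P(X = 1): only HT maps to 1.
    print(prob([w for w, v in X.items() if v == 1]))        # 1/4

    # P(0 <= X <= 2): TT, HT, and TH map into [0, 2].
    print(prob([w for w, v in X.items() if 0 <= v <= 2]))   # 3/4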

Remember:

A random variable can assign a distinct value to each possible outcome in the sample space, or it can assign the same value to a whole set of outcomes. The choice depends on the nature of the random experiment and the level of detail needed for the analysis.

Types of Random Variables

Discrete Random Variable:

A discrete random variable is one that can only take on a countable number of distinct values. These values are typically represented by integers or a finite set of values.

Examples of discrete random variables include:

  • The number of heads obtained when flipping a coin.
  • The number of cars passing through a toll booth in an hour.
  • The outcome of rolling a six-sided die.

Continuous Random Variable:

A continuous random variable is one that can take on any value within a specified range or interval. These values are typically represented by real numbers.

Examples of continuous random variables include:

  • The height of a person.
  • The time it takes for a car to travel a certain distance.
  • The temperature recorded at a specific time.

Stepping Stone to a Probability Distribution

Random variables are of no use until we assign a probability to each of their possible values, because random variables do not inherently have probabilities associated with them.

These probabilities represent the likelihood of each outcome occurring when the random experiment is conducted.

Once we have assigned probabilities to the outcomes of a random variable, we can construct what is known as a probability distribution.

Probability Distribution

A probability distribution is a mathematical function or model that describes the likelihood of different outcomes or events occurring. It provides a systematic way to assign probabilities to the possible outcomes, allowing us to understand the relative likelihood of each.

In simpler terms, a probability distribution provides a complete summary of the probabilities of all possible values that the random variable can take.

In the six-sided die example above, the probability distribution of X is uniform, since all outcomes have equal probability. It can be represented as follows:

X:        1    2    3    4    5    6
P(X = x): 1/6  1/6  1/6  1/6  1/6  1/6

You might be thinking that a distribution should be represented graphically rather than in tabular form.

We could represent it in either format. Here, the number of possible outcomes is small, so a table was the more convenient choice.

However, the number of possible outcomes can be very large or even infinite (as with a continuous random variable). To save ourselves the tedious task of tabulating them all, we formulate a mathematical function that models the relationship between outcomes and probabilities, and plot it.

Probability Distribution Function

The probability distribution function for a discrete random variable is known as the Probability Mass Function (PMF), whereas for a continuous random variable it is known as the Probability Density Function (PDF).

Although the role of both functions is to map a probability to each value of the random variable, they differ in how they are calculated.

Probability Mass Function (PMF)

The calculation of a PMF is simple and straightforward, and is based entirely on the counting principle: since a discrete random variable takes on only a countable set of distinct values, we can count outcomes directly.

P(X = x) = (number of outcomes mapped to x) / (total number of outcomes in the sample space), assuming all outcomes are equally likely.

For example, when a four-sided die is rolled twice and X is the sum of the two rolls, the probability mass function is:

Sum (X):   2     3     4     5     6     7     8
P(X = x):  1/16  2/16  3/16  4/16  3/16  2/16  1/16
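
Here is a small Python sketch that derives this table by enumerating all 16 equally likely outcomes (standard library only):

    from itertools import product
    from collections import Counter
    from fractions import Fraction

    # All 16 equally likely outcomes of rolling a four-sided die twice.
    outcomes = list(product(range(1, 5), repeat=2))

    # X = sum of the two rolls; count how many outcomes map to each sum.
    counts = Counter(a + b for a, b in outcomes)

    # P(X = x) = outcomes mapped to x / total outcomes
    # (Fraction reduces 2/16 to 1/8, 4/16 to 1/4, etc.)
    for x in sorted(counts):
        print(x, Fraction(counts[x], len(outcomes)))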

Probability Density Function (PDF)

A PDF is used for continuous random variables.

Unlike the PMF, which gives probabilities directly, the PDF represents the relative likelihood of the random variable falling within a given range or interval.

There are two main reasons for this:

  1. A continuous variable has an infinite number of possible values within any range or interval. For example, a distance measurement can take infinitely many values within a range.
  2. As a result, the probability assigned to any single point is infinitesimally small, effectively zero.

Hence, to work with probabilities for a continuous random variable, we determine the probability of the variable falling within a specific range.
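
As a minimal illustration, assuming scipy is available and using made-up parameter values for a height distribution:

    from scipy.stats import norm

    # An illustrative continuous random variable: height ~ Normal(170, 10).
    height = norm(loc=170, scale=10)

    # The PDF gives a density, not a probability; P(X = exactly 170) is 0.
    print(height.pdf(170))                    # ~0.0399, a density value

    # Probabilities come from ranges: P(160 <= X <= 180) = CDF(180) - CDF(160).
    print(height.cdf(180) - height.cdf(160))  # ~0.683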

Density estimation methods are used to calculate the probability density functions.

Density estimation methods

Density estimation is the process of estimating the underlying probability density function (PDF) of a random variable based on a given set of data points.

There are various methods for density estimation, including parametric and non-parametric approaches.

Parametric Density Estimation:

  • Parametric methods assume a particular form for the underlying probability distribution.
  • In this approach, the PDF is defined by a set of parameters, such as the mean and standard deviation, and we estimate the values of these parameters to fit the data.
  • Examples of commonly used parametric distributions include the Gaussian (normal) distribution, the binomial distribution, etc.

Suppose we assume that our distribution looks similar to a normal distribution. Here is what we do:

  1. We calculate the mean and standard deviation of our sample dataset.
  2. We supply them to the parametric Gaussian (normal) distribution formula.
  3. Using that formula, we build our probability density function.
  4. These well-known distributions come with predefined probability tables, from which we can read off the probability for a given range (a sketch of this workflow follows below).
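
Here is a rough Python sketch of this workflow, assuming scipy is available and using made-up sample data; scipy's CDF plays the role of the printed probability table:

    import numpy as np
    from scipy.stats import norm

    # Made-up sample data that we assume is roughly normal.
    data = np.array([4.9, 5.1, 5.0, 4.8, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9])

    # Steps 1-2: estimate the parameters and plug them into the Gaussian model.
    mu, sigma = data.mean(), data.std(ddof=1)
    fitted = norm(loc=mu, scale=sigma)

    # Step 4: probability of a range; the CDF replaces the table lookup.
    print(fitted.cdf(5.2) - fitted.cdf(4.8))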

Non-Parametric Estimation:

  • Sometimes, however, the shape of the distribution is unclear, or it does not match any well-known distribution.
  • Nonparametric density functions do not assume a predefined form for the underlying probability distribution. Instead, they aim to estimate the density directly from the data.
  • Nonparametric methods are more flexible and can capture complex distributions without relying on predefined assumptions. Commonly used techniques include kernel density estimation (KDE), histogram estimation, etc.

We will not go into the intricacies of how a KDE is calculated, as that is beyond the scope of this post.

Once we have the density, we can find the probability of a range by integrating the area under the curve over that range.
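
As a rough sketch, assuming scipy is available and using made-up data, scipy.stats.gaussian_kde can estimate the density and handle the numerical integration for us:

    import numpy as np
    from scipy.stats import gaussian_kde

    # Made-up sample whose distribution we don't want to assume.
    data = np.array([1.1, 1.9, 2.3, 2.8, 3.1, 3.4, 4.0, 4.2, 5.5, 6.1])

    # Estimate the density directly from the data.
    kde = gaussian_kde(data)

    # P(2 <= X <= 4): area under the estimated density between 2 and 4.
    print(kde.integrate_box_1d(2, 4))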

However, calculating the probability by directly integrating the probability density function (PDF) can be cumbersome and often impractical.

Hence, we mostly use parametric methods, as they come with predefined probability tables.

Conclusion

In this post, we built a solid foundation in probability theory, random variables, and probability distribution functions. Understanding these concepts lays the groundwork for comprehending different types of distributions, which will be explored in upcoming posts. Stay tuned for more insights!

I hope this post helped you understand these concepts.

Thanks, keep learning and supporting me in reaching more people!

Happy learning!

About Me:

I am working as a Production Head at India Woodline Ltd, and I have been passionate about fair and explainable AI and Data Science since 2020. I hold a postgraduate diploma in Data Science from IIIT Bangalore and several Data Science and AI certifications. I have a sound foundation in supply chain, customer, and retail analytics.

Feel free to find me on:

Github

LinkedIn
