Intro to Probability Theory: What is a Random Variable?

Hussain Anwaar
5 min read · Aug 17, 2023

The random variable, probability distribution, sample space & more!

https://unsplash.com/photos/cQYNJe7kP6Q

Statistics uses probability theory to build methods that help us explain the structure of our data. If we look at any of the ideas that form the backbone of modern data analysis, such as the Central Limit Theorem, confidence intervals, or regression, we encounter terms like “random variable” and “probability distribution”. Have you ever wondered what a random variable, a probability, or a distribution actually is? Why do these terms come up whenever we read literature on statistics or probability theory? What do we mean when we say a random variable follows a certain distribution (or probability distribution)? More importantly, why does it matter in the practical world of data?

In this post, I’ll explain what a random variable and a probability distribution are. We’ll work through an example with real data to demonstrate these concepts and do a tiny bit of hands-on coding for data analysis. Then we’ll look at the different types of random variables and distributions.

What is a Random Variable?

It is a variable used to model the outcome of an experiment. It takes values from the sample space (the set of possible outcomes), and a probability distribution describes how likely each value is.

Not simple enough, huh!

Let’s try one more time. Anything you observe in real life that is tied to some phenomenon, can take a finite or infinite set of values, and whose value you cannot know for certain in advance is a “random variable”. Let’s start with the textbook example of a coin toss. If you want to observe the result of a coin toss, you model it with a random variable. Why? Because any time you throw a coin in the air, you get either “Heads” or “Tails”, each with some chance (probability) attached to it. If I ask you, “What is it going to be, heads or tails?”, what would you say? Probably, “I don’t know. It could be either heads or tails, with a 50/50 chance, right?”

Here I want to introduce a few terms that you might have heard of in probability courses.

  1. Experiment: Any phenomenon that you observe, or want to observe, is the experiment. Example: “Observing the coin toss.”
  2. Outcome: A value that you can see each time you observe the phenomenon. Example: “Observing the coin toss, I can see either a head or a tail, so head and tail are the outcomes.”
  3. Sample Space: The set of all possible outcomes of your experiment. Example: “Observing the coin toss, I can see either a head or a tail, so my sample space is [Head, Tail].”
  4. Probability: In inferential statistics, probability is the long-run relative frequency of the outcome of interest. In simple words, it is the percentage of the time we observe the outcome of interest if we repeat the experiment long enough. Example: “If I toss a fair coin a hundred times and heads appears about 50 times, then the estimated probability of heads is about 50/100 = 50%.”
  5. Probability Distribution: A function that assigns a probability to each outcome of your experiment. Example: “Observing the coin toss, I can see either a head or a tail, so my sample space is [Head, Tail]. I observe each of them with a 50% chance, so the probability distribution is [Prob of Head, Prob of Tail] = [50%, 50%].”
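
To make the "long run" idea concrete, here is a minimal sketch that simulates coin tosses and reports the share of heads and tails. The function name simulate_coin_tosses and the seed are my own choices for illustration, and the coin is assumed to be fair.

```python
import random


def simulate_coin_tosses(n_tosses, seed=0):
    """Toss a fair coin n_tosses times and return the share of each outcome."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample_space = ["Head", "Tail"]
    counts = {outcome: 0 for outcome in sample_space}
    for _ in range(n_tosses):
        counts[rng.choice(sample_space)] += 1
    # Relative frequency of each outcome = estimated probability distribution
    return {outcome: count / n_tosses for outcome, count in counts.items()}


# The estimates should settle near 0.5 as the number of tosses grows
for n in (10, 100, 10_000):
    print(n, simulate_coin_tosses(n))
```

With only 10 tosses the shares can be far from 50/50. With many more tosses they tend to settle close to it, which is what "repeating the experiment long enough" means.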

I hope the picture is a bit clearer by now. Since we live in the world of data, let’s follow an example from data and see what we can learn from it.

Let’s say you work at a gym and have data available that contains each member’s member_id, status (whether they are a regular member or not), and height in inches.

This data could come from an experiment in which you simply recorded each member’s id, gym status, and height. Now, let’s answer the same questions we answered for the coin toss, this time using the gym data.

Q.1: What is Pr(status = ‘Regular Member/R’)?
Remember that probability is the long-run relative frequency of an outcome. All we have to do is count the number of “regular” members and divide by the total number of members.

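# df is the gym DataFrame described above (columns: member_id, status, height)
regular_count = (df["status"] == "R").sum()
total_count = len(df)
print(f"Regular Member count: {regular_count}")
print(f"Total Member count: {total_count}")
print(f"Pr(Status = 'Regular Member - R') = {regular_count / total_count}")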
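
If you don’t have the gym data at hand, the sketch below builds a tiny stand-in table so the calculation can be run end to end. The five rows are made up purely for illustration and are not the real gym data; only the column names follow the description above.

```python
import pandas as pd

# Hypothetical toy data standing in for the gym table; the values are made up.
df = pd.DataFrame({
    "member_id": [1, 2, 3, 4, 5],
    "status": ["R", "NR", "NR", "R", "NR"],
    "height": [64.5, 70.1, 68.0, 66.2, 72.3],
})

# A boolean column averages to the share of True values,
# so this is the proportion of regular members.
pr_regular = (df["status"] == "R").mean()
print(f"Pr(status = 'R') = {pr_regular}")
```

Taking the mean of the boolean comparison is a shortcut for "count the regular members and divide by the total".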

Q.2: What is the sample space for a member’s status?
Remember that the sample space is the set of all possible outcomes. Since we know that there are only regular and non-regular members in the gym, the sample space is: [‘R’, ‘NR’]

Q.3: What is the probability distribution for a member’s status?
A probability distribution assigns a probability (chance of occurrence) to each outcome of our experiment. To estimate it from our data, we just need to find the proportion of regular members and the proportion of non-regular members.

import matplotlib.pyplot as plt
import seaborn as sns

# Count members per status, then convert the counts to proportions
status_pdf = df.groupby("status").agg(counts=("status", "count")).reset_index()
status_pdf["probability"] = status_pdf["counts"] / status_pdf["counts"].sum()
status_pdf
sns.barplot(data=status_pdf, x="status", y="probability")
plt.title("Probability Distribution of Gym Status")
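
As a cross-check on the table of proportions, pandas can produce the same numbers in one call. This is a minimal sketch on made-up status values (not the real gym data), and it also confirms that the probabilities add up to 1, as they must for any probability distribution.

```python
import pandas as pd

# Hypothetical status values, made up for illustration.
status = pd.Series(["R", "NR", "NR", "R", "NR"], name="status")

# normalize=True returns proportions instead of raw counts.
status_distribution = status.value_counts(normalize=True)
print(status_distribution)

# A valid probability distribution sums to 1 over the sample space.
total_probability = status_distribution.sum()
print(f"Total probability: {total_probability}")
```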

Now we can use this information to make some inferences about the gym. It seems that most of the members are “Non Regular”, so you could work on helping members become more regular, for example by introducing fitness programs or promotions, to help grow the business.

You can see how probability, random variables, and distributions form the backbone of even a simple data analysis. The “Member Status” variable above is called a “Discrete Random Variable” because it can only take a countable set of values (here, just two), and the distribution used to describe it is called a “Discrete Probability Distribution”. If you noticed, the data had another variable called height, which is continuous, meaning it can take infinitely many values within a range. For example, people can have a height of 55.6, 55.69, or 56.2 inches, and so on. Such variables are called “Continuous Random Variables”, and the distributions used to represent them are called “Continuous Probability Distributions”.
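
One practical difference shows up as soon as you try to compute a probability for a continuous variable: you ask about a range of values, not a single exact value. The sketch below uses simulated heights to show this. The normal shape, the mean of 67 inches, and the spread of 3 inches are assumptions I made up for illustration, not properties of the gym data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical heights in inches; the mean and spread are made-up assumptions.
heights = rng.normal(loc=67.0, scale=3.0, size=10_000)

# For a continuous variable we ask about ranges, not exact values.
pr_between_65_and_70 = np.mean((heights >= 65) & (heights <= 70))
pr_exactly_67 = np.mean(heights == 67.0)

print(f"Pr(65 <= height <= 70) is about {pr_between_65_and_70:.3f}")
print(f"Share of heights exactly equal to 67.0: {pr_exactly_67}")
```

The share of simulated heights that land exactly on 67.0 is essentially zero, while a range such as 65 to 70 inches captures a sizeable share of members.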

In the next article, we will discuss continuous probability distributions and some other terms that describe the structure of a distribution, like PDF (no, not that PDF!), PMF, and CDF.
