Inferential Statistics 101 — part 1

Shweta Doshi
Published in GreyAtom
Apr 2, 2018 · 8 min read

Sampling distribution of sample mean and Central limit theorem

Joey, a manager at a well-known e-commerce company, is faced with some key business decisions that he believes are important for increasing the company's revenue. Since most of these decisions are data driven, he decides to hire a data scientist to assist him. After conducting a series of interviews, he hires Chandler, a young and enthusiastic data scientist, onto his team.

Joey: Hi Chandler, welcome to our team!

Chandler: Thank you Joey. In fact, it’s my pleasure to work with you.

Joey: Ok Chandler, without any further delay, let's get started.

Joey: Our first project is to answer the question “Does the background colour of our website play any role in the number of clicks?” or in other words “Will the number of clicks increase if we change the background colour?”

Chandler: Sounds interesting! Did you check the number of clicks by changing the background colour of the website?

Joey: Yes, in fact we did more than that. We conducted an experiment by changing the colour and deploying the website for 8 hours, and recorded the average clicks per hour. And these are our findings:

Number of clicks before changing the colour: 1500/hour

Number of clicks after changing the colour: 1513/hour

Joey: The data shows that the average number of clicks has increased after changing the colour of the website. So I think it’s a good idea to change it. What do you think?

Chandler: It is just a sample. You can’t make a decision by just comparing the sample averages. You may get a different sample when you rerun your experiment. Why don’t you carry out the experiment again and record the number of clicks?

(Joey asks the IT team to rerun the experiment and record the number of clicks)

Joey: Chandler, it turns out that you were right! We reran the experiment as you suggested and recorded the number of clicks, and here are the results:

The average number of clicks has gone down to 1441/hour.

So, how should we make a decision in such a scenario?

Chandler: That is where inferential statistics comes in handy. It is the branch of statistics that helps us make decisions based on a sample.

Joey: Interesting! Can you explain more about it?

Chandler: Sure. But before getting into inferential statistics, I need to explain its building block, the “central limit theorem”, which goes as follows:

“The aggregation of a sufficiently large number of independent random variables results in a random variable that approximately follows a normal distribution.”

Chandler: Don’t panic. I understand that I have thrown a lot of jargon at you; let’s break it down and understand it.

Variable: A variable is an entity that can take more than one value.

Non-random variable: It is a variable whose value can be set by us. Examples: the temperature of a microwave, the number of clothes purchased this month, etc. These variables can take more than one value, but we can set them according to our intent.

Random variable: It is the opposite of a non-random variable. It takes a value from a set of possible outcomes, meaning we can only measure or observe its value. Example: if you toss a coin, you don’t know the outcome of a particular toss before tossing it, but you do know that the outcome can be either head or tail.

To explain the independence part, consider the experiment of tossing a coin and rolling a die one after another. Let X be the outcome of the coin toss and Y be the outcome of the die roll. Clearly, X and Y are random variables. Here, X doesn’t get influenced by the value of Y and vice versa. This is what is technically called “independent random variables”.
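Here is a quick sketch (my own, not from the original article) that makes independence concrete: simulate many coin tosses and die rolls, and check that the joint probability of “heads and a six” is roughly the product of the two individual probabilities.

```python
# Empirical check of independence between a coin toss and a die roll
# (illustrative sketch; the variable names are my own).
import random

random.seed(0)
trials = 100_000

coin = [random.choice(["H", "T"]) for _ in range(trials)]
die = [random.randint(1, 6) for _ in range(trials)]

# Marginal probabilities estimated from the simulation
p_heads = sum(c == "H" for c in coin) / trials
p_six = sum(d == 6 for d in die) / trials

# Joint probability of "heads and a six"
p_both = sum(c == "H" and d == 6 for c, d in zip(coin, die)) / trials

print(round(p_heads * p_six, 4))  # ~0.083, product of the marginals
print(round(p_both, 4))           # ~0.083, so the joint matches the product
```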

Joey: Interesting! So you mean to say that the number of visitors to our website is also a random variable whose value cannot be fixed by any means?

Chandler: Exactly! You are absolutely correct.

Joey: But is there any way to quantify the outcome of a random variable?

Chandler: Yes there is. We use something called probability distribution function (PDF) to quantify it. It assigns probability values to all the possible outcomes of a random variable.

For example, the PDF of the random variable X (the outcome of a fair coin toss) is given by P(X = Head) = 1/2 and P(X = Tail) = 1/2.

Hence, by using the PDF, we can quantify the outcome of a random variable.
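As a small illustration (my own, not part of the article), the coin’s distribution can be written out explicitly as a mapping from outcomes to probabilities:

```python
# The probability distribution of a fair coin toss, written out explicitly.
pmf_coin = {"Head": 0.5, "Tail": 0.5}

# Every possible outcome gets a probability, and the probabilities sum to 1.
assert abs(sum(pmf_coin.values()) - 1.0) < 1e-12
print(pmf_coin["Head"])  # 0.5
```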

Joey: Now I understand what the central limit theorem is trying to convey. If we sum/average (aggregate) a large number of independent random variables, the resultant random variable approximately follows a normal distribution (a special type of distribution). Am I right?

Chandler: Exactly! Individual random variables can follow any distribution. But the moment we start aggregating them, the resulting random variable approximately follows a normal distribution.

Joey: That’s unbelievable Chandler! But I find it really hard to believe this theorem.

Chandler: Okay, let me walk you through a simple simulation to make this theorem clear.

Let’s assume you are rolling a die and let X1 be the random variable representing the outcome of this roll.

X1 can take any value from 1 to 6 with equal probability, as shown in the figure below. (We know that the probability of rolling a specific number on a six-sided die is 1/6.)

Now let’s assume we roll two dice simultaneously and the sum of the two dice rolls is our new random variable X2. This random variable can be seen as an aggregation of X1’s, i.e. X1 + X1 (the sum of two independent rolls).

X2 can take any value from 2 to 12 with a probability distribution as shown in the figure below.

X3 and X4 are defined along the same lines as

X3 — Sum of three dice outcomes. It is nothing but X1+X1+X1 (i.e., aggregation of three random variables)

X4 — Sum of four dice outcomes. It is nothing but X1+X1+X1+X1 (i.e., aggregation of four random variables).

The pdfs for all the random variables are calculated in the table below. It is not very difficult to enumerate them. For example, let’s take the value in the sixth row and third column of the table; it answers the question “what is the probability of getting a sum of 6 on rolling two dice?”

There are five possible ways to get a sum of six on rolling two dice, namely (1,5), (2,4), (3,3), (4,2) and (5,1), and the total number of possible outcomes is 36. Therefore the probability of getting a sum of six when we roll two dice is 5/36 ≈ 0.139.

Similarly, we can enumerate all the other values.
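If you want to reproduce the table yourself, here is a short sketch (the function name is my own) that enumerates every combination of dice and counts how often each sum occurs, giving the exact pmfs of X1 through X4:

```python
# Exact pmfs of the sum of 1, 2, 3 and 4 fair six-sided dice by enumeration.
from collections import Counter
from fractions import Fraction
from itertools import product

def pmf_of_dice_sum(num_dice):
    """Exact pmf of the sum of `num_dice` fair six-sided dice."""
    counts = Counter(sum(roll) for roll in product(range(1, 7), repeat=num_dice))
    total = 6 ** num_dice
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}

print(pmf_of_dice_sum(2)[6])   # 5/36, matching the enumeration above

for n in range(1, 5):          # pmfs of X1, X2, X3 and X4
    print(n, pmf_of_dice_sum(n))
```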

Figure A: Plot of the pdfs of the random variables X1, X2, X3 and X4, with the sum of the dice outcomes along the x axis and the corresponding probability along the y axis.

From the figure we infer that irrespective of the parent distribution, which happens to be uniform in this case (the pdf of X1), the distribution of the sum of the random variables converges to a normal distribution. The same argument holds true for averages as well.
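To see the same effect for averages, here is a rough Monte Carlo sketch (my own example, not the article’s figure): as the number of dice n grows, the averages pile up in a bell shape around 3.5.

```python
# Text histogram of the average of n dice rolls, for increasing n.
import random
from collections import Counter

random.seed(1)

def average_of_rolls(n):
    """Average of n fair six-sided dice rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (1, 2, 10, 30):
    means = [average_of_rolls(n) for _ in range(50_000)]
    buckets = Counter(round(m * 2) / 2 for m in means)  # 0.5-wide bins
    print(f"n = {n}")
    for value in sorted(buckets):
        print(f"  {value:4.1f} | {'#' * (buckets[value] // 1000)}")
```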

Joey: So you are saying that the average number of clicks per hour which we obtained is a random variable, and it approximately follows a normal distribution?

Chandler: Yes, and that is the reason why you got different numbers when you ran the same experiment twice (1513 in the first run and 1441 in the second). Every time we run the experiment, we get a sample (also known as a realization of the random variable) from a distribution which we assume to be approximately normal.

Joey: Okay. In the dice example, we knew the original distribution of the observations (i.e. the pdf of X1, which is uniform). So we could calculate the pdfs of the other random variables X2, X3 and X4, which are functions of X1. We have also seen that the approximation becomes more accurate when we aggregate a larger number of random variables. But in our dataset (the e-commerce dataset), we don’t know the original distribution of the observations. So how can we conclude anything about the distribution of the means (sums or averages of random variables)?

Chandler: That is the beauty of the central limit theorem. Irrespective of the parent distribution of the observations, the distribution of the means approximately follows a normal distribution. The distribution of means is also called the sampling distribution of sample means. As per the central limit theorem, if we take a sufficiently large number of samples (ideally infinite) of size n (i.e. each sample contains n observations), calculate each sample’s mean and plot the sample means (x axis being the sample mean and y axis being the number of times that mean was obtained), then the resulting plot will approximately look like a normal distribution. In reality we don’t take infinite samples; the sampling distribution is a theoretical concept. But for now, assume that we take infinite samples and plot the sampling distribution of sample means (trust me for now; we will revisit this assumption when we read about hypothesis testing). Additionally, the theorem also tells us the parameters of this normal distribution (i.e., the sampling distribution of sample means), which are discussed below:

· The mean of the sampling distribution of sample means is equal to the population mean μ.

· The standard error (also called the standard deviation of the sampling distribution) is given by σ/√n, where σ is the standard deviation of the population and n is the sample size (see the simulation sketch below).
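Here is a small simulation sketch (my own example; the exponential population and the specific numbers are assumptions) that illustrates both parameters: draw many samples of size n from a clearly non-normal population, compute their means, and compare the mean and spread of those sample means against μ and σ/√n.

```python
# Verify the mean and standard error of the sampling distribution of the mean.
import math
import random
import statistics

random.seed(2)

mu, n, num_samples = 10.0, 40, 20_000  # population mean, sample size, number of samples
population_sd = mu                     # for an exponential population, sd equals the mean

sample_means = [
    statistics.fmean(random.expovariate(1 / mu) for _ in range(n))
    for _ in range(num_samples)
]

print(statistics.fmean(sample_means))   # close to mu = 10
print(statistics.stdev(sample_means))   # close to sigma / sqrt(n) = 10 / sqrt(40) ~ 1.58
print(population_sd / math.sqrt(n))     # the theoretical standard error
```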

Joey: Can you please check if my understanding about this is correct?

Till now, I understand that

· Our observations need not be from a normal distribution for the central limit theorem to hold true.

· The distribution of sample means tends to follow a normal distribution.

· As the sample size (n) increases, the approximation of the sampling distribution of the sample mean to the normal distribution becomes more accurate. In our case, if we record the number of clicks for a larger number of intervals (here we have eight hourly intervals, or in other words eight random variables), then the distribution of the sample means will be much closer to a normal distribution. The sampling distribution of the sample mean is just a theoretical distribution; we will not actually recreate it by taking infinite samples.

· The parameters (mean and standard error) of the sampling distribution are functions of the population distribution’s parameters and the sample size (n), as the small check after this list shows.
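As a quick check of that last point (the σ used here is a hypothetical population standard deviation of clicks per hour, not a figure from the experiment), the standard error σ/√n shrinks as the sample size n grows:

```python
# The standard error sigma / sqrt(n) decreases as the sample size n increases.
import math

sigma = 120.0  # hypothetical population standard deviation of clicks per hour
for n in (4, 8, 16, 64, 256):
    print(n, round(sigma / math.sqrt(n), 2))
```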

Chandler: Yes, you are absolutely right.

Joey: Okay. What is the use of this theorem?

Chandler: This theorem plays a major role in inferential statistics, especially in confidence interval estimation and hypothesis testing. That’s been quite a lot for the first day; we will discuss those topics another day.

The author of this blog is Balaji P, who is pursuing a PhD in reinforcement learning at IIT Madras.

Quora: www.quora.com/profile/Balaji-Pitchai-Kannu

