Sampling Distribution

  1. When the word ‘distribution’ is used, it generally refers to the probability density function (PDF). The x-axis of this plot shows the values the random variable can take and the y-axis shows the probability density at each value, so the area under the curve between two values is the probability of landing between them. This can be plotted in pandas with the method df.plot.density(), where df is the dataframe containing the data whose density is to be plotted.
  2. For a normal distribution we have the 1–2–3 rule (also called the 68–95–99.7 rule): the probability that a value of the random variable plotted in step 1 falls within 1, 2 or 3 standard deviations (sd) of the mean is P(|x − mean| ≤ 1 sd) ≈ 0.68, P(|x − mean| ≤ 2 sd) ≈ 0.95, P(|x − mean| ≤ 3 sd) ≈ 0.997. This property can be used to make a probability statement about the population mean even when you do not know its actual value. If you know the sample mean and the standard deviation of its sampling distribution (the standard error, see 3.2), you can say that the population mean lies within 1, 2 or 3 standard errors of the sample mean with roughly 68%, 95% or 99.7% probability. That gives you the confidence of your prediction.
  3. The PDF of the means of random samples drawn from a population is known as the “Sampling Distribution”. Sampling distributions have the following 3 interesting properties, described by the Central Limit Theorem.
Fig 1 : Central Limit Theorem CLT
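The 1–2–3 rule from item 2 can be checked numerically. The sketch below is only an illustration: the data are hypothetical draws from a normal distribution with mean 50 and sd 10 (values chosen here, not taken from the post), and `fraction_within` is a helper name introduced for this example. It uses only the Python standard library.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: 100,000 draws from a normal distribution (mean 50, sd 10)
x = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

mean = statistics.fmean(x)
sd = statistics.pstdev(x)

def fraction_within(values, center, spread, k):
    """Fraction of values lying within k * spread of center."""
    inside = sum(1 for v in values if abs(v - center) <= k * spread)
    return inside / len(values)

# Coverage at 1, 2 and 3 standard deviations around the mean
coverage = {k: fraction_within(x, mean, sd, k) for k in (1, 2, 3)}
for k, frac in coverage.items():
    print(f"within {k} sd: {frac:.3f}")
```

The printed fractions should land close to 0.68, 0.95 and 0.997. If pandas, matplotlib and SciPy are installed, `pd.Series(x).plot.density()` would draw the corresponding density curve.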

The CLT does not depend on the distribution of the parent population from which the random samples are drawn! Whatever the distribution of the parent population (as long as its variance is finite), the sampling distribution of the mean is approximately normal once each sample is large enough; n > 30 is the usual rule of thumb.
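A small simulation can make this concrete. The sketch below assumes a hypothetical, strongly right-skewed parent population (exponential with mean 1) and compares the skewness of the population with the skewness of the sample means for n = 2 and n = 30. The helper names `skewness` and `sample_means` are introduced here for illustration only.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

def skewness(values):
    """Skewness of the values: near 0 for a symmetric shape, near 2 for an exponential one."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

# Hypothetical parent population: strongly right-skewed (exponential, mean 1)
population = [rng.expovariate(1.0) for _ in range(50_000)]

def sample_means(pop, n, num_samples):
    """Means of num_samples random samples, each of size n, drawn with replacement."""
    return [statistics.fmean(rng.choices(pop, k=n)) for _ in range(num_samples)]

means_n2 = sample_means(population, n=2, num_samples=5_000)
means_n30 = sample_means(population, n=30, num_samples=5_000)

print(f"skewness of population:    {skewness(population):.2f}")
print(f"skewness of means, n = 2:  {skewness(means_n2):.2f}")
print(f"skewness of means, n = 30: {skewness(means_n30):.2f}")
```

The skewness of the sample means shrinks toward 0 as n grows, which is the sampling distribution moving toward a normal shape even though the parent population is far from normal. With pandas, matplotlib and SciPy installed, `pd.Series(means_n30).plot.density()` would show the bell-like curve directly.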

3.1. Number of Samples [more: closer to the population mean] [ACCURACY]: As the number of random samples increases, the mean of the sampling distribution (the mean of all the sample means) gets very close to the actual population mean. This happens because more samples capture more of the diversity in the real data. Here is the sampling distribution plotted with the method df.plot.density(). In this case the population has size 75, each sample has size 5, and 100 samples were taken.

Fig 2 : 1st rule of CLT — Accuracy increases with increase in number of samples
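The setup described in 3.1 (population of 75, samples of size 5) can be imitated in a few lines. This is a sketch under assumptions: the post does not give its data, so the population here is 75 hypothetical values drawn uniformly from 0 to 100, and `mean_of_sample_means` is a helper name made up for the example.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical population of 75 values (the real data are not given in the post)
population = [rng.uniform(0, 100) for _ in range(75)]
pop_mean = statistics.fmean(population)

def mean_of_sample_means(pop, sample_size, num_samples):
    """Average of the means of num_samples random samples (each drawn without replacement)."""
    means = [statistics.fmean(rng.sample(pop, sample_size)) for _ in range(num_samples)]
    return statistics.fmean(means)

# Absolute error of the estimate for an increasing number of samples
errors = {}
for num_samples in (10, 100, 10_000):
    estimate = mean_of_sample_means(population, sample_size=5, num_samples=num_samples)
    errors[num_samples] = abs(estimate - pop_mean)
    print(f"{num_samples:>6} samples: mean of sample means = {estimate:.2f} "
          f"(population mean = {pop_mean:.2f})")
```

With only 10 samples the estimate can still be several units off. With many samples the mean of the sample means settles very close to the population mean, even though each individual sample has only 5 values.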

3.2. Size of each sample (n) [more: tighter] [CONFIDENCE]: The larger the size n of each random sample, the lower the standard deviation of the sampling distribution, i.e. the distribution is tighter, as shown below. The standard deviation of the sampling distribution is known as the standard error of the mean (SEM). For a population with standard deviation σ it equals σ/√n.

NOTE THAT STANDARD ERROR OF MEANS (SEM) IS NOT THE STANDARD DEVIATION OF ONE SAMPLE. RATHER IT IS THE STANDARD DEVIATION OF SAMPLING DISTRIBUTION OF THE MEANS OF ALL THE SAMPLES!!!!

The SEM is the uncertainty involved in inferring the population mean from a sample mean: it measures how far the sample mean is likely to be from the true population mean. For any sample size n > 1, the SEM is smaller than the SD. When building a confidence interval for the population mean from a sample, we use the SEM and not the SD of the sample!
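As a rough illustration of using the SEM rather than the SD for a confidence interval, the sketch below takes one hypothetical sample of 40 measurements (normal, mean 170, sd 8, values invented for this example), estimates the SEM as the sample SD divided by √n, and applies the "2 sd" rule from the normal distribution. For small samples a t-distribution multiplier would be more exact; the factor 2 is a simplification.

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical single sample of n = 40 measurements
sample = [rng.gauss(170.0, 8.0) for _ in range(40)]

n = len(sample)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # SD of this one sample (n - 1 in the denominator)
sem = sample_sd / math.sqrt(n)        # estimated standard error of the mean

# Approximate 95% interval using the "2 sd" rule, applied to the SEM
ci_low = sample_mean - 2 * sem
ci_high = sample_mean + 2 * sem

print(f"sample SD = {sample_sd:.2f}, SEM = {sem:.2f}")
print(f"approx. 95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

Using the sample SD in place of the SEM would make the interval √n times wider and would describe the spread of individual values, not the uncertainty of the mean.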

Fig 3 : Standard Error decreases as n increases
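The shrinking standard error can be reproduced by simulation. In this sketch the population is hypothetical (20,000 values drawn uniformly from 0 to 100) and `empirical_sem` is a helper name introduced here. For each sample size it compares the observed standard deviation of the sample means with the theoretical value σ/√n.

```python
import math
import random
import statistics

rng = random.Random(4)  # fixed seed so the run is repeatable

# Hypothetical population: 20,000 values from a uniform distribution on [0, 100]
population = [rng.uniform(0, 100) for _ in range(20_000)]
sigma = statistics.pstdev(population)

def empirical_sem(pop, n, num_samples=2_000):
    """Standard deviation of the means of num_samples random samples of size n."""
    means = [statistics.fmean(rng.choices(pop, k=n)) for _ in range(num_samples)]
    return statistics.pstdev(means)

# For each n, store (observed SEM, theoretical sigma / sqrt(n))
results = {}
for n in (5, 30, 100):
    results[n] = (empirical_sem(population, n), sigma / math.sqrt(n))
    print(f"n = {n:>3}: empirical SEM = {results[n][0]:.2f}, "
          f"sigma / sqrt(n) = {results[n][1]:.2f}")
```

The two columns should agree closely, and both fall as n grows: quadrupling the sample size roughly halves the standard error.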
