Performing Statistical Estimation
An attempt to explain the estimation process in statistics in the simplest form.
Statistics, as we know, is the study of gathering data; summarizing and visualizing it; identifying patterns, differences, limitations, and inconsistencies; and extrapolating information about the population from a sample.
This process of extrapolating information about the population from a sample is called Estimation. Since it’s impossible to collect information from every single member of the population, we instead gather information from a sample and work our way toward estimating the information about the population. In this process, we use something called an “Estimator” to generate the estimate.
A value calculated for the entire population is called a parameter; the corresponding value calculated for a subset of the population, also known as a sample, is called a statistic.
What can be an estimator?
‘u’ (read “mu”) refers to the Mean of the population, or the true average of the population. In most cases we don’t know this value, so instead we compute a statistic called ‘xbar’, which is the Mean of the sample. ‘xbar’ then becomes our estimator for ‘u’.
It may seem reasonable to use xbar to estimate u. But let’s suppose I take on this mission to determine the average height of all the people in my city. It’s really not possible for me to go to every single person in the city and ask them their heights. So instead I decide to choose a smaller sample i.e. people in my building to estimate the average height of people. Owing to my absent-mindedness, I end up making an incorrect entry of putting the decimal at the wrong place for one of the values (8.0 becomes 80.0).
My sample looks like this, in feet: [4, 5.3, 5.2, 5.5, 5.8, 6.0, 6.1, 80.0]. The mean ‘xbar’ of this sample is 14.74 feet. Are people really that tall on average? ‘xbar’ in this case doesn’t seem like the best choice for estimating ‘u’.
So instead I decide to use another statistic, the Median: the middle value of the sorted sample (or the average of the two middle values when the sample size is even). The median of this sample is 5.65', which seems much more reasonable and closer to the realistic value of ‘u’.
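To see the effect of the outlier concretely, here is a minimal sketch (using NumPy) comparing the two estimators on the mis-entered sample:

```python
import numpy as np

# the sample with the data-entry error (8.0 recorded as 80.0)
heights = np.array([4, 5.3, 5.2, 5.5, 5.8, 6.0, 6.1, 80.0])

mean = np.mean(heights)      # dragged far upward by the outlier
median = np.median(heights)  # barely affected by the outlier

print(round(mean, 2))    # 14.74
print(round(median, 2))  # 5.65
```

A single bad entry moves the mean by almost 9 feet, while the median barely shifts; this robustness to outliers is why the median is often preferred for skewed or error-prone data.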
Is my estimator biased or unbiased?
Variance is a commonly used estimator of the spread in the data, usually given by the following formula:
S² = (1/n) × Σ (xᵢ − xbar)²
But this formula systematically underestimates the population variance, whatever the sample size. In other words, it is biased. Bias is the difference between the expected value of an estimator and the true value of the parameter it estimates (for the mean, this is E[xbar] − u). When this bias equals 0, we say the estimator is unbiased. The formula above yields a non-zero bias. The proof of this can be found here: https://en.wikipedia.org/wiki/Bias_of_an_estimator
So, instead, we use the following formula to calculate the sample variance, where we replace 1/n with 1/(n−1):
s² = (1/(n−1)) × Σ (xᵢ − xbar)²
This formula yields a bias equal to zero, so it no longer underestimates the population variance. Intuitively, the deviations are measured from ‘xbar’ rather than from the true mean ‘u’, and ‘xbar’ is by construction closer to the sample values than ‘u’ is, which makes the 1/n version systematically too small; dividing by the smaller number n−1 inflates the result just enough to compensate.
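A quick simulation makes the bias visible (a sketch; the population parameters here are arbitrary). NumPy exposes both formulas through the ddof argument of np.var:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8            # sample size
sigma2 = 0.25    # true population variance (sigma = 0.5)

biased, unbiased = [], []
for _ in range(100_000):
    x = rng.normal(5.5, 0.5, n)
    biased.append(np.var(x))            # 1/n formula
    unbiased.append(np.var(x, ddof=1))  # 1/(n-1) formula

# the 1/n estimator averages to (n-1)/n * sigma^2, not sigma^2
print(np.mean(biased))    # ~0.219 (= 7/8 * 0.25)
print(np.mean(unbiased))  # ~0.25
```

Averaged over many samples, the 1/n estimates fall short of the true variance by exactly the factor (n−1)/n, while the ddof=1 estimates center on the true value.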
The sample mean is always an unbiased estimator of the population mean, because the expected value of the sample mean equals the true mean of the population. Some samples will have a mean larger than the population mean and some will have a mean lower than it. However, when this process is repeated over many iterations and the estimates are averaged, the average of these sample means will converge to the population mean.
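This can be checked with a small simulation (a sketch; the population parameters are arbitrary): individual sample means scatter around the true mean, but their average converges to it:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 5.5, 0.5, 8  # assumed population parameters

# draw many samples of size n and record each sample mean
sample_means = [rng.normal(mu, sigma, n).mean() for _ in range(100_000)]

print(np.mean(sample_means))  # very close to 5.5
```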
How do we determine the best estimator?
That really depends on whether we are trying to minimize the error or maximize the chance of getting the right answer.
If we are trying to minimize the error, we use the Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). In a real experiment, the process of calculating the estimator is repeated over numerous different samples. In the absence of an outlier, the sample mean ‘xbar’ minimizes the Mean Squared Error:
MSE = (1/m) × Σ (xbarⱼ − u)²
where ‘m’ is the number of iterations. RMSE is nothing but the Square root of MSE.
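A small numeric check of the claim above (a sketch using the corrected building sample): among all constant guesses ‘c’, the one minimizing the mean squared error over a dataset is the sample mean itself:

```python
import numpy as np

x = np.array([4, 5.3, 5.2, 5.5, 5.8, 6.0, 6.1, 8.0])  # corrected sample

# try a fine grid of candidate estimates and measure each one's MSE
candidates = np.linspace(4, 8, 4001)
mse = [np.mean((x - c) ** 2) for c in candidates]

best = candidates[np.argmin(mse)]
print(best, np.mean(x))  # the minimizer sits at the sample mean (~5.7375)
```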
But is minimizing MSE/RMSE always the best choice? Consider a die roll. The mean of a single roll is (1+2+3+4+5+6)/6 = 3.5, yet we can never roll a 3.5. Now imagine we roll a six-sided die 3 times and are asked to estimate the sum of the rolls. If we use the MSE approach and determine the value that minimizes the MSE, we conclude that the expected value of the sum is 3 × 3.5 = 10.5.
But the sum of die rolls will never be a decimal number. In this case, we should choose the estimate that has the highest chance of being exactly right, also called the Maximum Likelihood Estimate. For the sum of 3 die rolls, guessing 10 or 11 (the two most likely sums) gives us a real chance of getting the answer right, whereas 10.5 can never occur.
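The most likely sums can be verified by enumerating all 6³ = 216 equally likely outcomes (a sketch using only the standard library):

```python
from itertools import product
from collections import Counter

# count how often each sum occurs across all 216 outcomes of 3 die rolls
counts = Counter(sum(rolls) for rolls in product(range(1, 7), repeat=3))

best = max(counts.values())
modes = sorted(s for s, c in counts.items() if c == best)

print(modes)       # [10, 11] are the most likely sums
print(best / 216)  # each occurs with probability 27/216 = 0.125
```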
What is the distribution of the estimators?
Going back to my quest to find the average height of the people in my city: I picked a sample of people in my building (where, evidently, I committed a mistake that I eventually corrected) and calculated a statistic, the mean of the sample. The sample size there was 8. I wasn’t completely satisfied with my analysis, so I decided to repeat this process over more samples of the same size, extending the collection to 4 of my friends’ buildings. Each sample then yielded a statistic (incorrigible me is still sticking with the sample mean despite its flaws and its inability to cope with outliers).
Following are the samples I have:
Sample1:[4,5.3,5.2,5.5,5.8,6.0,6.1,8.0] Mean1: 5.74 Std dev: 1.13
Sample2:[4.8,6.1,6.3,5.0,5.5,5.9,5.8,5.6] Mean2: 5.62 Std dev: 0.51
Sample3:[6.1,6.2,6.3,6.0,5.11,5.10,6.0,6.1] Mean3: 5.86 Std dev: 0.48
Sample4:[5.1,5.2,5.4,5.9,5.5,5.10,5.9,5.5] Mean4: 5.45 Std dev: 0.32
Sample5:[2.5,4.11,5.7,5.3,5.8,5.9,5.2,5.1] Mean5: 4.95 Std dev: 1.14
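These statistics can be reproduced with NumPy; note that the standard deviations above match the 1/(n−1) sample formula (ddof=1) from the earlier section:

```python
import numpy as np

sample1 = np.array([4, 5.3, 5.2, 5.5, 5.8, 6.0, 6.1, 8.0])

print(round(np.mean(sample1), 2))         # 5.74
print(round(np.std(sample1, ddof=1), 2))  # 1.13
```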
Now that I have some basic idea of the mean heights, I decided to run a simulation over 100 samples and calculate the mean of each. I am using a normal distribution with a mean of 5.5, a standard deviation of 0.5, and a sample size of 8 to generate the sample means.
import numpy as np

# start with the means of the 5 real samples collected above
means = [5.74, 5.62, 5.86, 5.45, 4.95]

# simulate 100 more samples of size 8 and record each sample mean
for i in range(100):
    x = np.random.normal(5.5, 0.5, 8)
    xbar = np.mean(x)
    means.append(xbar)
This distribution of the statistic, i.e. of the sample means, is called the sampling distribution. Visualizing the distribution of the sample means:
import seaborn as sns
sns.displot(means,color='green')
Imagine if I based my analysis solely on Sample3:[6.1,6.2,6.3,6.0,5.11,5.10,6.0,6.1]. From the looks of the sample, it seems I happened to choose the tallest of the people. This variation in the estimate due to random selection or noise, where a sample may not perfectly reflect the true population, is called sampling error.
How confident are we about the estimate?
We looked at the sample mean, sample median, and sample standard deviation to estimate the corresponding values for the population. Using one single value to represent the quantity we are trying to estimate is called a point estimate. Using a range of values instead is called an interval estimate.
Our sampling distribution is characterized by a standard error and confidence interval.
The standard error measures, on average, how much we expect the estimate to deviate from the true value. This is not the same as the standard deviation, which measures the variation within a single sample. The standard error is the standard deviation of the sampling distribution of the sample means that we see in Fig1. This, in turn, is nothing but the RMSE from the previous section, i.e. the square root of the average of the squared values of ‘xbar − u’. In the above example, the standard error turns out to be about 0.17 feet:
import math
import numpy as np

u = 5.5  # true mean used in the simulation
# squared error of each sample mean in the 'means' list built above
error = [(mean - u)**2 for mean in means]
rmse = math.sqrt(np.mean(error))
The standard error of the sample mean equals sigma/√n, where sigma is the true standard deviation of the population; when ‘n’ is sufficiently large (a common rule of thumb is n > 30), the sampling distribution is also approximately normal.
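As a sanity check (a sketch with the simulation parameters used above), the analytic value sigma/√n agrees with the standard deviation of simulated sample means:

```python
import numpy as np

sigma, n = 0.5, 8
analytic_se = sigma / np.sqrt(n)  # ~0.177 feet

rng = np.random.default_rng(7)
sim_means = [rng.normal(5.5, sigma, n).mean() for _ in range(100_000)]
sim_se = np.std(sim_means)

print(round(analytic_se, 3), round(sim_se, 3))  # both ~0.177
```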
Looking at Fig1, it’s intuitive to think of the sample means of heights in terms of a range of values. A confidence interval is an interval estimate that provides a range of values covering a given fraction of the sampling distribution. The 90% confidence interval is the range between the 5th and 95th percentiles, as seen in Fig2, i.e. around (5.2, 5.7) feet. What this means is: if we repeated the sampling process many times and constructed a 90% interval each time, about 90% of those intervals would contain the population parameter (u). What this does NOT mean is that there is a 90% probability of our population parameter (u) lying in this particular range of 5.2–5.7 feet.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
# cumulative histogram (empirical CDF) of the sample means
ax.hist(means, density=True, cumulative=True, label='CDF', color='green')
# mark the 5th and 95th percentiles, which bound the 90% interval
plt.axhline(y=0.05, color='b', linestyle='-')
plt.axhline(y=0.95, color='b', linestyle='-')
plt.show()
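Rather than reading the bounds off the plot, the percentile interval can be computed directly with np.percentile (a sketch; the sample means are regenerated here so the snippet is self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)
means = [rng.normal(5.5, 0.5, 8).mean() for _ in range(10_000)]

# 5th and 95th percentiles bound the central 90% of the sampling distribution
low, high = np.percentile(means, [5, 95])
print(round(low, 2), round(high, 2))  # roughly (5.2, 5.8)
```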
Confidence intervals and standard errors are characteristics of the sampling distribution, but they don’t account for errors or noise arising from sampling bias and/or measurement imprecision.
Going back to my obsession with finding the average height of people in my city: instead of actually measuring heights, I decide to send out email surveys. This method has certain limitations. Some people may not have an email address, may lack the means to respond, or may not have the time to fill out the survey given a busy schedule. I will then be sampling only a certain type of population. This is called sampling bias. External factors like income, availability of resources, and willingness to respond (self-selection: some recipients choosing not to respond) may indirectly come into play and affect my estimate.
While responding to the email surveys, people may under- or overestimate their heights, or round them up or down, which leads to measurement error.
Conclusion
When making estimates, it’s important to remember that the confidence interval and standard error are useful but not the only indicators of error: the estimate could very well be affected by other unaccounted-for sources such as sampling bias and measurement inaccuracies.
REFERENCES:
This compilation is heavily influenced by Allen B. Downey’s book Think Stats.