
Day-61 Math Behind the ML with Python-8 (Sampling Distributions and Hypothesis Testing)

Samet Girgin
PursuitOfData


Most statistical analysis involves working with distributions, usually of sample data. When working with statistics, we usually base our calculations on a sample and not the full population of data. That means we must allow for some variation between the sample statistics and the true parameters of the full population.

If we know the probability of a binary event occurring, it is easy to calculate the expected value for a random variable indicating the number of events in a population. But what if we don't know that probability, for example the probability of a search?

Creating a Proportion Distribution from a Sample:

Let's take a security search gate as an example. We know that each passenger will either be searched or not searched, and we can assign the values 0 (for not searched) and 1 (for searched) to these outcomes.

We can conduct a series of Bernoulli trials in which we sample 16 passengers and calculate the fraction (or proportion) of passengers who were searched (p), and the remaining proportion of passengers, the ones who weren't searched (1 − p). Let's assume we record the following values from our 16-person sample: 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0

There were 3 searches out of 16 passengers, which as a proportion is 3/16, or 0.1875. This is our sample proportion, and we call it p̂ (or p-hat) because it is sample-based.

Although this data is categorical (searched or not searched), because we're using the numeric values 0 and 1 we can treat it as numeric, create a binomial distribution from it, and calculate statistics like the mean and standard deviation.
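As a quick illustration (a minimal sketch; the variable names are mine, not from the original notebook), these statistics can be computed with NumPy:

```python
import numpy as np

# The 16-passenger sample: 1 = searched, 0 = not searched
searches = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

p_hat = searches.mean()   # sample proportion: 3/16 = 0.1875
std_dev = searches.std()  # standard deviation of the 0/1 values

print(f"p-hat = {p_hat:.4f}, std = {std_dev:.4f}")
```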

Using a single sample like the one above can be misleading, because the number of searches can vary from sample to sample. Another observer sampling 16 passengers may get a (very) different result. Taking many different samples and combining the results to form a sampling distribution is one way around this. Let's look at the figures below:

Creating a sampling distribution of a sample proportion and plotting this distribution
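The original figures aren't reproduced here, but the following sketch shows how such a sampling distribution might be simulated and plotted (assuming a true search probability of 0.1875 and 1,000 samples of 16 passengers, both illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
p_true, n, num_samples = 0.1875, 16, 1000  # assumed true search probability

# Each entry is the proportion of searches in one 16-passenger sample
proportions = np.random.binomial(n, p_true, size=num_samples) / n

plt.hist(proportions, bins=17)
plt.axvline(proportions.mean(), color='red', linestyle='--')
plt.xlabel('Sample proportion (p-hat)')
plt.ylabel('Frequency')
plt.title('Sampling distribution of the sample proportion')
plt.show()
```

Note how the histogram clusters around the true proportion; this is the sampling distribution the text describes.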

Central Limit Theorem: With a large enough number of samples from binomial experiments, the distribution of values for a random variable starts to form an approximately normal curve. This is the effect of the central limit theorem, and it applies to any distribution of sample data if the size of the sample is large enough.

The sampling distribution is created from the means of multiple samples, and its mean is therefore the mean of all the sample means. For a distribution of sample proportions, this is considered to be the same as p (the population proportion). Because the sampling distribution is based on means, and not totals, its standard deviation is referred to as its standard error, and its formula is:

SE = √( p(1 − p) / n )

As the sample size increases, the mean remains constant but the amount of variance around it is reduced (the effect described by the central limit theorem).
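To see this effect numerically, here is a small sketch (using the 0.1875 proportion from the example; the sample sizes are arbitrary) showing how the standard error shrinks as n grows:

```python
import numpy as np

p = 0.1875  # sample proportion from the 16-passenger example

# The standard error sqrt(p(1-p)/n) shrinks as the sample size grows
for n in (16, 64, 256, 1024):
    se = np.sqrt(p * (1 - p) / n)
    print(f"n = {n:4d}  SE = {se:.4f}")
```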

Creating a Sampling Distribution of Sample Means:

The previous example involved discrete values: the number of passengers searched or not searched. When we work with continuous data, we use slightly different formulae for the sampling distribution.

  • Let's suppose we want to examine the weight of the hand luggage carried by each passenger. It isn't possible to weigh every bag, so we take samples of 5 passengers at a time, 12 samples in total.

Visualizing the distribution for the sampling distribution of a continuous dataset. Look at the following Jupyter notebook for detailed info and code.
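Since the notebook isn't embedded here, the following sketch simulates the scenario with made-up bag weights (a normal distribution with mean 5 kg and standard deviation 1.5 kg, both assumptions of mine):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# Hypothetical hand-luggage weights (kg): 12 samples of 5 passengers each
samples = np.random.normal(loc=5.0, scale=1.5, size=(12, 5))
sample_means = samples.mean(axis=1)  # mean weight of each 5-bag sample

plt.hist(sample_means, bins=6)
plt.xlabel('Sample mean bag weight (kg)')
plt.ylabel('Frequency')
plt.show()
```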

Mean and Variance of the Sampling Distribution:

  • Sample Mean: This is the mean for the complete set of sample data
  • Sample StdDev: This is the standard deviation for the complete set of sample data
  • Sampling Mean: This is the mean for the sampling distribution
  • Sampling StdErr: This is the standard deviation (or standard error) for the sampling distribution

Let's assume that X is a random variable representing every possible bag weight; then its mean (written μX) is the population mean μ. The mean of the sampling distribution of X̄, written μx̄, is considered to have the same value: μx̄ = μ. In other words, the sample mean and the sampling mean are very close to each other.

To find the standard deviation of the sample mean, which is technically the standard error, we use the formula

SEx̄ = σ / √n

where σ is the population standard deviation and n is the size of each sample. Because the population standard deviation is usually unknown, we can substitute the full sample standard deviation s:

SEx̄ ≈ s / √n
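A sketch of how these four statistics might be computed for the simulated bag weights above (all values are illustrative, not from the original notebook):

```python
import numpy as np

np.random.seed(0)
samples = np.random.normal(loc=5.0, scale=1.5, size=(12, 5))  # as above

all_data = samples.flatten()
sample_mean = all_data.mean()       # mean of the complete set of sample data
sample_std = all_data.std(ddof=1)   # std dev of the complete set of sample data

sample_means = samples.mean(axis=1)
sampling_mean = sample_means.mean()  # mean of the sampling distribution
std_err = sample_std / np.sqrt(5)    # standard error: s / sqrt(n)

print(f"Sample Mean:    {sample_mean:.3f}")
print(f"Sample StdDev:  {sample_std:.3f}")
print(f"Sampling Mean:  {sampling_mean:.3f}")
print(f"Sampling StdErr: {std_err:.3f}")
```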

Confidence Intervals:

A confidence interval is a range of values around a sample statistic within which we are confident that the true parameter lies.

The central limit theorem gives us an approximately normal distribution for our variable, so we can use a table of z-scores to determine the number of standard deviations above and below the mean within which 95% of the data falls, and then multiply that by the standard error of our sampling distribution. In a normal distribution, the z-score for 95% is 1.96, so our margin of error is ±1.96 × 0.0466 ≈ 0.091, which gives the confidence interval within which the mean will lie in 95% of samples.

For example, our bag weight sampling distribution is based on samples of the weights of bags carried by passengers through our airport security line. We know the value of the mean weight, and we assume it is also the population mean for all bags; but how confident can we be that the true mean weight of all carry-on bags is close to that value?

Confidence intervals are expressed as a sample statistic ± (plus or minus) a margin of error. To calculate the margin of error, we need to choose the confidence level we want (for example, 95%) and determine the z-score that marks the threshold above or below which lie the values that fall outside the chosen interval.

  • In Python, the scipy.stats.norm.interval function is used to calculate a confidence interval for a normal distribution.
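For example, a sketch using the simulated sample means from above (scipy.stats.sem computes the standard error of the mean; the numbers are illustrative):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
samples = np.random.normal(loc=5.0, scale=1.5, size=(12, 5))  # as above
sample_means = samples.mean(axis=1)

mean = sample_means.mean()
std_err = stats.sem(sample_means)  # standard error of the mean

# 95% confidence interval, assuming an approximately normal distribution
ci_low, ci_high = stats.norm.interval(0.95, loc=mean, scale=std_err)
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```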

Let's dive into the notebook and look over the code to compute and visualize the concepts explained above.

Hypothesis Testing:

There are four steps in data-driven decision-making: first, formulating a hypothesis; second, finding the right test for your hypothesis; third, executing the test; and fourth, making a decision based on the result. A hypothesis is an idea that can be tested.

Steps in data-driven decision-making

Single-Sample, One-Sided Tests: Let's suppose a class has completed its semester and the students have been asked to rate the class on a scale between -5 and 5. The class population includes thousands of students, and we'll take a random sample of 50 ratings to assess the class. Possible ratings are between -5 and 5, with a "neutral" score of 0. In other words, if our average score is above zero, students tend to enjoy the course.

However, this is just a sample, and we want to make a statement not just about our sample but about the whole population from which it came. How can we test our belief that our positive-looking sample reflects a positive population mean (not just a positive sample mean)?

Let's define two hypotheses:

  • The null hypothesis (H0): The population mean for all of the ratings is not higher than 0, and the fact that our sample mean is higher than this is due to random chance in our sample selection.
  • The alternative hypothesis (H1): The population mean is actually higher than 0, and the fact that our sample mean is higher than this means that our sample correctly detected this trend.

We call the number of standard deviations above the mean where our sample mean is found the test statistic (or sometimes just t-statistic), and we call the area under the curve from this point (representing the probability of observing a sample mean this high or greater) the p-value.

One-Tailed Test

So the p-value tells us how probable our sample mean is when the null hypothesis is true, but we need to set a threshold below which we consider this too improbable to be explained by random chance alone. We call this threshold our critical value, and we usually denote it α (commonly a value of 0.05).

We calculate the t-statistic by performing a statistical test. Technically, when the standard deviation of the population is known, we call it a z-test (the normal distribution is also called the z-distribution). The general formula for a one-tailed, single-sample t-test is:

t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the population mean under the null hypothesis, s is the sample standard deviation, and n is the sample size.

If the p-value is smaller than our critical value of 0.05, that means that under the null hypothesis, the probability of observing a sample mean as high as we did by random chance is low. That’s a good sign for us, because it means that our sample is unlikely under the null, and therefore the null is a poor explanation for the data. We can safely reject the null hypothesis in favor of the alternative hypothesis — there’s enough evidence to suggest that the population mean for our class ratings is greater than 0.

Conversely, if the p-value is greater than the critical value, we fail to reject the null hypothesis and conclude that the mean rating is not greater than 0. Note that we never actually accept the null hypothesis, we just conclude that there isn’t enough evidence to reject it!
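A sketch of this test in Python, using randomly generated ratings in place of the real survey data (the alternative argument of scipy.stats.ttest_1samp requires SciPy 1.6 or later):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
ratings = np.random.randint(-5, 6, size=50)  # hypothetical sample of 50 ratings

# One-sided, single-sample t-test: H0: mean <= 0 vs H1: mean > 0
t_stat, p_value = stats.ttest_1samp(ratings, popmean=0, alternative='greater')

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```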

Two-Tailed Tests: Previously, we dealt with a one-tailed test, in which the p-value represents the area under one tail of the distribution curve.

Let's restate our hypotheses like this:

  • The null hypothesis (H0) is that the population mean for all of the ratings is 0, and the fact that our sample mean is higher or lower than this can be explained by random chance in our sample selection.
  • The alternative hypothesis (H1) is that the population mean is not equal to 0.

Two-Tailed Test

In a two-tailed test, we are willing to reject the null hypothesis if the result is significantly greater or less than the hypothesized value. Our critical value (5%) is therefore split in two: the top 2.5% of the curve and the bottom 2.5% of the curve.
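The same test in its two-tailed form, again on illustrative random ratings (two-sided is SciPy's default alternative):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
ratings = np.random.randint(-5, 6, size=50)  # hypothetical sample as before

# Two-tailed, single-sample t-test: H0: mean == 0 vs H1: mean != 0
t_stat, p_value = stats.ttest_1samp(ratings, popmean=0)

# With alpha = 0.05, the rejection region is split: 2.5% in each tail
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```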

Two-Sample Tests: Sometimes we might want to compare two samples against one another. Let's suppose that some of the students who took the statistics course had previously studied math, while others had no previous math experience. We might hypothesize that the grades of students who had previously studied math are significantly higher than the grades of students who had not.

This will be a one-sided test that compares two samples.

  • The null hypothesis (H0) is that the population mean grade for students with previous math studies is not greater than the population mean grade for students without any math experience.
  • The alternative hypothesis (H1) is that the population mean grade for students with previous math studies is greater than the population mean grade for students without any math experience.

The result of this test is interpreted in the same way as the previous single-sample, one-tailed test.
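A sketch with made-up grades for the two groups, using scipy.stats.ttest_ind (the alternative='greater' argument requires SciPy 1.6 or later):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# Hypothetical grades for the two groups of students
math_background = np.random.normal(75, 10, size=40)
no_math_background = np.random.normal(70, 10, size=45)

# One-sided, independent two-sample t-test:
# H1 is that the first group's mean is greater
t_stat, p_value = stats.ttest_ind(math_background, no_math_background,
                                  alternative='greater')
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```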

Paired Tests: Sometimes we need to compare statistical differences between related observations, for example before and after some change that might influence the data.

Let's suppose students in a class took a mid-term exam and later took a final exam. We could test for a general improvement on average across all students with a two-sample independent test, but to compare the two test scores for each individual student, we need to create two samples, one for scores in the mid-term exam and the other for scores in the end-of-term exam, and compare them in such a way that each pair of observations for the same student is compared to one another. This is known as a paired-samples t-test or a dependent-samples t-test. Technically, it tests whether the per-student changes tend to be in the positive or negative direction. If the scores did in fact improve, we can reject the null hypothesis.
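A sketch with made-up mid-term and final scores, using scipy.stats.ttest_rel for the paired comparison:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# Hypothetical mid-term and final scores for the same 30 students
midterm = np.random.normal(70, 8, size=30)
final = midterm + np.random.normal(3, 5, size=30)  # scores tend to improve

# Paired (dependent-samples) t-test on the per-student differences
t_stat, p_value = stats.ttest_rel(final, midterm)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```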

The following Jupyter notebook includes code applying these hypothesis tests.

This is the end of the math-behind-machine-learning series. It took one more piece than I anticipated at the beginning, but I hope it will be helpful to some people on their ML journey. Please follow me on my Twitter, LinkedIn, and Medium pages. Have a good ML journey.

References and Further Readings:

  • Microsoft DAT256x (edX): https://courses.edx.org/courses/course-v1:Microsoft+DAT256x+3T2019/courseware/aed039c14b2649ebad2607c5d65a2485/bb1cae9f448c48a78ca83dd1c3706a03/?child=first
  • Sampling Distribution of the Sample Mean, x-bar: https://bolt.mph.ufl.edu/6050-6052/module-9/sampling-distribution-of-x-bar/
  • Hypothesis Testing (Investopedia): https://www.investopedia.com/terms/h/hypothesistesting.asp
