Statistics & Probability — Hypothesis Testing

Is this result even possible?

Omar Elgabry
OmarElgabry's Blog
9 min readFeb 24, 2019

This series of articles is inspired by the Statistics with R Specialization from Duke University. The full series of articles can be found here.

Have you ever come across situations, outcomes, or events that just seem odd?

In a city made up of 51% women, where jury pools are said to be chosen at random, a certain jury pool of 50 people contains only 8 women.

When you hear things like this, they make you think. It doesn’t seem right. Is that even possible?

And if it is possible, how likely is it that it could have happened at random? Sometimes these questions and their answers help us make decisions.
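To put a rough number on the jury question, here is a minimal sketch. It assumes each of the 50 jurors is independently a woman with probability 0.51 (a binomial model, which is my assumption and not something the example states), and the helper name `binomial_cdf` is just for illustration.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Jury pool from the example: 50 people, each a woman with probability 0.51.
prob_8_or_fewer = binomial_cdf(8, 50, 0.51)
print(f"P(8 or fewer women out of 50) = {prob_8_or_fewer:.2e}")
```

So yes, it is possible, but under this model it would be extraordinarily rare to see it happen by chance.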

Perhaps you work at a healthcare company. Your company has developed a drug to treat the common cold. When testing this new medicine on a random sample of 250 people with the common cold, it’s found that these patients recovered about 1.2 days sooner than those who did not take the drug.

Is this significant? Could this result just be due to chance, or did the drug have an impact?

This is where hypothesis testing comes in.

Introduction to Inference

Statistical Inference is the process of drawing conclusions about the population from data.

(~) Bank managers were randomly given 48 employee resumes to evaluate for promotion. Half of the resumes are male, and half are female.

Of the males, 21 out of 24 (88%) were promoted. Of the females, 14 out of 24 (58%) were promoted. The difference is about 30 percentage points.

Do the data provide convincing evidence of discrimination between males and females?

In general …

— 1. We start with two hypotheses.

Null hypothesis: There is nothing going on. Promotion and gender are independent, there is no gender discrimination, and the observed difference in proportions is simply due to chance.

Alternative hypothesis: There is indeed something going on. Promotion and gender are dependent, there is gender discrimination, and the observed difference in proportions is not due to chance.

— 2. Run Hypothesis tests

We conduct a hypothesis test under the assumption that the null hypothesis is true.

We can do this either via simulation or using theoretical methods that rely on the central limit theorem (CLT). We’ll discuss both, mainly the CLT-based methods, with simulation at the end.

— 3. Reject or Fail to reject Null hypothesis

If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the null hypothesis.

Otherwise, we reject the null hypothesis in favor of the alternative.
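The simulation route can be sketched for the promotion example. This is a minimal permutation test, not necessarily how the course does it: if promotion and gender were independent, the 35 promotions could have landed on any 24 + 24 split of the resumes, so we shuffle the outcomes many times and count how often males end up at least 7 promotions ahead (21 vs 14). The function name, the seed, and the number of simulations are my own choices.

```python
import random


def permutation_p_value(n_sims: int = 10_000, seed: int = 0) -> float:
    """Share of random shuffles at least as extreme as the observed split."""
    rng = random.Random(seed)
    # 48 resumes in total: 35 promoted (1) and 13 not promoted (0).
    outcomes = [1] * 35 + [0] * 13
    observed_gap = 21 - 14  # promotions among males minus among females
    at_least_as_extreme = 0
    for _ in range(n_sims):
        rng.shuffle(outcomes)
        male, female = outcomes[:24], outcomes[24:]
        if sum(male) - sum(female) >= observed_gap:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sims


p_value = permutation_p_value()
print(f"Simulated p-value: {p_value:.3f}")
```

A small simulated p-value means a gap this large rarely appears when the null hypothesis is true, which is evidence against the null.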

Hypothesis Testing (for a mean)

There are 5 steps to conduct a hypothesis test.

(~) A sample of 50 students was collected to measure the average number of relationships students have been in. The sample average is 3.2, the standard deviation is 1.74, and the standard error is 1.74 / sqrt(50) = 0.246.

Do these data support the hypothesis that students have been in more than 3 relationships on average? Use a significance level of 5%.

— 1. Set the hypotheses

There are two claims (hypotheses).

  1. Null (current state, status quo):

In the null hypothesis, usually called H0, we set the parameter of interest (e.g. the mean) equal to (=) some value.

Null hypothesis is that population average = 3

  2. Alternative (what we want to test):

In the alternative hypothesis, usually called HA, the parameter is represented by a range of possible values: either <, >, or != the null value (that’s why we set H0 equal to 3).

Alternative hypothesis is that population average > 3.

The hypotheses are always about population parameters and never about sample statistics. We already know the sample statistics, so we don’t need to hypothesize about them.

— 2. Calculate the point estimate (from a sample)

Get a sample and calculate the point estimate (i.e. mean) from it.

In this case, it’s the average number of relationships, which equals 3.2.

— 3. Check the conditions of CLT

As mentioned before, one way to run a hypothesis test is to use theoretical methods that rely on the central limit theorem.

So, make sure its conditions hold: the observations are independent, and the sample is large enough (n > 30) relative to any skew in the population.

— 4. Draw the distribution and calculate the p-value.

Figure: Distribution

Hypothesis testing takes the concepts we introduced in CLT and CI (confidence intervals), and uses them when plotting the distribution.

Since we assume that the null hypothesis is true, the sampling distribution of the sample mean is nearly normal (by the CLT) and centered at the null value (the hypothesized population mean).

At a 95% confidence level, the white area under the curve is where most sample means would fall. The red area represents the unusual outcomes, the ones suggesting that something is off.

So, if the observed average number of relationships falls in the red area (a probability of 5% or less), we say that the observed data are statistically significant.
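To see where the red area starts in the relationships example, here is a small sketch. It assumes a 5% significance level placed entirely in the upper tail (because the alternative is "greater than"), and uses the null value 3 and the standard error 0.246 from the example.

```python
from statistics import NormalDist

null_mean = 3.0
standard_error = 0.246
alpha = 0.05  # assumed significance level, upper tail only

# Sampling distribution of the sample mean if the null hypothesis is true.
sampling_dist = NormalDist(mu=null_mean, sigma=standard_error)

# Sample means above this cutoff fall in the rejection ("red") region.
cutoff = sampling_dist.inv_cdf(1 - alpha)

observed_mean = 3.2
print(f"Rejection region starts at a sample mean of {cutoff:.3f}")
print("Observed mean is in the rejection region:", observed_mean >= cutoff)
```

The observed mean of 3.2 sits below the cutoff, so it lands in the white area.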

The p-value is the probability of the observed value, or a more extreme one, given that the null hypothesis is true.

P(data | hypothesis) = P(sample mean >= 3.2 | H0: pop. mean = 3)

The extreme (unusual) value here means a value greater than the observed one.

How to calculate the p-value?

The z-score (also called the test statistic)
= (3.2 - 3 (null value)) / 0.246
= 0.81
The p-value = 1 - the percentile of the z-score = 1 - 0.791 = 0.209

But what does it mean?

Under the assumption that the null hypothesis is true, there is a 20.9% chance that a random sample of 50 students would yield a sample mean of 3.2 or higher.
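The arithmetic above can be reproduced in a few lines. This is a sketch using Python's standard library; the variable names are mine, and the numbers come from the example. Because the standard error is not rounded here, the results differ from the hand calculation in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, sample_sd = 50, 3.2, 1.74
null_mean = 3.0

standard_error = sample_sd / sqrt(n)
z_score = (sample_mean - null_mean) / standard_error

# One-sided (upper tail) p-value: P(Z >= z) under the standard normal.
p_value = 1 - NormalDist().cdf(z_score)

print(f"SE = {standard_error:.3f}, z = {z_score:.2f}, p-value = {p_value:.3f}")
```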

— 5. Interpreting the p-value, & making a decision

  • If the p-value is lower than the significance level alpha (a threshold, usually 5%), we say that it would be very unlikely to get the observed data if the null hypothesis were true. We therefore reject the null, and call the result statistically significant.
  • If the p-value is higher, we say that it is indeed likely to observe data like ours (a sample mean ≥ 3.2) even if the null hypothesis is true, and hence we do not reject the null hypothesis.

Since our p-value (20.9%) is higher than 5%, we fail to reject the null hypothesis.

These data do not provide convincing evidence that students have been in more than 3 relationships on average. The difference between the null value of 3 relationships and the observed sample mean of 3.2 relationships can plausibly be attributed to chance or sampling variability.

Two-sided (two-tailed)

Instead of looking for a divergence from the null hypothesis in a specific direction (greater or less than), we might need to look for a divergence in either direction.

When should you choose two-sided vs. one-sided? It depends on the alternative hypothesis: one-sided if it’s greater than or less than, and two-sided if it’s not equal.

In the two-sided case, the “extreme” values in the p-value include both directions.

The p-value is the probability of the observed value, or a more extreme one, in either direction:

P(sample mean >= 3.2 OR sample mean <= 2.8 | H0: population mean = 3) = 2 x 0.209 (p-value of one-sided) = 0.418

Here 2.8 is as far below the null value of 3 as the observed 3.2 is above it.
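Here is a short sketch of the two-sided calculation. The lower cutoff is the mirror image of 3.2 about the null value, that is 3 - 0.2 = 2.8, and the two tail areas are added together. Variable names are my own.

```python
from statistics import NormalDist

null_mean, standard_error = 3.0, 0.246
observed_mean = 3.2

# How far the observed mean is from the null value, in either direction.
distance = abs(observed_mean - null_mean)

sampling_dist = NormalDist(mu=null_mean, sigma=standard_error)
upper_tail = 1 - sampling_dist.cdf(null_mean + distance)  # P(mean >= 3.2)
lower_tail = sampling_dist.cdf(null_mean - distance)      # P(mean <= 2.8)

two_sided_p = upper_tail + lower_tail
print(f"Two-sided p-value = {two_sided_p:.3f}")
```

Because the normal curve is symmetric, the two tails are equal, which is why doubling the one-sided p-value works.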

(~) A sample of 36 mothers’ IQ scores is collected, with an average of 118.2 and sd = 6.5.

Perform a hypothesis test to evaluate whether the observed difference between the mothers’ mean IQ and the population mean IQ is statistically significant. The population IQ mean = 100. Use alpha = 0.01.

— 1. Hypotheses:

  • H0: mean IQ of mothers = 100
  • HA: mean IQ of mothers != 100

— 2. The point estimate (mean) = 118.2

— 3. Check the conditions of CLT

  • Independence: The sample is random, and we assume it makes up < 10% of the population.
  • Skewness: n > 30, and the given sample distribution looks nearly normal.

— 4. Distribution & p-value

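Given it’s a two-sided hypothesis, the p-value = P(sample mean ≥ 118.2 OR sample mean ≤ 81.8 | H0: population mean = 100) ~= 0

Here 81.8 = 100 - 18.2 is as far below the null value as 118.2 is above it. The standard error is 6.5 / sqrt(36) = 1.083, so the z-score is (118.2 - 100) / 1.083 = 16.8, far out in the tail.

It means that the probability of obtaining a random sample of 36 mothers with an average IQ of 118.2 or more extreme, if in fact mothers’ IQ is truly 100 on average (the null hypothesis), is almost 0.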
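To see just how close to 0 this p-value is, here is a sketch with the example's numbers. One implementation detail that is my choice: the tail is so small that computing 1 - cdf would round to exactly 0, so the code uses the complementary error function `math.erfc`, which equals the two-sided normal tail area when given |z| / sqrt(2).

```python
from math import erfc, sqrt

n, sample_mean, sample_sd = 36, 118.2, 6.5
null_mean = 100.0

standard_error = sample_sd / sqrt(n)
z_score = (sample_mean - null_mean) / standard_error

# Two-sided p-value: 2 * P(Z >= |z|) = erfc(|z| / sqrt(2)).
two_sided_p = erfc(abs(z_score) / sqrt(2))

print(f"SE = {standard_error:.3f}, z = {z_score:.1f}, p-value = {two_sided_p:.1e}")
```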

— 5. Make a decision

Since p-value < alpha (1%), we have very strong evidence against H0.

So we reject it, and conclude that the sample data provide evidence of a difference between the average IQ score of mothers and the average IQ score of the population at large.

Would you expect a 99% confidence interval to contain the null value (100)?

We rejected the null at alpha = 0.01 in a two-sided test, so the value 100 shouldn’t be in the matching 99% interval.
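As a quick check of that claim, here is a sketch that builds the interval from the example's numbers. It assumes a 99% confidence level, chosen to match alpha = 0.01 in a two-sided test.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, sample_sd = 36, 118.2, 6.5
null_mean = 100.0
confidence_level = 0.99  # assumed, to match alpha = 0.01 (two-sided)

standard_error = sample_sd / sqrt(n)

# Critical value z* that leaves (1 - confidence_level) / 2 in each tail.
z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
margin = z_star * standard_error

lower, upper = sample_mean - margin, sample_mean + margin
print(f"99% CI: ({lower:.2f}, {upper:.2f})")
print("Contains the null value:", lower <= null_mean <= upper)
```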

Significance vs. Confidence Level

What’s the relationship between significance and confidence level?

In a two-sided test, if the confidence level is 95%, the matching significance level is 5%. Why? The rejection region is split between the two tails, 2.5% on the right and 2.5% on the left.

In a one-sided test, if the significance level is 5%, what should the confidence level be? The matching confidence level is 90%. Why? The rejection region is the 5% in one tail only, while a confidence interval always leaves out an equal area in both tails, 5% + 5% here.

When using both methods for inference, make sure they agree with each other. Why? Because anything above (or below) the confidence interval is considered an “extreme value”.

To summarize:

  • A two-sided hypothesis test with a threshold of alpha corresponds to a confidence level of 1 - alpha, so the two are complementary.
  • A one-sided hypothesis test with a threshold of alpha corresponds to a confidence level of 1 - (2 x alpha).
  • If your confidence interval includes the null value, don’t reject the null hypothesis.
  • If your confidence interval does not include the null value, then you can go ahead and reject it.
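The first two rules in this summary boil down to one idea: a confidence interval always leaves out the same area in each tail, so the matching confidence level is 1 minus twice the area of a single rejection tail. A tiny sketch, with a hypothetical helper name:

```python
def matching_confidence_level(alpha: float, two_sided: bool) -> float:
    """Confidence level whose interval agrees with a test at level alpha."""
    # Area of a single rejection tail: alpha is split in two for a
    # two-sided test, and sits entirely in one tail for a one-sided test.
    tail_area = alpha / 2 if two_sided else alpha
    return 1 - 2 * tail_area


print(f"{matching_confidence_level(0.05, two_sided=True):.2f}")   # two-sided
print(f"{matching_confidence_level(0.05, two_sided=False):.2f}")  # one-sided
```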

Decision Errors

We can make a wrong decision in a statistical hypothesis test. There are two types of errors: Type 1 and Type 2.

We do, though, have the tools necessary to know the likelihood of making these errors. The likelihoods of the two errors are inversely related: lowering one tends to raise the other. So it’s not easy to keep both error rates down.

                   Decision (fail to reject H0)              Decision (reject H0)
Truth (H0 true)    valid, 1 - alpha                          invalid, alpha (significance level) => You shouldn’t have chosen HA
Truth (HA true)    invalid => You shouldn’t have chosen H0   valid

  • Type 1 Error: Rejecting the H0 when the H0 is true.
  • Type 2 Error: Failing to reject the H0 when the HA is true.

If alpha = 5%, there is a 5% chance of making a Type 1 error when the null hypothesis is true. This is why we prefer small values of alpha: to avoid Type 1 errors.
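One way to convince yourself of this is to simulate a world where the null hypothesis really is true and count how often the test rejects it anyway. The sketch below makes several assumptions of its own: the population is normal with mean 3 and a known standard deviation of 1.74 (borrowed from the relationships example), the test is one-sided, and the function name, seed, and number of trials are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_1_error_rate(alpha: float = 0.05, n: int = 50,
                      trials: int = 5_000, seed: int = 1) -> float:
    """Share of samples drawn under a true H0 that the test still rejects."""
    rng = random.Random(seed)
    null_mean, sd = 3.0, 1.74  # assumed population values; H0 is true here
    standard_error = sd / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided cutoff
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(null_mean, sd) for _ in range(n)]
        z = (fmean(sample) - null_mean) / standard_error
        if z > z_crit:
            rejections += 1  # a Type 1 error: H0 is true but we rejected it
    return rejections / trials


rate = type_1_error_rate()
print(f"Simulated Type 1 error rate: {rate:.3f}")
```

The simulated rate should land close to alpha.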

— Choosing alpha

If a Type 1 error is more dangerous or costly, we might choose a smaller value (below 5%, e.g. 1%). The goal here is to be cautious about rejecting the null hypothesis, so we demand very strong evidence favoring the alternative.

If a Type 2 error is more dangerous or costly, we might choose a higher value (e.g. 10%). Increasing alpha has the effect of decreasing the Type 2 error rate. The goal here is to be cautious about failing to reject the null hypothesis when the null is actually false.
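The trade-off can be made concrete with a small sketch. To compute a Type 2 error rate we have to pretend we know the true mean, so the value 3.5 below is purely hypothetical; the null value 3 and the standard error 0.246 are borrowed from the relationships example, and the test is one-sided.

```python
from statistics import NormalDist


def type_2_error_rate(alpha: float, true_mean: float,
                      null_mean: float = 3.0,
                      standard_error: float = 0.246) -> float:
    """P(fail to reject H0) when the population mean is really true_mean."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided rejection cutoff
    # How far the true mean sits from the null value, in standard errors.
    shift = (true_mean - null_mean) / standard_error
    # We fail to reject whenever the z-score stays below the cutoff.
    return NormalDist().cdf(z_crit - shift)


betas = [type_2_error_rate(alpha, true_mean=3.5) for alpha in (0.01, 0.05, 0.10)]
for alpha, beta in zip((0.01, 0.05, 0.10), betas):
    print(f"alpha = {alpha:.2f} -> Type 2 error rate = {beta:.3f}")
```

As alpha goes up, the Type 2 error rate goes down, and vice versa.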

Thank you for reading!
