Hypothesis tests with Python

Valentina Alto
Sep 2, 2019

In my previous article, I discussed statistical hypothesis tests. They are pivotal in Statistics and Data Science, since we are constantly asked to ‘summarize’ the huge amounts of data we analyze through samples.

Once we have samples, which can be drawn with different techniques (such as bootstrap sampling), the general purpose is to make inferences about the real parameters of the original population by computing so-called statistics, or estimators, from our sample.
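As a quick aside, here is a minimal sketch of bootstrap sampling, under illustrative values: we resample an observed sample with replacement many times, and the spread of the resampled means estimates the uncertainty of our estimator.

import numpy as np

data = np.random.normal(3, 2, 200)  # an observed sample (illustrative values)

# 1,000 bootstrap resamples: draw from the data with replacement, record each mean
boot_means = np.array([np.random.choice(data, size=len(data), replace=True).mean()
                       for _ in range(1000)])

print('Estimated mean:', data.mean())
print('Bootstrap standard error:', boot_means.std())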

However, we need some kind of ‘assurance’ that our estimates are close to the real values. That’s what hypothesis tests are for.

In this article, I’m going to provide a practical example in Python, using randomly generated data, so that you can easily visualize all the potential outcomes of the test.

So let’s start by generating our data:

import numpy as np
import matplotlib.pyplot as plt

# Population: 10,000 draws from a normal with mean 3 and standard deviation 2
mu, sigma = 3, 2
s = np.random.normal(mu, sigma, 10000)

# Histogram of the draws, with the theoretical density overlaid in red
count, bins, ignored = plt.hist(s, 30, density=True)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='r')
plt.show()

As you can see, I manually generated normally distributed data with mean = 3 and standard deviation = 2. Now the idea is to extract a sample from this population and check whether it was actually drawn from a population with mean = 3. For this purpose, and so that the visualization stays clear, I will manually create another normal distribution with a different mean (just imagine it was our sample, taken from the population).

# 'Sample' of 200 observations, deliberately generated with mean 1.5 instead of 3
sample_mean, sample_sigma = 1.5, 2
sample = np.random.normal(sample_mean, sample_sigma, 200)

Since I manually created this ‘sample’ with mean deliberately less than 3 (namely 1.5), our hypotheses will be:

H0: μ = 3 (the sample was drawn from the population)
H1: μ ≠ 3 (it was not)

First, let’s have a look at both distributions:

count, bins, ignored = plt.hist(s, 30, alpha=0.1, density=True)
sample_count, sample_bins, sample_ignored = plt.hist(sample, 30, alpha=0.1, color='r', density=True)
plt.plot(sample_bins, 1/(sample_sigma * np.sqrt(2 * np.pi)) * np.exp(-(sample_bins - sample_mean)**2 / (2 * sample_sigma**2)),
         linewidth=2, color='r')
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='b')
plt.show()

In red we have our sample distribution, in blue the real population distribution. In this case we already know the answer: our sample does not arise from the blue population, which is obvious since I did not actually extract it from that population. But what if you are not given the real population distribution? Then we need to ask how likely it is that the mean of our sample equals that of the population.

Hence, let’s compute the confidence interval of our sample. Just to recall: a confidence interval at level x% means that, if we repeatedly drew samples from the population and computed the interval each time, in about x% of those samples the interval would contain the true parameter (here, the mean) we are inquiring about.
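To make this interpretation concrete, here is a minimal simulation sketch (the sample size and number of runs are illustrative); note that it scales the interval by the standard error sigma/sqrt(n), the textbook choice for an interval on a mean:

import numpy as np
import scipy.stats

true_mu, true_sigma, n, runs = 3, 2, 200, 1000
hits = 0
for _ in range(runs):
    draw = np.random.normal(true_mu, true_sigma, n)
    # 95% interval around this sample's mean
    lo, hi = scipy.stats.norm.interval(0.95, loc=draw.mean(), scale=true_sigma/np.sqrt(n))
    hits += (lo <= true_mu <= hi)
print('Coverage: {:.1%}'.format(hits / runs))  # should land close to 95%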

We can easily compute the interval, at 95% confidence, with SciPy:

import scipy.stats

# 95% interval around the sample mean (using the same scale as the sample)
ci = scipy.stats.norm.interval(0.95, loc=1.5, scale=2)

count, bins, ignored = plt.hist(s, 30, alpha=0.1, density=True)
sample_count, sample_bins, sample_ignored = plt.hist(sample, 30, alpha=0.1, color='r', density=True)
plt.plot(sample_bins, 1/(sample_sigma * np.sqrt(2 * np.pi)) * np.exp(-(sample_bins - sample_mean)**2 / (2 * sample_sigma**2)),
         linewidth=2, color='r')
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='b')
# Mark the interval bounds in green
plt.axvline(ci[0], color='g')
plt.axvline(ci[1], color='g')
plt.show()
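For these parameters the bounds are simply loc ± 1.96 · scale:

print(ci)  # approximately (-2.42, 5.42)

Notice that the true population mean, 3, falls inside the interval, which already hints that the test below will not reject the null.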

Now let’s visualize the possible outcomes of our test:

count, bins, ignored = plt.hist(s, 30, alpha=0.1, density=True)
sample_count, sample_bins, sample_ignored = plt.hist(sample, 30, alpha=0.1, color='r', density=True)
plt.plot(sample_bins, 1/(sample_sigma * np.sqrt(2 * np.pi)) * np.exp(-(sample_bins - sample_mean)**2 / (2 * sample_sigma**2)),
         linewidth=2, color='r')
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='b')
plt.axvline(ci[0], color='g')
plt.axvline(ci[1], color='g')

# Shade the sample density below the lower bound (left Type 1 Error area)
plt.fill_between(x=np.arange(-4, ci[0], 0.01),
                 y1=scipy.stats.norm.pdf(np.arange(-4, ci[0], 0.01), loc=1.5, scale=2),
                 facecolor='red', alpha=0.35)

# Shade the sample density above the upper bound (right Type 1 Error area)
plt.fill_between(x=np.arange(ci[1], 7.5, 0.01),
                 y1=scipy.stats.norm.pdf(np.arange(ci[1], 7.5, 0.01), loc=1.5, scale=2),
                 facecolor='red', alpha=0.5)

# Shade the population density inside the interval (Type 2 Error area)
plt.fill_between(x=np.arange(ci[0], ci[1], 0.01),
                 y1=scipy.stats.norm.pdf(np.arange(ci[0], ci[1], 0.01), loc=3, scale=2),
                 facecolor='blue', alpha=0.5)

plt.text(x=0, y=0.18, s="Null Hypothesis")
plt.text(x=6, y=0.05, s="Alternative")
plt.text(x=-4, y=0.01, s="Type 1 Error")
plt.text(x=6.2, y=0.01, s="Type 1 Error")
plt.text(x=2, y=0.02, s="Type 2 Error")

plt.show()

As you can see, the sample mean (1.5) falls in the Type 2 Error area (meaning that we do not reject the null when it is false). To double-check, let’s compute the p-value, keeping in mind that our significance level is 5% (hence, we do not reject the null if the p-value is greater than 5%).

# z-score of the sample mean against the hypothesized population mean
z_score = (sample_mean - mu) / sigma
p_value = scipy.stats.norm.sf(abs(z_score))
print('P-value= {}'.format(p_value))

if p_value < 0.05:
    print('P-value<alpha: reject H0')
else:
    print('P-value>alpha: do not reject H0')
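Plugging in the numbers: z = (1.5 − 3)/2 = −0.75, and scipy.stats.norm.sf(0.75) ≈ 0.227, well above 0.05, so the script prints that we do not reject H0.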

As you can see, our test confirms what is displayed in the picture above: we do not reject the null, i.e., at the 95% confidence level our sample is treated as if it were extracted from a population with mean = 3. Of course, we know that this conclusion is wrong, since the sample was actually generated with mean = 1.5: this is precisely a Type 2 Error. So, how could we handle this inconsistency? The answer is that we can’t, at least not completely. We could shorten our confidence interval (that is, lower the confidence level), but be aware that this raises the chance of a Type 1 Error (rejecting the null when it is true).

So, the idea is to balance the size of your confidence interval against the kind of task you are facing. For instance, if rejecting the null when it is true would mean a tremendous loss of revenue, you’d rather keep your confidence interval large enough that only truly extreme values lead to a rejection of your null.
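To put some numbers on this trade-off, here is a minimal sketch, still using the loc = 1.5, scale = 2 setup and the error areas as defined in the plot above: for each confidence level it reports the interval together with alpha (the Type 1 Error probability) and beta (the mass of the true N(3, 2) population that falls inside the interval, i.e. the Type 2 Error area).

import scipy.stats

for conf in (0.90, 0.95, 0.99):
    lo, hi = scipy.stats.norm.interval(conf, loc=1.5, scale=2)
    alpha = 1 - conf  # probability of rejecting the null when it is true
    # mass of N(3, 2) inside the interval: probability of a Type 2 error
    beta = scipy.stats.norm.cdf(hi, loc=3, scale=2) - scipy.stats.norm.cdf(lo, loc=3, scale=2)
    print('conf={:.0%}  interval=({:.2f}, {:.2f})  alpha={:.2f}  beta={:.2f}'.format(conf, lo, hi, alpha, beta))

As the confidence level grows, alpha shrinks while beta grows: a wider interval protects you against Type 1 Errors at the price of more Type 2 Errors.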


Originally published at http://datasciencechalktalk.com on September 2, 2019.
