Hypothesis tests with Python

Valentina Alto
Sep 2, 2019 · 5 min read

In my previous article, I talked about statistical hypothesis tests. They are pivotal in statistics and data science, since we are constantly asked to 'summarize' the huge amounts of data we want to analyze into samples.

Once we have samples, which can be drawn with different techniques such as bootstrap sampling, the general goal is to make inferences about the true parameters of the original population by computing so-called statistics, or estimators, from our sample.
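As a quick illustration of bootstrap sampling (a sketch I am adding here, not part of the original analysis): we resample an observed dataset with replacement many times, and the spread of the resampled means approximates the standard error of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3, scale=2, size=500)  # pretend this is our observed sample

# Bootstrap: resample with replacement and collect the statistic of interest
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

print(boot_means.mean())  # close to the sample mean
print(boot_means.std())   # approximates the standard error of the mean, sigma/sqrt(n)
```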

However, we need some kind of 'insurance' that our estimates are close to reality. That is what hypothesis tests provide.

In this article, I'm going to walk through a practical example in Python, with randomly generated data, so that you can easily visualize all the potential outcomes of the test.

So let’s start by generating our data:

import numpy as np
import matplotlib.pyplot as plt

# Population: normal with mean 3 and standard deviation 2
mu, sigma = 3, 2
s = np.random.normal(mu, sigma, 10000)

count, bins, ignored = plt.hist(s, 30, density=True)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='r')
plt.show()

As you can see, I generated normally distributed data with mean=3 and standard deviation=2. The idea now is to extract a sample from this population and check whether it was actually drawn from a population with mean=3. Since I want the visualization to be clear, I will create the 'sample' manually as another normal distribution, with a different mean (just imagine it was taken from the population).

# "Sample": 200 points drawn with mean 1.5 instead of 3
sample_mean, sample_sigma = 1.5, 2
sample = np.random.normal(sample_mean, sample_sigma, 200)

Since I deliberately created this sub-sample with mean less than 3 (namely 1.5), our hypotheses will be:

H0: the sample was drawn from a population with mean = 3
H1: the sample was drawn from a population with mean ≠ 3

First, let's have a look at both distributions:

count, bins, ignored = plt.hist(s, 30, alpha=0.1, density=True)
sample_count, sample_bins, sample_ignored = plt.hist(sample, 30, alpha=0.1, color='r', density=True)
plt.plot(sample_bins, 1/(sample_sigma * np.sqrt(2 * np.pi)) * np.exp(-(sample_bins - sample_mean)**2 / (2 * sample_sigma**2)),
         linewidth=2, color='r')
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='b')
plt.show()

So in red we have our sample distribution, and in blue our real population distribution. In this case we already know the answer: our sample does not come from the blue population, which is obvious since I did not actually extract it from that population. However, what if you are not given the real population distribution? Then we need to ask how likely it is that the mean of our sample equals that of our population.

Hence, let's compute the confidence interval of our sample. Just to recall: a 95% confidence interval means that, if we repeatedly drew samples from the population and computed the interval each time, about 95% of those intervals would contain the true parameter (here, the mean).
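To see this coverage property in action, here is a small simulation I am adding as a sketch (it assumes the population standard deviation sigma is known, so the interval half-width is z * sigma / sqrt(n)): across many samples, roughly 95% of the intervals contain the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n = 3, 2, 200
z = stats.norm.ppf(0.975)  # two-sided 95% critical value, ~1.96

covered = 0
trials = 5000
for _ in range(trials):
    sample = rng.normal(mu, sigma, n)
    half_width = z * sigma / np.sqrt(n)  # known-sigma interval for the mean
    xbar = sample.mean()
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

print(covered / trials)  # close to 0.95
```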

We can easily compute this interval, at 95% confidence, with a SciPy tool:

import scipy.stats
# interval containing the central 95% of a normal with the sample's mean and sigma
ci = scipy.stats.norm.interval(0.95, loc=sample_mean, scale=sample_sigma)
count, bins, ignored = plt.hist(s, 30, alpha=0.1, density=True)
sample_count, sample_bins, sample_ignored = plt.hist(sample, 30, alpha=0.1, color='r',density=True)
plt.plot(sample_bins,1/(sample_sigma * np.sqrt(2 * np.pi)) *np.exp( - (sample_bins - sample_mean)**2 / (2 * sample_sigma**2) ),linewidth=2, color='r')
plt.plot(bins,1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2) ),linewidth=2, color='b')
plt.axvline(ci[0],color='g')
plt.axvline(ci[1],color='g')
plt.show()

Now let's look at the possible outcomes of our test:

count, bins, ignored = plt.hist(s, 30, alpha=0.1, density=True)
sample_count, sample_bins, sample_ignored = plt.hist(sample, 30, alpha=0.1, color='r',density=True)
plt.plot(sample_bins,1/(sample_sigma * np.sqrt(2 * np.pi)) *np.exp( - (sample_bins - sample_mean)**2 / (2 * sample_sigma**2) ),linewidth=2, color='r')
plt.plot(bins,1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2) ),linewidth=2, color='b')
plt.axvline(ci[0],color='g')
plt.axvline(ci[1],color='g')
plt.fill_between(x=np.arange(-4, ci[0], 0.01),
                 y1=scipy.stats.norm.pdf(np.arange(-4, ci[0], 0.01), loc=1.5, scale=2),
                 facecolor='red', alpha=0.35)
plt.fill_between(x=np.arange(ci[1], 7.5, 0.01),
                 y1=scipy.stats.norm.pdf(np.arange(ci[1], 7.5, 0.01), loc=1.5, scale=2),
                 facecolor='red', alpha=0.5)
plt.fill_between(x=np.arange(ci[0], ci[1], 0.01),
                 y1=scipy.stats.norm.pdf(np.arange(ci[0], ci[1], 0.01), loc=3, scale=2),
                 facecolor='blue', alpha=0.5)

plt.text(x=0, y=0.18, s= "Null Hypothesis")
plt.text(x=6, y=0.05, s= "Alternative")
plt.text(x=-4, y=0.01, s= "Type 1 Error")
plt.text(x=6.2, y=0.01, s= "Type 1 Error")
plt.text(x=2, y=0.02, s= "Type 2 Error")

plt.show()

As you can see, the sample mean (1.5) falls in the Type 2 error area (meaning that we fail to reject the null when it is false). To double-check, let's compute the p-value, keeping in mind that our significance level is 5% (hence, we do not reject the null if the p-value is greater than 5%).

# z-score of the sample mean against the population parameters
# (note: this uses sigma directly, i.e., it treats the sample mean
# as a single draw from the population distribution)
z_score = (sample_mean - mu) / sigma
p_value = scipy.stats.norm.sf(abs(z_score))
print('P-value = {}'.format(p_value))

if p_value < 0.05:
    print('P-value < alpha: reject H0')
else:
    print('P-value > alpha: do not reject H0')

As you can see, our test confirms what is displayed in the picture above: we fail to reject the hypothesis that our sample was extracted from a population with mean=3. Of course, we know this conclusion is wrong, since the sample was actually generated with mean=1.5. So, how could we handle this inconsistency? The answer is that we can't. We could shorten our confidence interval, but be aware that this could lead to a Type 1 error (rejecting the null when it is true).

So, the idea is to balance the size of your confidence interval against the kind of task you are facing. For example, if rejecting the null when it is true would mean a tremendous loss of revenue, you'd rather keep your confidence interval large enough that only truly extreme values lead to a rejection of your null.
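To make this trade-off concrete, here is a small simulation sketch I am adding (it uses a two-sided z-test on the sample mean with known sigma, an assumption for illustration): as alpha shrinks, Type 1 errors become rarer, but Type 2 errors become more frequent.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu0, true_mu, sigma, n, trials = 3, 1.5, 2, 10, 2000

results = {}
for alpha in (0.01, 0.05, 0.10):
    z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value

    # Type 1 error rate: the null (mean = mu0) is true; how often do we wrongly reject?
    xbars = rng.normal(mu0, sigma, (trials, n)).mean(axis=1)
    z = (xbars - mu0) / (sigma / np.sqrt(n))
    type1 = np.mean(np.abs(z) > z_crit)

    # Type 2 error rate: the null is false (true mean = 1.5); how often do we fail to reject?
    xbars = rng.normal(true_mu, sigma, (trials, n)).mean(axis=1)
    z = (xbars - mu0) / (sigma / np.sqrt(n))
    type2 = np.mean(np.abs(z) <= z_crit)

    results[alpha] = (type1, type2)
    print(alpha, type1, type2)
```

The Type 1 rate tracks alpha by construction, while the Type 2 rate moves in the opposite direction, which is exactly the balancing act described above.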

Originally published at http://datasciencechalktalk.com on September 2, 2019.
