# Hypothesis Testing in Machine Learning: What for and Why

## Checking your train and test data for statistical significance and some other applications

Personally, I consider myself a Data Analyst who can do Machine Learning. I'm not a maths expert, I don't have a PhD, and I'm not a computer engineer either. I'm just a regular guy who has always loved data and finds machine learning exciting and great fun. That's why whenever I start working on something, whether it's a personal project or part of my role within the Data Science team at Ravelin Technology, I'm always pushed to do a lot of reading and researching in order to understand my problem and the possible solutions. This article does not attempt to be an exhaustive, technically perfect guide to hypothesis testing. So if that's what you're looking for, I'm sorry, but unfortunately I cannot offer you that. This story is more a compilation of my own notes from the many things I read and researched in order to understand the topic. It's an approximation of how I would have liked this topic to be explained to me, always speaking in the context of Machine Learning. So, likewise, if you're looking for a more generic, unbiased explanation of all the concepts involved, I'm sorry again, but that's not what this article pretends to be. Having said that, if you're still interested in reading on, go ahead! Hopefully, the next lines will be as useful for you as they have been for me!

# Why even worry about hypothesis testing?

Suppose you are working on a machine learning project in which you want to predict whether or not a set of patients have a deadly disease, based on several features in your dataset such as blood pressure, heart rate, pulse and others.

Sounds like a serious project, one in which you'll need to really trust your model and predictions, right? That's why you got hundreds of samples, which your local hospital very kindly allowed you to collect, given the importance and seriousness of the topic. But how do you know if your sample is representative of the whole population? And how much difference between sample and population might be reasonable? For example, assume that thanks to some previous studies we know that the real probability of any given patient not having this particular disease is 99%. Now suppose that our sample says that 95% of the patients don't have the disease. Well, a 4% difference doesn't sound like a significant difference that would lead us to SUCH bad modelling, right? It might not be the same, but it kind of sounds like the sample may still be representative. To confirm this, we need to build a better understanding of the theoretical background.

Let's start with what we know: the real probability of not having the disease:

P (not having the disease) = 99%

Now let's assume that we find a new group of 100 people and we test all of them to check whether any of them has the disease we're studying. Can we be sure that 99 of these folks won't have the disease? Maybe, but there's also a possibility that none of them has the disease, or even that several may have it. What we have here is a binomial probability problem. The objective of this story is not to talk about probabilities but, in simple words, the binomial probability is no more than the chance of something happening a fixed number of times, given a prior probability for each independent event. We can find it by applying the following equation:

P(X = x) = (n! / (x! (n − x)!)) · pˣ · qⁿ⁻ˣ

Where:

- n = the number of trials (or the number being sampled)
- x = the number of successes desired
- p = probability of getting a success in one trial
- q = 1 − p = the probability of getting a failure in one trial

So if we want to know the probability that in our sample of 100 people none of them is infected with the disease, we can just fill in the blanks to find that the probability is 36.6%. And if we want to know the probability of having 99 out of 100 folks NOT infected, we fill in the blanks again to find that it is approximately 37.0%. And this sounds reasonable: getting 100 out of 100 not infected doesn't sound very unlikely if every single case has a 99% chance of not being infected. Along the same lines, it also sounds reasonable that having 99 not infected out of 100 folks might be a little more likely.

We could keep going and find the probability of having even fewer people not infected in our sample of 100:

- P (not infected 98 out of 100) = 18.5%
- P (not infected 97 out of 100) = 6.0%
- P (not infected 96 out of 100) = 1.5%
- P (not infected 95 out of 100) = 0.3%
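These probabilities can be double-checked with a few lines of Python. The sketch below uses scipy's binomial distribution; the group of 100 people and p = 0.99 come straight from the example above:

```python
from scipy.stats import binom

p = 0.99  # probability that a single person is NOT infected
n = 100   # number of people in the sample

for x in [100, 99, 98, 97, 96, 95]:
    # binom.pmf gives the probability of exactly x successes in n trials
    print(f"P({x} out of {n} not infected) = {binom.pmf(x, n, p):.3f}")
```

Running it reproduces the figures above: 0.366 for 100 out of 100, 0.370 for 99, down to 0.003 for 95.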

Now, let's go back to the sample we had from our friendly local hospital, which says that 95% of the people in our sample are NOT infected by this horrible deadly disease. Even though it might sound like the difference between 95% and 99% is not relevant, these people are not just any random group: they belong to the same population that we know has a 99% probability of not being infected. So if we set up the hypothesis that our sample is representative, in reality we would have only a 0.3% chance of obtaining a sample with 95 out of 100 people not infected. Therefore we should reject our hypothesis and not proceed.

# How do I apply this in my Machine Learning project?

A representative sample is one which is drawn without bias from the population of interest. Representativeness is not so much about sample size as about the right composition of the sample. In the 1960s, A.C. Nielsen Jr. gave an interesting answer to those who believed that a larger sample size would increase representativeness.

*“If you don’t believe in random sampling, the next time you have a blood test, tell the doctor to take it all.” — A.C. Nielsen Jr.*

In fact, a sample of just 30 or more experimental units is often considered enough (if this sounds like crazy few, Google will surely be able to double-check this for you). And if that sample is well drawn, respecting the different segments and quotas within the population, our 30+ unit sample should indeed be representative. However, if you want to be completely sure about that, the technique for checking this representativeness is called **hypothesis testing**. Hypothesis testing is used to compare two datasets. It is a statistical inference method, so at the end of the test we'll reach a conclusion about whether there's a difference between the groups we're comparing.

Now the bad news: even though in real life you can sometimes generate some base statistics or ground knowledge from a census or previous research in order to use hypothesis testing, most of the time we won't have data such as the mean and standard deviation of our population. So you'll have to work hard to be as sure as possible that your sample is representative of the population.

Nonetheless, whether or not we can use this technique to assure the representativeness of our sample, in the context of machine learning we can use hypothesis testing to check if our test group is representative of our train data. For that, there are some key concepts we should know:

- The “null hypothesis”
- The “alternative hypothesis”
- The z-statistic or t-statistic concept
- The p-value concept

To explain these concepts I'm going to use an example from General Assembly's Data Science Immersive course: say we are testing the efficacy of a new drug, so we randomly select 50 people to be in the placebo control group and 50 people to receive the treatment. We know that both of our groups were selected from a broader, unknown population pool, but we could have ended up with a different random sample of subjects from that pool.

In this context, the **null hypothesis (H0)** is a fundamental concept of frequentist statistical tests. In the context of experiments, we often talk about the “control” group and the “experimental” or “treatment” group. In our drug example, the control group is given the placebo and the treatment group is given the actual drug. We can then define our null hypothesis to be that there is no difference between a subject taking the placebo and one taking the treatment drug, measured by the average difference in blood pressure levels between the treatment and control groups. Therefore, our null hypothesis would be:

**H0: The mean difference between treatment and control groups is zero.**

Meanwhile, the **alternative hypothesis (H1)** is the outcome of the experiment that we hope to show. In our example, the alternative hypothesis is that there is a mean difference in blood pressure between groups:

**H1: The difference in systolic blood pressure between groups is not 0.**

Say now that in our experiment we measure the following results:

- Control group: 50 subjects and an average systolic blood pressure of 121.38
- Experimental/treatment group: also 50 subjects and an average systolic blood pressure of 111.56

The difference between the experimental and control groups is -9.82 points. But with 50 subjects in each group, how confident can we be that this measured difference is real? Now the real fun begins (nerd kind of fun, of course): we'll have to use statistics to find a measure of the degree to which our groups differ or not. There are several different types of tests we can use within the world of statistics. In this post we're going to talk about Z-Tests and T-Tests. The main differences between these two kinds are:

- A Z-Test is typically used when the sample size is 30 or more and the population standard deviation is known (or can be reasonably estimated), while a T-Test is used with smaller samples or when the population standard deviation is unknown.
- Both assume the data are approximately normally distributed and that observations are independent and randomly sampled.

In our case, our sample sizes are equal and larger than 30, we can assume normality, the subjects were randomly selected from a broader population, and they are indeed independent of each other, so we can proceed with a Z-Test.

Following our experiment example, to test our hypotheses we'll have to rely on the Central Limit Theorem. Let's refresh this concept: the Central Limit Theorem establishes that if we gather together a set of independent random variables (for example, the means of independent groups, such as our experimental and control groups), their average should follow a normal distribution even if the original variables themselves are not normally distributed. Suppose a sample is obtained containing a large number of observations, each observation being randomly generated in a way that does not depend on the values of the other observations, and that the arithmetic mean of the observed values is computed. If this procedure is performed many times, the Central Limit Theorem says that the distribution of the average will be closely approximated by a normal distribution. Back to our experiment: we don't know whether or not systolic blood pressure is normally distributed, but since the size of our sample is larger than 30, we can assume normality. So this would be our picture, taking the mean of both groups:

In this context, we want to figure out the probability of finding a result at least as low as 111.56. In other words: what's the probability of having a difference of -9.82 between the means? To know that, we'll check how many standard deviations from the mean we are. And how do we do that? We'll find what is known as a Z-Value.

How to find this value depends on what we're comparing. Suppose you're working on a regression problem and you need to create a test group. You'd like to know if your test group is representative of your dataset, right? You could check that by finding the Z-Value: comparing the mean of your test group with the mean of your train data and dividing the result by the standard deviation of the sampling distribution:

Z = (test mean − train mean) / σ_sampling

This would give us how many standard deviations we are from the mean. Now, it's true that usually we don't know our sampling distribution's standard deviation. That's why, thanks again to the Central Limit Theorem (given that we have a 30+ sample size), we can find it as follows:

σ_sampling = σ / √n

Here σ is the standard deviation of our population. Nonetheless, it's also true that more often than not we won't know the standard deviation of the population either. However, when working with samples, it's usual to work with what is known as a point estimate. In this case, our sample standard deviation is going to be a point estimate of our population standard deviation. In statistics, point estimation involves the use of sample data to calculate a single value which serves as a “best guess” or “best estimate” of an unknown population parameter. For example:

- The sample mean is a point estimate of the population mean
- The sample variance is a point estimate of the population variance

Substituting into our original equation, we then get:

Z = (test mean − train mean) / (s / √n)

where s is the standard deviation of our sample of size n.
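As a quick illustration of this last equation, here is a minimal sketch. The train mean and the test readings below are made up for the example, and for brevity the test split has only 16 values (in practice you'd want the 30+ discussed above):

```python
import numpy as np

# hypothetical train data mean for some feature
train_mean = 121.38

# a small hypothetical test split of the same feature
test = np.array([118.2, 124.5, 119.9, 122.0, 125.3, 117.8, 120.6, 123.1,
                 121.7, 119.4, 126.0, 118.9, 122.8, 120.1, 124.2, 119.6])

n = len(test)
# the sample standard deviation acts as a point estimate of the population's
standard_error = test.std(ddof=1) / np.sqrt(n)

z = (test.mean() - train_mean) / standard_error
print(round(z, 2))  # a value close to 0 means the test mean sits near the train mean
```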

So, as we already saw, in this case we can proceed and find our Z-Value. However, sometimes the conditions will not be met and we won't be able to find it. In those cases we'll have to find what is known as a T-Value. What's the difference between them? In the calculation, practically nothing. The only difference is what we'll do with these values afterwards.

Mind that so far we don't have any probability yet: we just know how many standard deviations our test group mean is away from the mean. We could compute this probability directly, but in real life, what most people do is check what is called a z-table or t-table.

Let’s see first how to use a z-table:

This table can easily be found online. To use it, we take the Z-Value we previously found and look it up along both the horizontal and vertical axes. For example, if we had a Z-Value of 0.23, the table would give us 0.5910: the cumulative probability of finding a value no higher than 0.23 standard deviations above the mean:
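Instead of looking the value up in a printed table, the same number can be pulled from scipy's standard normal distribution:

```python
from scipy.stats import norm

# cumulative probability below z = 0.23, i.e. the number a standard z-table returns
print(round(norm.cdf(0.23), 4))  # 0.591
```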

The probability of finding a value at least as extreme as ours is what is known as the p-value. The following chart is often used as a good reference for this:

Now, if our dataset doesn't meet the requirements for using a z-table (our sample size is smaller than 30 units, the data is not normally distributed, and/or any of the other conditions we saw fails), we can use a t-table instead:

In this table we're going to search for the T-Value within the body of the table instead, and we'll also need our sample size (through the degrees of freedom). Take the following example:

In this case, with a sample size equal to 5 and a T-Value of 2.75, we would have a tail probability of 0.02. This means that our test group mean would have a probability of 2% of being at least AS HIGH. Mind that this table gives us the TAIL probability, which means we would also have a probability of 2% of being at least AS LOW as -2.75 standard deviations from the mean. All in all, then, a probability of 96% of having a mean between -2.75 and 2.75 standard deviations away from our mean.
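The t-table lookup can also be reproduced with scipy. Assuming the example above means a sample of 5 (so 4 degrees of freedom), the upper-tail probability for a T-Value of 2.75 would be:

```python
from scipy.stats import t

# upper-tail probability P(T > 2.75) for a t-distribution with df = 5 - 1 = 4
tail = t.sf(2.75, df=4)
print(round(tail, 3))
```

This gives a value in the same ballpark as the 0.02 read off the table; printed tables are coarser than the exact distribution.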

An important clarification: the equations we saw above apply only when we are comparing a single value with our sample mean. If we wanted to compare the means of two different samples or datasets, the equation to find our Z or T-Value would be different. The following table gives a good reference of what to use in each case:

Amazing, we're almost there! Now that we have solid ground on how this whole hypothesis testing thing works, let's get back to our example. The last thing we saw was:

In our example, we'll be working with the difference between two means, using the following equation:

Z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)

Does it seem like a messy formula? Don't worry, Python is here to save us. We can easily do this using the statsmodels library in the following way:

from statsmodels.stats import weightstats as stests

ztest, pval = stests.ztest(experimental, control, value=0, alternative='two-sided')

And look what I’m giving to the function:

- both groups to be tested
- value = 0: since I’m expecting a difference of zero in between their means
- alternative = 'two-sided': since I want to know the probability of finding a difference at least as low as -9.82, but also at least as high as +9.82

We can now print our results in the following way:

print(ztest)
# -1.8915462966190273
print(pval)
# 0.05855145698013669

This is telling us that we have a probability of 5.855% of finding a value at least as EXTREME as 9.82, either positive or negative, meaning that each of our tails has a probability of 2.9275%.

In a very similar way, if our sample sizes were smaller than 30, we should proceed with a T-Test instead, using the following formula:

t = (x̄1 − x̄2) / (s_p · √(1/n1 + 1/n2))

where s_p is the pooled standard deviation of the two samples.

And for this kind of test, the library we should use is scipy's stats module:

import scipy.stats as stats

t_test_result = stats.ttest_ind(experimental, control)

In a similar way as before, we can get our metrics by calling:

print(t_test_result.statistic)
# -1.8915462966190273
print(t_test_result.pvalue)
# 0.061504240672530394

Mind how, as we said before, both our t_test_result.statistic from scipy and our ztest from statsmodels gave us the same number: -1.89.

We could also plot our t-distribution, centred on zero, with a vertical line at our measured t-statistic, in the following way:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

t_stat = t_test_result.statistic

# generate points on the x axis between -4 and 4:
xpoints = np.linspace(-4, 4, 500)

# use stats.t.pdf to get values of the probability density function for the t-distribution
# the second argument is the degrees of freedom: n1 + n2 - 2
ypoints = stats.t.pdf(xpoints, 50 + 50 - 2, 0, 1)

# initialize a matplotlib "figure"
fig, ax = plt.subplots(figsize=(8, 5))

# plot the curve using matplotlib's plot function:
ax.plot(xpoints, ypoints, linewidth=3, color='darkred')

# plot a vertical line for our t-statistic
ax.axvline(t_stat, color='black', linestyle='--', lw=5)
plt.show()

Wrapping up: we measured a difference in blood pressure of -9.82 between the experimental and control groups, and calculated a t-statistic associated with this difference of -1.89. Our two-tailed p-value is 0.0615, meaning there's a 6.15% chance of finding a difference at least as extreme as the one we measured, given that the true difference in blood pressure between the experimental and control conditions is 0.0. Our null hypothesis states there is no difference between groups, and a t-statistic for no difference between the groups would be 0. Recall that our alternative hypothesis is that the difference between the groups is not 0. This could mean the difference is greater than or less than zero; we have not specified which one. This is known as a 2-tailed t-test and is the one we are currently conducting. So we should add another line to the right of our mean:

# Good versus bad

Now that we understand more about how to perform a hypothesis test, there's only one thing left to see: once we have our p-value, what is a good or a bad p-value? Here's where the concept of confidence levels jumps in: a confidence level is the probability that the value of a parameter falls within a specified range of values. Pay attention to the words 'range of values': they refer to the confidence interval, which, according to Wikipedia, is a type of interval estimate, computed from the statistics of some observed data. The interval has an associated confidence level that, loosely speaking, quantifies the level of confidence that the parameter lies within the interval.

One important thing about the confidence level is that we need to define it prior to examining the data. Otherwise, we would be cheating :). Most commonly, the 95% confidence level is used, although other confidence levels, such as 90% and 99%, can be used as well. As a reference, the two-sided critical Z-Values associated with the most common confidence levels are approximately 1.645 for 90%, 1.96 for 95% and 2.576 for 99%.
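These critical values can be recovered from scipy's standard normal distribution; for a two-sided test, a confidence level c corresponds to the quantile 1 − (1 − c)/2:

```python
from scipy.stats import norm

for conf in [0.90, 0.95, 0.99]:
    # two-sided critical Z-Value for the given confidence level
    z_crit = norm.ppf(1 - (1 - conf) / 2)
    print(f"{conf:.0%} -> z = {z_crit:.3f}")
```

This prints 1.645, 1.960 and 2.576 respectively.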

As we have seen before, for any given Z-Value we'll have a certain tail probability. Whether we take one or both tails depends on our hypothesis.

# Hypothesis testing process in a nutshell

We could wrap up the entire process in the following steps:

- Define hypothesis
- Set confidence level
- Calculate point estimate
- Calculate test statistic
- Find the p-value
- Interpret results
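The steps above can be sketched end to end in a few lines. This is a toy version of the blood pressure experiment with simulated data (the group means, spread and seed below are made up for the illustration), using scipy's two-sample t-test:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)

# 1. hypotheses: H0 says the mean difference is zero, H1 says it is not
# 2. confidence level set before looking at the data: 95%
alpha = 0.05

# simulated blood pressure readings for two groups of 50
control = rng.normal(loc=121, scale=10, size=50)
experimental = rng.normal(loc=112, scale=10, size=50)

# 3-5. point estimates (the sample means), test statistic and two-sided p-value
t_stat, pval = stats.ttest_ind(experimental, control)

# 6. interpret the result against the chosen confidence level
if pval < alpha:
    print(f"Reject H0: the groups differ (p = {pval:.4f})")
else:
    print(f"Fail to reject H0 (p = {pval:.4f})")
```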

# How else can we apply all this?

Testing our train group versus our test group is not the only thing we’d like to do with hypothesis testing in machine learning. Let’s see a few more use cases:

- As we said before, if somehow we know the mean of our population, we could run a proper test to know if we have a representative sample
- Also, when working on a classification problem, we could check whether our feature vectors are meaningfully distinguishable between our different target classes, so that our machine learning model can identify them. To know if that's the case, we could take the individual vector for each (or some) of the features and compare it between the different categories of our target, to confirm if there's a significant difference between them.
- In website A/B testing, once the different versions have been running for a while, we could compare whether there's a significant difference in their performance, for example by taking one specific metric, such as the average click-through rate (CTR), and comparing it between the different versions of the website tested.
- We may also want to check whether any of our features, or even our target variable, has a symmetrical distribution, running a test to check if the mode, mean and median are the same. When the values of mean, median and mode are not equal, the distribution is said to be asymmetrical or skewed. We can do this by finding our t-statistic in the following way:
*t_statistic = (sample_mean - sample_median)/(sample_std/sample_size**0.5)*
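A minimal sketch of that last check, using a deliberately skewed made-up feature (the exponential sample and its size are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
feature = rng.exponential(scale=2.0, size=200)  # right-skewed: mean > median

n = len(feature)
# t-statistic for the mean-median gap, as in the formula above
t_statistic = (feature.mean() - np.median(feature)) / (feature.std(ddof=1) / n ** 0.5)
print(round(t_statistic, 2))  # a large absolute value signals an asymmetric distribution
```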

Well, I think that's more than enough for now. If you enjoyed this story, don't forget to check out some of my latest articles, like 10 tips to improve your plotting skills, 6 amateur mistakes I've made working with train-test splits or Web scraping in 5 minutes. All of them and more are available on my Medium profile. Also, **if you want to receive my latest articles directly in your email, just subscribe to my newsletter :)**

Get in touch also by…

- LinkedIn: https://www.linkedin.com/in/gferreirovolpi/
- GitHub: https://github.com/gonzaferreiro (where all my code’s available)

Thanks for reading!