Everything You Need to Know About T-Tests (With Python Code)

Madeleine Smithers
Jun 16, 2023


What is a t-test? A t-test is one of the most common ways to test a hypothesis, and we use it to compare means. We run a t-test when we have one or two samples, we don’t know the population standard deviation, and we have a relatively small sample size (roughly n < 30).

What are we looking for with these tests? To explain it a little more precisely than simply “a statistically significant difference,” what we are trying to assess is how likely it is that the difference between means is simply due to the randomness of drawing a sample. This is why we set a significance level, or alpha. An alpha of .05 means that we accept a 5% chance of a type 1 error occurring, also known as a false positive, which corresponds to a 95% confidence level. Another way to put this is that if the null hypothesis were true and we drew a sample from our population one hundred times, only about five of those samples would produce a difference as extreme as the one we observed. So if our result lands in that region, it is very unlikely to be due to random sampling alone, and there is likely something more to investigate.

There are a few different kinds of t-tests:

  1. One-Sample t-test

This kind of t-test compares the observed mean of a sample group to a known or hypothesized population mean. Effectively, you are working with only one sample group. An example of a one-sample t-test might be if a school wanted to test whether their students’ PSAT scores were statistically higher than the known or estimated national average.

  2. Two-Sample t-test (independent)

An independent two-sample t-test compares the means of two sample groups against each other to assess whether there is a statistically significant difference. An example of this would be comparing the mean size of strawberries from a sample from California to the mean size of strawberries from a sample from Oregon.

  3. Two-Sample t-test (paired)

A paired Two-Sample t-test compares the mean of two related samples to test for a significant difference. In this case, the sample subjects are the same, and are being compared under different conditions. This kind of t-test would be used to compare subjects before and after a drug treatment, for example.

Regardless of the type of t-test you are performing, there are five main steps to running one:

(assuming that you already know your sample is roughly normal, and your variable is numeric and continuous)

  1. Set up null and alternative hypotheses
  2. Choose a significance level (alpha)
  3. Calculate the test statistic (t-value)
  4. Determine the critical t-value (find the rejection region)
  5. Compare t-value with critical t-value to determine if we can reject the null hypothesis.
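As a concrete illustration of step 4, we can look up the critical t-value for a two-tailed test with scipy’s percent-point function (the inverse of the CDF). The alpha and degrees of freedom below are just illustrative choices:

```python
import scipy.stats as stats

alpha = 0.05   # significance level
df = 99        # degrees of freedom (n - 1 for a one-sample test)

# Two-tailed test: alpha is split evenly between the two tails,
# so we look up the quantile at 1 - alpha/2
t_critical = stats.t.ppf(1 - alpha / 2, df)
print(round(t_critical, 3))  # roughly 1.984
```

Any t-statistic farther from zero than this critical value falls in the rejection region.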

The alpha is the probability of rejecting the null hypothesis when it is true (getting a false positive). For example, a significance level of .05 indicates a 5% risk of concluding that a difference exists when there is no actual difference (i.e., you drew a sample that was highly unlikely to have been drawn, as shown in the tail ends of the curve). The shaded ends of the curve are called the critical regions for a two-tailed test. The critical region defines how far away our sample statistic must be from the null hypothesis value before we reject the null.

image from Statistics by Jim

Technically, you could determine your results just from looking at a graph like this: if the test statistic lands in a shaded critical region, the result is significant at that alpha (and if the null hypothesis is actually true, that would be a false positive). However, it’s more common to use statistical software to compute a p-value, then compare it to the stated significance level. If your p-value is less than your alpha (significance level), then you can reject your null hypothesis.

Let’s take a look at how to run t-tests with Python:

First, here’s an example of a one-sample t-test run in python using a randomly generated sample:

import numpy as np
import scipy.stats as stats

# Set a random seed for reproducibility
np.random.seed(42)

# Generate example data, with a mean of 10, a standard deviation of 2, and size of 100
data = np.random.normal(loc=10, scale=2, size=100)

# Given population mean (also known as the null mean)
pop_mean = 9.5

# Run the one-sample t-test
results = stats.ttest_1samp(data, pop_mean)
print(results.statistic, results.pvalue)

# Remember that ttest_1samp runs a two-sided test by default.
# If running a one-sided test, use the given p-value / 2
# (and check the sign of the t-statistic).

The output of the above block of code gives us:

t-statistic = 1.609321334000954, and p-value = 0.110730399982307

This means that we cannot reject the null, since our p-value is greater than our alpha of .05.
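Rather than halving the p-value manually, recent versions of SciPy (1.6+) let you pass an `alternative` argument to run the one-sided test directly. A quick sketch using the same sample as above:

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)
data = np.random.normal(loc=10, scale=2, size=100)
pop_mean = 9.5

# Two-sided test (the default)
two_sided = stats.ttest_1samp(data, pop_mean)

# One-sided test: H1 is that the true mean is greater than 9.5
one_sided = stats.ttest_1samp(data, pop_mean, alternative='greater')

# Because the t-statistic is positive, the one-sided p-value
# is exactly half of the two-sided one
print(one_sided.pvalue, two_sided.pvalue / 2)
```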

If we wanted to calculate our t-statistic by hand, we could do so as well:

# If you wanted to calculate the t-statistic by hand
x_bar = np.mean(data)             # sample mean
mean_diff = x_bar - pop_mean      # difference from the null mean
s = np.std(data, ddof=1)          # sample standard deviation
n = len(data)                     # sample size
t_stat_hand = mean_diff / (s / np.sqrt(n))
print(t_stat_hand)                # matches the scipy t-statistic

It also helps to visualize our t-statistic and t-critical to understand what’s happening:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# Set the degrees of freedom for the t-distribution
df = 99

# Set the critical t-value and t-statistic
t_critical = 1.98
t_statistic = 1.60

# Generate x-values for the t-distribution
x = np.linspace(-4, 4, 500)

# Compute the probability density function (PDF) of the t-distribution
y = stats.t.pdf(x, df)

# Create the plot
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(8, 6))

# Plot the t-distribution
sns.lineplot(x=x, y=y, color="b", label="t-distribution")

# Add vertical lines for t_critical and t_statistic
ax.axvline(x=t_critical, color="r", linestyle="--", label="t_critical = 1.98")
ax.axvline(x=t_statistic, color="g", linestyle="--", label="t_statistic = 1.60")

# Set plot title and labels
plt.title("T-Distribution with Critical Value and Test Statistic")
plt.xlabel("T-Score")
plt.ylabel("Probability Density")

# Add legend
plt.legend()

# Show the plot
plt.show()

Now, you may also want to calculate a confidence interval. The first thing we need to do is calculate the standard error, which we can do either by hand or with the scipy.stats package.

#find the standard error by hand
se = s/np.sqrt(n)

#using stats:
stats.sem(data)

Next, we can calculate the interval either by hand or with another stats function:

#by hand, using the rounded t_critical from above
moe = t_critical * se
ci = (x_bar - moe, x_bar + moe)
ci

#calculate with scipy.stats
stats.t.interval(
    confidence=0.95,
    df=n - 1,
    loc=x_bar,
    scale=stats.sem(data),
)

Both methods give essentially the same answer: (9.431906327276199, 10.152707603147427). (The by-hand version differs slightly in the last decimals because we rounded t_critical to 1.98.)

What does a confidence interval of approximately (9.43–10.15) mean? It means that if we repeated this sampling process one hundred times, about ninety-five of the resulting confidence intervals would contain the true population mean. Note that it does not mean there is a 95% chance that the true mean lies in this particular interval.
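We can sanity-check the repeated-sampling interpretation with a quick simulation: draw many samples from a population with a known mean, build a 95% confidence interval from each, and count how often the interval covers the true mean. The population parameters and trial count below are illustrative choices:

```python
import numpy as np
import scipy.stats as stats

np.random.seed(0)
true_mean, true_sd, n, trials = 10, 2, 100, 1000

covered = 0
for _ in range(trials):
    sample = np.random.normal(loc=true_mean, scale=true_sd, size=n)
    lo, hi = stats.t.interval(
        confidence=0.95,
        df=n - 1,
        loc=np.mean(sample),
        scale=stats.sem(sample),
    )
    if lo <= true_mean <= hi:
        covered += 1

print(covered / trials)  # should land close to 0.95
```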

Running two-sample t-tests in Python is almost identical, with only a few small changes to the code.

An example two-sample t-test (independent) with random samples:

# Set a random seed for reproducibility
np.random.seed(42)

# Generate two example data sets
data1 = np.random.normal(loc=10, scale=2, size=100)
data2 = np.random.normal(loc=12, scale=2, size=120)

# Perform independent two-sample t-test
t_statistic, p_value = stats.ttest_ind(data1, data2)

This will return your t_statistic and p-value, just as our previous example did.
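One caveat worth knowing: by default, ttest_ind assumes the two populations have equal variances. If that assumption is shaky, passing equal_var=False runs Welch’s t-test instead, which does not require it. A minimal sketch (the larger spread in the second sample here is a deliberate, made-up illustration):

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)
data1 = np.random.normal(loc=10, scale=2, size=100)
data2 = np.random.normal(loc=12, scale=5, size=120)  # noticeably larger spread

# Welch's t-test: does not assume equal population variances
t_statistic, p_value = stats.ttest_ind(data1, data2, equal_var=False)
print(t_statistic, p_value)
```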

In some cases, you might not have the raw data, but you do have the summary statistics. We can still run a t-test “from stats” with the code below (here, for illustration, we compute those statistics from the full samples):

#run an independent two-sample test from pre-calculated stats instead of full samples:
mean1 = np.mean(data1)
std1 = np.std(data1, ddof=1)
nobs1 = len(data1)
mean2 = np.mean(data2)
std2 = np.std(data2, ddof=1)
nobs2 = len(data2)

stats.ttest_ind_from_stats(mean1, std1, nobs1, mean2, std2, nobs2)

Finally, we can also run a two-sample paired t-test with similar code:

# Example "before" and "after" measurements for the same 50 subjects
before = np.random.normal(loc=30, scale=3, size=50)
after = np.random.normal(loc=31, scale=3, size=50)

stats.ttest_rel(before, after)
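Under the hood, a paired t-test is equivalent to a one-sample t-test on the pairwise differences against a null mean of zero, which is a nice way to build intuition for what ttest_rel is doing:

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)
before = np.random.normal(loc=30, scale=3, size=50)
after = np.random.normal(loc=31, scale=3, size=50)

paired = stats.ttest_rel(before, after)
diffs = stats.ttest_1samp(before - after, 0)

# Both approaches give identical results
print(paired.statistic, diffs.statistic)
print(paired.pvalue, diffs.pvalue)
```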

Overall, using Python to run your t-tests is much faster and more effective than calculating them by hand. However, it can be useful to practice both ways in order to gain a better understanding of what’s going on “under the hood.”
