Hypothesis Testing in Python Part 3: Proportion Tests and Chi-Square Analysis

JABERI Mohamed Habib

Hypothesis testing is a fundamental part of statistical analysis: it lets us make inferences about populations from sample data. In this article, we’ll look at two tests for proportions, the one-sample and the two-sample proportion test. We’ll also explore two chi-square tests, the chi-square test of independence and the chi-square goodness of fit test. Along the way, we’ll demonstrate how to perform each test in Python. You can follow along with this notebook on Kaggle for a hands-on experience.

One-Sample Proportion Tests

What is a Proportion?

A proportion represents the fraction of the total that possesses a particular attribute. For instance, if we want to know the proportion of voters who support a specific candidate, the proportion is the number of supporters divided by the total number of voters surveyed.

Test for Single Proportions

The one-sample proportion test assesses whether the population proportion behind a single sample differs from a known or hypothesized value. Use it when you want to compare an observed sample proportion against a standard or expected proportion.

Python Implementation:

from statsmodels.stats.proportion import proportions_ztest

# Sample data
count = 30 # Number of successes
nobs = 100 # Number of observations
value = 0.3 # Hypothesized proportion

stat, p_value = proportions_ztest(count, nobs, value)
print(f'Statistic: {stat}, p-value: {p_value}')
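To make the calculation behind the test visible, here is a minimal sketch of a one-sample z-test written with the standard library only. The helper name `one_sample_prop_ztest` and the sample numbers (36 successes out of 100) are illustrative assumptions, not part of the original notebook. This sketch uses the hypothesized proportion in the standard error, as most textbooks do. By default, `proportions_ztest` instead bases the variance on the sample proportion (its `prop_var` argument is `False`), so the two can give slightly different statistics unless you pass `prop_var=value`.

```python
import math

def one_sample_prop_ztest(count, nobs, p0):
    """Two-sided one-sample z-test for a proportion (hypothetical helper)."""
    p_hat = count / nobs                      # observed sample proportion
    se = math.sqrt(p0 * (1 - p0) / nobs)      # standard error under the null
    z = (p_hat - p0) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)), written with the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative data: 36 successes in 100 trials, tested against p0 = 0.3
z, p_value = one_sample_prop_ztest(36, 100, 0.3)
print(f"z: {z:.4f}, p-value: {p_value:.4f}")
```

A p-value above your chosen significance level (commonly 0.05) means the sample does not give enough evidence to reject the hypothesized proportion.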

Two-Sample Proportion Tests

Test of Two Proportions

The two-sample proportion test compares the proportions of two independent samples to determine if there is a significant difference between them. This is useful in scenarios such as comparing the success rates of two different treatments or the preferences of two different groups.

Python Implementation:

from statsmodels.stats.proportion import proportions_ztest

# Sample data
count = [30, 40] # Number of successes in each sample
nobs = [100, 120] # Number of observations in each sample

stat, p_value = proportions_ztest(count, nobs)
print(f'Statistic: {stat}, p-value: {p_value}')
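As a cross-check on the two-sample call, the sketch below computes the pooled two-proportion z-statistic by hand for the same counts. The helper name `two_sample_prop_ztest` is an assumption made for illustration. It uses the pooled proportion under the null hypothesis of equal proportions, which, as far as I know, is also what `proportions_ztest` does for two samples when the hypothesized difference is zero.

```python
import math

def two_sample_prop_ztest(count1, nobs1, count2, nobs2):
    """Two-sided pooled z-test for the difference of two proportions (hypothetical helper)."""
    p1 = count1 / nobs1
    p2 = count2 / nobs2
    # Pooled proportion: the best estimate of the common proportion if H0 is true
    p_pool = (count1 + count2) / (nobs1 + nobs2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Same data as the example above: 30/100 versus 40/120
z, p_value = two_sample_prop_ztest(30, 100, 40, 120)
print(f"z: {z:.4f}, p-value: {p_value:.4f}")
```

The negative sign of the statistic only reflects the order of the samples (the first proportion is the smaller one); the two-sided p-value is unaffected by it.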

Chi-Square Test of Independence

The chi-square test of independence evaluates whether two categorical variables are independent. It works on a contingency table of observed counts and compares them with the counts we would expect if the two variables were unrelated.

The Chi-Square Distribution

The chi-square distribution is the theoretical distribution of the test statistic under the null hypothesis, and it is used to determine the critical value or p-value for the test. Its shape depends on the degrees of freedom: for a test of independence on a table with r rows and c columns, that is (r - 1) × (c - 1); for a goodness of fit test with k categories, it is k - 1.
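The link between degrees of freedom and the critical value can be sketched with `scipy.stats.chi2`. The table dimensions (2 rows, 3 columns) and the significance level of 0.05 are assumptions chosen for illustration.

```python
from scipy.stats import chi2

# Assumed table shape for illustration: 2 rows by 3 columns
n_rows, n_cols = 2, 3
dof = (n_rows - 1) * (n_cols - 1)

alpha = 0.05  # assumed significance level
# Critical value: the point with probability alpha to its right
critical_value = chi2.ppf(1 - alpha, dof)
print(f"dof: {dof}, critical value at alpha={alpha}: {critical_value:.4f}")

# The critical value grows with the degrees of freedom
for d in (1, 2, 5, 10):
    print(f"dof={d:>2}: critical value = {chi2.ppf(1 - alpha, d):.3f}")
```

A test statistic larger than the critical value leads to rejecting the null hypothesis at that significance level.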

How Many Tails for Chi-Square Tests?

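Chi-square tests are almost always right-tailed. The test statistic sums squared differences between observed and expected frequencies, so it is never negative, and a deviation in either direction makes it larger. Evidence against the null hypothesis therefore appears only in the upper tail of the distribution.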
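A small sketch can show why only the upper tail matters: because deviations are squared, counts that overshoot or undershoot the expectation by the same amount produce the same statistic. The counts below are made up for illustration, and `chi_square_statistic` is a hypothetical helper.

```python
import numpy as np
from scipy.stats import chi2

def chi_square_statistic(observed, expected):
    """Sum of squared deviations, each scaled by its expected count (hypothetical helper)."""
    return float(np.sum((observed - expected) ** 2 / expected))

expected = np.array([20, 20, 20])
over = np.array([26, 17, 17])    # first category above expectation
under = np.array([14, 23, 23])   # first category below expectation by the same amount

stat_over = chi_square_statistic(over, expected)
stat_under = chi_square_statistic(under, expected)

dof = len(expected) - 1
# Right-tail probability: chance of a statistic at least this large under H0
p_value = chi2.sf(stat_over, dof)
print(f"over: {stat_over:.2f}, under: {stat_under:.2f}, p-value: {p_value:.4f}")
```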

Performing a Chi-Square Test

Python Implementation:

import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table
data = [[10, 20, 30], [6, 9, 17]]
table = pd.DataFrame(data)

stat, p_value, dof, expected = chi2_contingency(table)
print(f'Statistic: {stat}, p-value: {p_value}, dof: {dof}')
print(f'Expected frequencies: \n{expected}')
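To see where the expected frequencies come from, the following sketch rebuilds them by hand with NumPy for the same table: each expected count is the row total times the column total divided by the grand total. This is a hand computation for illustration, not a replacement for `chi2_contingency`. Note that `chi2_contingency` applies Yates' continuity correction by default, but only when there is one degree of freedom (a 2×2 table), so it should not affect this 2×3 example.

```python
import numpy as np

observed = np.array([[10, 20, 30], [6, 9, 17]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Expected counts under independence, via broadcasting to shape (2, 3)
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic and degrees of freedom
stat = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"Expected frequencies:\n{expected.round(3)}")
print(f"Statistic: {stat:.4f}, dof: {dof}")
```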

Chi-Square Goodness of Fit Tests

The chi-square goodness of fit test checks whether an observed frequency distribution differs from a theoretical one. It operates on counts per category, so it suits categorical data directly, for example a set of hypothesized category probabilities. To test against a continuous distribution such as the normal, the data must first be grouped into bins.

Visualizing Goodness of Fit

Visualization helps to understand how well the observed data matches the expected distribution. Using bar charts or histograms can make this comparison more intuitive.
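One possible way to draw such a comparison is a grouped bar chart with Matplotlib. This is a sketch under assumptions: the category labels are hypothetical, and the counts are the same ones used in this article's goodness of fit example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

categories = ["A", "B", "C"]            # hypothetical category labels
observed = np.array([10, 20, 30])
expected = np.array([15, 15, 30])

x = np.arange(len(categories))
width = 0.4                              # width of each bar

fig, ax = plt.subplots()
ax.bar(x - width / 2, observed, width, label="Observed")
ax.bar(x + width / 2, expected, width, label="Expected")
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel("Count")
ax.set_title("Observed vs. expected counts")
ax.legend()
fig.tight_layout()
```

In a notebook, the figure is displayed automatically at the end of the cell; in a script, call `fig.savefig(...)` with a path of your choice.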

Performing a Goodness of Fit Test

Python Implementation:

import numpy as np
from scipy.stats import chisquare

# Observed data
observed = np.array([10, 20, 30])
# Expected data
expected = np.array([15, 15, 30])

stat, p_value = chisquare(observed, f_exp=expected)
print(f'Statistic: {stat}, p-value: {p_value}')
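In practice, the null hypothesis is often stated as proportions rather than counts. The sketch below, which assumes hypothetical proportions of 25%, 25%, and 50%, converts them to expected counts by scaling with the sample size and then checks the result of `chisquare` against a hand computation. Scaling this way keeps the observed and expected totals equal, which `chisquare` expects (to my knowledge, recent SciPy versions raise an error when the totals differ).

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([10, 20, 30])
hypothesized_props = np.array([0.25, 0.25, 0.50])  # hypothetical null proportions

# Scale proportions by the sample size so both arrays sum to the same total
expected = hypothesized_props * observed.sum()     # [15, 15, 30]

stat, p_value = chisquare(observed, f_exp=expected)

# Hand computation for comparison: k - 1 degrees of freedom
manual_stat = float(np.sum((observed - expected) ** 2 / expected))
manual_p = chi2.sf(manual_stat, len(observed) - 1)

print(f"chisquare: statistic={stat:.4f}, p-value={p_value:.4f}")
print(f"manual:    statistic={manual_stat:.4f}, p-value={manual_p:.4f}")
```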

Conclusion

Proportion tests and chi-square tests are powerful tools for analyzing categorical data. Whether you’re comparing proportions, checking the independence of two variables, or testing how well counts fit an expected distribution, Python’s statsmodels and SciPy libraries let you run these tests in a few lines. Used carefully, these methods help you draw meaningful conclusions from your data.

Feel free to experiment with the provided notebook on Kaggle to understand the concepts better and apply them to your datasets. Happy analyzing!

Check the next part, Part 4, to continue your learning journey.

Appreciation for Exploring My Work

Thank you for taking the time to explore my work. If you found it valuable, please consider showing your support through an upvote or leaving a comment/feedback to help improve the notebook.

Contact

LinkedIn | GitHub | Kaggle | DataCamp


JABERI Mohamed Habib

iOS App Developer with 8+ years' experience. Expert in Swift, Xcode, and tech writing.