Pearson’s chi-squared test from scratch with Python

Tobias Roeschl
Published in Analytics Vidhya
6 min read, Feb 22, 2020

After discussing Fisher’s exact test and its implementation in Python in my last article, I now want to turn to another hypothesis test named after another famous statistician: Pearson’s chi-squared test.

Once again, I first want to introduce some theoretical aspects of the test before we implement it in Python from scratch. At the end, we will compare our result to the one returned by an existing Python library function.

Pearson’s chi-squared test is a hypothesis test used to determine whether there is a significant association between two categorical variables in a contingency table. The null hypothesis states that the two variables are statistically independent. It is tested by calculating the probability, under the assumption of independence, of observing counts that deviate from the expected counts at least as much as the counts we actually observed. Therefore, we first have to determine the expected value for each cell in our contingency table.

If two random variables, A and B, are stochastically independent, then their joint probability is equal to the product of their marginal probabilities:
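
In symbols, the product rule and what it implies for cell counts can be written as follows. The notation is introduced here for illustration only and is not taken from the original: n is the grand total, r_i the total of row i, c_j the total of column j, and E_ij the expected count in cell (i, j), with the marginal probabilities estimated from the table as r_i / n and c_j / n.

$$
P(A \cap B) = P(A) \cdot P(B)
$$

$$
E_{ij} = n \cdot \frac{r_i}{n} \cdot \frac{c_j}{n} = \frac{r_i \, c_j}{n}
$$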

In a contingency table, we can equivalently calculate the expected count for each cell by multiplying its row total by its column total and dividing by the grand total. To get a measurement of “discrepancy” we take the squared differences between the observed values and the expected…
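
The calculation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the article's own implementation: the function names and the 2x2 example table are made up, and the statistic uses the standard Pearson definition (each squared difference divided by the expected count, then summed). The p-value helper only covers a 2x2 table, where the statistic has one degree of freedom and the upper tail of the chi-squared distribution has a closed form.

```python
import math


def expected_counts(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_statistic(observed):
    """Pearson's statistic: sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def p_value_one_df(statistic):
    """Upper-tail probability of a chi-squared distribution with 1 degree of freedom.

    With one degree of freedom the statistic is distributed like a squared
    standard normal, so P(X >= x) = erfc(sqrt(x / 2)). Valid for 2x2 tables only.
    """
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical 2x2 contingency table (rows: groups, columns: outcomes).
observed = [[10, 20], [30, 40]]

expected = expected_counts(observed)
statistic = chi_squared_statistic(observed)
p_value = p_value_one_df(statistic)

print("expected counts:", expected)
print("chi-squared statistic:", round(statistic, 4))
print("p-value:", round(p_value, 4))
```

No continuity correction is applied in this sketch. Library functions may apply one by default for 2x2 tables (SciPy's `chi2_contingency` does so unless `correction=False` is passed), so their results can differ slightly from the uncorrected value.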

Tobias Roeschl
Resident physician passionate about data science, statistics and machine learning. Get in touch: www.linkedin.com/in/tobi-roeschl-60430217b