Handling The Chi Square in Statistics with Python

Nhan Tran
Mar 11, 2019

In this post, I will show you how to calculate the chi-square value using Python.

Before getting to the programming part, please spend a couple of minutes reading my previous post about The Chi Square Statistic for reference.

To begin with, you can choose between many different programming languages for this work, such as Python, R, Java, or Scala. In this post, I will use Python because it is one of the most popular programming languages, it is easy to learn and use, it has a lot of supporting libraries and, of course, I am more familiar with Python than with the others.

Now, take a look at the sample contingency table that we will work with today. The dataset contains a total of 1,000 votes broken down by race (Asian, Black, Hispanic, White, and Other) and party (Democrat, Independent, and Republican):

Methodology 01: manual calculation

First, import the three basic libraries that we need to process the dataset:
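The three libraries are presumably NumPy, pandas, and SciPy's stats module, since those are the ones the rest of this post relies on. The aliases below are the conventional ones:

```python
import numpy as np           # random sampling and array math
import pandas as pd          # DataFrame and crosstab
import scipy.stats as stats  # chi-square distribution and test
```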

Step 01: Create sample data

In this example, I will use numpy's random functions with a seed of 10:
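A minimal sketch of this step follows. The seed of 10 and the 1,000 votes come from the text above, but the category probabilities are my assumption (the post does not state them), so the resulting counts may differ from the figures quoted later:

```python
import numpy as np
import pandas as pd

np.random.seed(10)  # fixed seed so the sample is reproducible

# Assumed category probabilities; the post does not state them.
voter_race = np.random.choice(
    a=["asian", "black", "hispanic", "other", "white"],
    p=[0.05, 0.15, 0.25, 0.05, 0.5],
    size=1000,
)
voter_party = np.random.choice(
    a=["democrat", "independent", "republican"],
    p=[0.4, 0.2, 0.4],
    size=1000,
)

# One row per vote: the voter's race and the party voted for
voters = pd.DataFrame({"race": voter_race, "party": voter_party})
print(voters.head())
```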

Then create a crosstab from the previous DataFrame and assign names to its columns and rows:
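Here is a sketch of the crosstab step using pd.crosstab with margins=True, which appends the totals. The snippet rebuilds the sample data so that it runs on its own; the category probabilities are assumed, not taken from the post:

```python
import numpy as np
import pandas as pd

# Rebuild the sample data (probabilities are assumed)
np.random.seed(10)
races = ["asian", "black", "hispanic", "other", "white"]
parties = ["democrat", "independent", "republican"]
voters = pd.DataFrame({
    "race": np.random.choice(races, p=[0.05, 0.15, 0.25, 0.05, 0.5], size=1000),
    "party": np.random.choice(parties, p=[0.4, 0.2, 0.4], size=1000),
})

# margins=True adds an "All" row and an "All" column holding the totals
voter_tab = pd.crosstab(voters["race"], voters["party"], margins=True)

# crosstab sorts the categories alphabetically, so these names line up
voter_tab.columns = ["democrat", "independent", "republican", "row_totals"]
voter_tab.index = ["asian", "black", "hispanic", "other", "white", "col_totals"]
print(voter_tab)
```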

Step 02: Create Observed table and Expected table:

The observed table can be extracted from our crosstab by excluding row_totals and col_totals. Counting from 0, row_totals is the column at index 3 and col_totals is the row at index 5.
So iloc[0:5, 0:3] in the code snippet below means "take the rows at positions 0 through 4 and the columns at positions 0 through 2 (the end of each slice is excluded) and assign them to a new table named observed".
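The slicing can be sketched as follows. The table is rebuilt first so the snippet runs on its own, again with assumed category probabilities:

```python
import numpy as np
import pandas as pd

# Rebuild the crosstab (probabilities are assumed)
np.random.seed(10)
races = ["asian", "black", "hispanic", "other", "white"]
parties = ["democrat", "independent", "republican"]
voters = pd.DataFrame({
    "race": np.random.choice(races, p=[0.05, 0.15, 0.25, 0.05, 0.5], size=1000),
    "party": np.random.choice(parties, p=[0.4, 0.2, 0.4], size=1000),
})
voter_tab = pd.crosstab(voters["race"], voters["party"], margins=True)
voter_tab.columns = parties + ["row_totals"]
voter_tab.index = races + ["col_totals"]

# Rows 0..4 and columns 0..2: everything except the two totals
observed = voter_tab.iloc[0:5, 0:3]
print(observed)
```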

…after displaying the observed variable, your contingency table should look like this:

observed table

The expected table can be calculated using the formula below:
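For reference, the standard expected-count formula for a contingency table can be written as follows. The symbols are my own notation: R_i is the total of row i, C_j is the total of column j, and N is the total number of observations:

```latex
E_{ij} = \frac{R_i \times C_j}{N}
```

For example, a row with 60 votes and a column with 400 votes out of 1,000 observations gives an expected count of 60 x 400 / 1000 = 24.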

…now take a look back at our code and see what we have:

  • total_rows = voter_tab["row_totals"]
  • total_columns = voter_tab.loc["col_totals"]
  • total_observations = 1000

Alright, here is the code to calculate the expected table:

* Please note that loc is used in the code because col_totals is a row label, not a column name: voter_tab["row_totals"] selects a column, while voter_tab.loc["col_totals"] selects a row.
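One way to apply the formula to every cell at once is an outer product of the row totals and the column totals. Using np.outer here is my choice and not necessarily what the original snippet did, and the data is rebuilt with assumed probabilities so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuild the crosstab (probabilities are assumed)
np.random.seed(10)
races = ["asian", "black", "hispanic", "other", "white"]
parties = ["democrat", "independent", "republican"]
voters = pd.DataFrame({
    "race": np.random.choice(races, p=[0.05, 0.15, 0.25, 0.05, 0.5], size=1000),
    "party": np.random.choice(parties, p=[0.4, 0.2, 0.4], size=1000),
})
voter_tab = pd.crosstab(voters["race"], voters["party"], margins=True)
voter_tab.columns = parties + ["row_totals"]
voter_tab.index = races + ["col_totals"]

total_rows = voter_tab["row_totals"].iloc[0:5]         # column, minus grand total
total_columns = voter_tab.loc["col_totals"].iloc[0:3]  # row, minus grand total
total_observations = 1000

# Outer product gives row_total * col_total for every cell
expected = np.outer(total_rows, total_columns) / total_observations
print(expected)
```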

And then convert the expected table into a DataFrame and assign names to its columns and rows:
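A sketch of the conversion follows. To keep the snippet short and runnable on its own, it starts from a small array of made-up expected counts (purely illustrative, not the post's data) and gives it the same labels as the observed table:

```python
import numpy as np
import pandas as pd

# Hypothetical expected counts, only to illustrate the labelling step
expected = np.array([
    [24.0, 12.0, 24.0],
    [60.0, 30.0, 60.0],
    [100.0, 50.0, 100.0],
    [16.0, 8.0, 16.0],
    [200.0, 100.0, 200.0],
])

# Wrap the array in a DataFrame and label it like the observed table
expected = pd.DataFrame(expected)
expected.columns = ["democrat", "independent", "republican"]
expected.index = ["asian", "black", "hispanic", "other", "white"]
print(expected)
```

Matching labels matter because pandas aligns on index and column names when the observed and expected tables are subtracted later.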

…after displaying the expected variable, your contingency table should look like this:

expected table

Step 03: Calculate the Chi-Square value and Critical value:

Chi-square (χ²) formula
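The chi-square statistic is the sum over all cells of (observed - expected)² / expected. A minimal sketch of that calculation, with the tables rebuilt from assumed category probabilities so it runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuild observed and expected tables (probabilities are assumed)
np.random.seed(10)
races = ["asian", "black", "hispanic", "other", "white"]
parties = ["democrat", "independent", "republican"]
voters = pd.DataFrame({
    "race": np.random.choice(races, p=[0.05, 0.15, 0.25, 0.05, 0.5], size=1000),
    "party": np.random.choice(parties, p=[0.4, 0.2, 0.4], size=1000),
})
voter_tab = pd.crosstab(voters["race"], voters["party"], margins=True)
voter_tab.columns = parties + ["row_totals"]
voter_tab.index = races + ["col_totals"]
observed = voter_tab.iloc[0:5, 0:3]
expected = pd.DataFrame(
    np.outer(voter_tab["row_totals"].iloc[0:5],
             voter_tab.loc["col_totals"].iloc[0:3]) / 1000,
    index=races, columns=parties,
)

# Cell-wise (O - E)^2 / E, then sum over columns and over the result
chi_squared_stat = (((observed - expected) ** 2) / expected).sum().sum()
print(chi_squared_stat)
```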

…after calling print(chi_squared_stat), the output should be:

value of chi_squared_stat

* Note: We call .sum() twice: once to get the column sums and a second time to add the column sums together, returning the sum of the entire 2D table.

The critical value can be calculated using the stats library:
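A short sketch using the percent point function (inverse CDF) of the chi-square distribution in scipy.stats, with the 0.95 quantile and 8 degrees of freedom described in this step:

```python
import scipy.stats as stats

# Value below which 95% of a chi-square distribution with 8 degrees
# of freedom lies, i.e. the cutoff for a 5% significance level
crit = stats.chi2.ppf(q=0.95, df=8)
print(crit)
```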

…after calling print(crit), the output should be:

value of crit

* Note: We use a significance level of 5%, which corresponds to q = 0.95 (a 95% confidence level). The degrees of freedom are 8, which can be calculated using the formula (total rows - 1) x (total columns - 1) = (5 - 1) x (3 - 1).

Now we can give the final conclusion: because the chi-square value is smaller than the critical value (7.16 < 15.51), we fail to reject the null hypothesis that race and party choice are independent.

Methodology 02: calculate using scipy.stats library

First, reuse the results of Steps 01 and 02 to get the observed table, then apply the code snippet below:
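A sketch of this approach using scipy.stats.chi2_contingency, which takes the observed table and computes the expected counts itself. The observed table is rebuilt from assumed category probabilities so the snippet runs on its own, and the variable names on the left are mine:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Rebuild the observed table (probabilities are assumed)
np.random.seed(10)
races = ["asian", "black", "hispanic", "other", "white"]
parties = ["democrat", "independent", "republican"]
voters = pd.DataFrame({
    "race": np.random.choice(races, p=[0.05, 0.15, 0.25, 0.05, 0.5], size=1000),
    "party": np.random.choice(parties, p=[0.4, 0.2, 0.4], size=1000),
})
observed = pd.crosstab(voters["race"], voters["party"])  # no margins needed

# Returns the statistic, p-value, degrees of freedom, expected counts
result = stats.chi2_contingency(observed=observed)
chi_squared_stat, p_value, df, expected_crosstab = result
print(result)
```

A p-value above 0.05 leads to the same decision as a chi-square value below the critical value in the manual method.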

After printing the returned value, the result should look like this:

value of stats

The result contains, in order, chi_squared_stat, p_value, df, and expected_crosstab.

You can find the complete source code as follows:
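The original embedded listing is not available here, so what follows is my own consolidated sketch of the steps in this post rather than the author's exact script. The category probabilities are assumed, so the numbers it prints may differ from the ones quoted above:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Step 01: sample data (category probabilities are assumed)
np.random.seed(10)
races = ["asian", "black", "hispanic", "other", "white"]
parties = ["democrat", "independent", "republican"]
voters = pd.DataFrame({
    "race": np.random.choice(races, p=[0.05, 0.15, 0.25, 0.05, 0.5], size=1000),
    "party": np.random.choice(parties, p=[0.4, 0.2, 0.4], size=1000),
})
voter_tab = pd.crosstab(voters["race"], voters["party"], margins=True)
voter_tab.columns = parties + ["row_totals"]
voter_tab.index = races + ["col_totals"]

# Step 02: observed and expected tables
observed = voter_tab.iloc[0:5, 0:3]
expected = pd.DataFrame(
    np.outer(voter_tab["row_totals"].iloc[0:5],
             voter_tab.loc["col_totals"].iloc[0:3]) / 1000,
    index=races, columns=parties,
)

# Step 03: chi-square value and critical value
chi_squared_stat = (((observed - expected) ** 2) / expected).sum().sum()
rows, cols = observed.shape
dof_manual = (rows - 1) * (cols - 1)
crit = stats.chi2.ppf(q=0.95, df=dof_manual)
print("manual:", chi_squared_stat, "critical:", crit)

# Methodology 02: let scipy do the whole calculation
chi2, p_value, dof, expected_freq = stats.chi2_contingency(observed=observed)
print("scipy:", chi2, "p-value:", p_value, "df:", dof)

# Same decision rule as in the post
if chi_squared_stat < crit:
    print("Fail to reject independence of race and party")
else:
    print("Reject independence of race and party")
```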
