Constructing heat map for Chi-square test of independence

How to build a correlation matrix type heat map for chi-square test p-values in Python

Shafqaat Ahmad, PMP
Analytics Vidhya
6 min read · Mar 16, 2021


(Figure: heat map)

The Chi-square test is a well-known statistical test used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of categorical variables. In machine learning, the Chi-square test can be used to check the association between categorical variables. Based on the test results we can eliminate those variables that are not strongly associated with the response variable. Alternatively, we can check the association of the independent variables among themselves and drop those that are strongly associated with each other. The idea is the same as Pearson’s correlation coefficient, but the two tests apply to different kinds of variables.

Correlation vs Chi-square test

Pearson’s correlation coefficient is used to illustrate the relationship between two continuous variables, such as years of education completed and income. The Chi-square test of independence determines whether there is an association between two categorical variables, i.e. whether the variables are independent or related; for example, whether education level and marital status are related for the people of some country.
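As a quick illustration of the difference, here is a minimal sketch (the data is a small made-up sample, so the numbers are only illustrative) that computes Pearson’s correlation for two continuous variables and a Chi-square test of independence for two categorical variables using SciPy:

import pandas as pd
from scipy import stats

# Continuous variables: Pearson's correlation coefficient
years_of_education = [10, 12, 12, 14, 16, 16, 18, 20]
income = [30, 38, 35, 45, 55, 52, 70, 85]  # in thousands
r, p_corr = stats.pearsonr(years_of_education, income)
print(f"Pearson r = {r:.2f}, p-value = {p_corr:.4f}")

# Categorical variables: Chi-square test of independence
people = pd.DataFrame({
    "education": ["HighSchool", "HighSchool", "College", "College",
                  "College", "Graduate", "Graduate", "Graduate"],
    "marital": ["Single", "Married", "Single", "Married",
                "Married", "Married", "Single", "Married"],
})
contingency = pd.crosstab(people["education"], people["marital"])
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)
print(f"Chi-square = {chi2:.2f}, p-value = {p_chi:.4f}")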

Why heat map for Chi-square p-values

In this article, I will not discuss the Chi-square test and its properties; there is plenty of material available on the Chi-square test on the internet. This article is about how we can produce a correlation-matrix-style heat map for the Chi-square test of independence. I came across this problem recently when I had to drop some categorical variables before feeding them into a decision tree algorithm.

However, I was unable to find any function, in either R or Python, that can produce a matrix-like heat map of Chi-square test p-values the way we get one for a correlation test.

Both languages do have ways to test the association between variables using the Chi-square test, but considering the number of columns (more than 100 categorical variables), it is cumbersome to check each pair of variables one by one.

Do not confuse “Variable Importance” with Chi-square test

Because problems with categorical variables fall under classification, most people don’t bother with the Chi-square test and prefer the default “variable importance” measure that is available in decision tree algorithms like Random Forest.

Please don’t confuse decision tree variable importance with the Chi-square test of independence: decision tree variable importance is calculated on the basis of Gini impurity at each node split, while the Chi-square test is a statistical test, like correlation, but for categorical variables.
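For contrast, here is a minimal sketch (on a made-up, one-hot-encoded toy dataset, so the column names and values are only illustrative) of how Gini-based variable importance is obtained from a Random Forest; these importances come from node splits, not from a statistical test:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical categorical data, one-hot encoded so the tree model can use it
toy = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "M", "F"],
    "relation": ["Father", "Mum", "Mum", "Father", "Mum", "Father", "Mum", "Father"],
    "Class": ["H", "L", "M", "H", "L", "M", "H", "L"],
})
X = pd.get_dummies(toy[["gender", "relation"]])
y = toy["Class"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Gini-based importances, one value per (encoded) feature
print(pd.Series(rf.feature_importances_, index=X.columns))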

Chi-square test prerequisites

Before running the Chi-square test there are some prerequisites.

Your data must meet the following requirements:

  1. Two categorical variables.
  2. Two or more categories (groups) for each variable.
  3. Independence of observations.
  • There is no relationship between the subjects in each group.
  • The categorical variables are not “paired” in any way (e.g. pre-test/post-test observations).

  4. Relatively large sample size.
  • Expected frequencies for each cell are at least 1.
  • Expected frequencies should be at least 5 for the majority (80%) of the cells (a quick check is sketched after this list).
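The last two requirements can be verified from the expected frequencies returned by scipy.stats.chi2_contingency. Here is a minimal sketch (the function name and column names are my own; it assumes a DataFrame df with two categorical columns):

import pandas as pd
from scipy import stats

def expected_frequency_check(df, col_a, col_b):
    # Build the contingency table and get the expected frequencies
    crosstab = pd.crosstab(df[col_a], df[col_b])
    _, _, _, expected = stats.chi2_contingency(crosstab)
    # Rule of thumb: every expected cell >= 1, and >= 80% of cells >= 5
    all_at_least_1 = (expected >= 1).all()
    share_at_least_5 = (expected >= 5).mean() * 100
    return all_at_least_1 and share_at_least_5 >= 80

# Example usage (hypothetical column names):
# ok = expected_frequency_check(df, "gender", "Relation")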

Build heat map in Python

The following code can be used to build a heat map of Chi-square test p-values.

For demonstration purposes, the dataset is taken from https://www.kaggle.com/aljarah/xAPI-Edu-Data?select=xAPI-Edu-Data.csv. It has 16 categorical variables and one response variable “Class”. The dataset description can be found at the above link. We are not loading all of the independent variables.

import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

# Loading the file
studentdf = pd.read_csv("xAPI-Edu-Data.csv", low_memory=False)
# Extracting column names
column_names = studentdf.columns
# Assigning column names to both the row index and the columns of an empty matrix
chisqmatrix = pd.DataFrame(index=column_names, columns=column_names)
(Figure: matrix of n x n rows and columns)

Above we have constructed an n x n matrix. This matrix will be filled with the p-values of the Chi-square test.

# Setting counters to zero
outercnt = 0
innercnt = 0
for icol in column_names:      # Outer loop
    for jcol in column_names:  # Inner loop
        # Converting to a cross tab, as for the Chi-square test we
        # first have to convert the variables into a contingency table
        mycrosstab = pd.crosstab(studentdf[icol], studentdf[jcol])
        # Getting the p-value and other useful information
        stat, p, dof, expected = stats.chi2_contingency(mycrosstab)
        # Rounding the p-value to 5 decimal places (very small p-values become zero)
        chisqmatrix.iloc[outercnt, innercnt] = round(p, 5)
        # As mentioned above, expected frequencies should be at
        # least 5 for the majority (80%) of the cells.
        # Here we check the expected frequency of each cell
        cntexpected = expected[expected < 5].size
        # Getting the percentage of cells with an expected frequency of at least 5
        perexpected = ((expected.size - cntexpected) / expected.size) * 100

If too many cells (more than 20%) have an expected frequency of less than 5 for a pair of variables, we will ignore the p-value between those two variables while inspecting the heat map visually. We are not ignoring the variables themselves; rather, we don’t trust the p-value between them because of the low expected frequencies, so ideally we will keep both variables. I mark such pairs by assigning them a distinct numeric value, “2”.

        # Flagging pairs where the 80% rule is not met, i.e. fewer than
        # 80% of cells have an expected frequency of at least 5
        if perexpected < 80:
            chisqmatrix.iloc[outercnt, innercnt] = 2  # Assigning 2

        if icol == jcol:
            chisqmatrix.iloc[outercnt, innercnt] = 0.00
        innercnt = innercnt + 1
    outercnt = outercnt + 1
    innercnt = 0
(Figure: matrix of p-values)

Above we can see the matrix of p-values based on the Chi-square test.
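To draw the heat map itself, the p-value matrix built above can be passed to seaborn’s heatmap; a minimal sketch (the figure size, colour map, and annotation settings are my own assumptions) could look like this:

# Converting the matrix to float so seaborn can colour the cells
chisqmatrix = chisqmatrix.astype(float)

plt.figure(figsize=(12, 10))
sns.heatmap(chisqmatrix, annot=True, fmt=".3f", cmap="coolwarm_r", linewidths=0.5)
plt.title("Chi-square test p-values")
plt.show()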

(Figure: heat map of p-values)

Above we can see a correlation-matrix-like heat map. “Class” is the response variable.

Null and alternate hypothesis

The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-square test of independence can be expressed as follows:

H0: “[Variable 1] is independent of [Variable 2]”
H1: “[Variable 1] is not independent of [Variable 2]”

We are using α = 0.05, which corresponds to a 95% confidence level. Based on the above heat map we can draw the following inferences.

Inferences based on p-values

  1. Since the p-value between “gender” and “relation” is less than our chosen significance level of (α = 0.05), we can reject the null hypothesis. We can conclude that there is enough evidence to suggest an association between gender and relation.
  2. Since the p-value between “SectionID” and “Class” is greater than our chosen significance level (α = 0.05), we cannot reject the null hypothesis. We can conclude that there is NOT enough evidence to suggest an association between SectionID and Class.

To conclude, based on the above heat map we can exclude “StageID” and “SectionID” from the final list of variables, as they show no significant association with the response variable.
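That conclusion can also be automated by filtering the column of p-values against the response variable. Here is a small sketch (it assumes the chisqmatrix built above, “Class” as the response column, and α = 0.05):

alpha = 0.05

# p-values of each independent variable against the response variable "Class"
pvals_vs_class = chisqmatrix["Class"].drop("Class").astype(float)

# Variables whose p-value is above alpha (and not flagged with 2) show no
# significant association with the response and are candidates to drop
candidates_to_drop = pvals_vs_class[(pvals_vs_class > alpha) & (pvals_vs_class != 2)]
print(candidates_to_drop)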

How to avoid the situation when expected frequencies are too low

The value “2” above is assigned to those variable pairs where too many cells have an expected frequency of less than 5, so we cannot make any decision about those pairs; to be on the safe side we can keep both variables. Such a situation occurs when there are too many levels within the data. One workaround is clubbing levels, i.e. combining different levels within the same categorical variable. For example, “Raisedhands” has a value between 1 and 100, so we can reduce it to 10 different categories by binning the values, as sketched below.
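Here is a minimal sketch of that clubbing/binning idea using pandas.cut (it assumes the studentdf loaded above and that the column is named "raisedhands" as in the Kaggle file; the bin count and labels are my own choices):

# Binning a 0-100 variable into 10 equal-width categories
studentdf["raisedhands_binned"] = pd.cut(
    studentdf["raisedhands"],
    bins=10,
    labels=[f"bin_{i}" for i in range(1, 11)],
)
print(studentdf["raisedhands_binned"].value_counts())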
The code mentioned in this article is not optimized. One can improve and customize the code according to his/her requirements.
