Comprehensive Guide to Chi-Square Tests for Independence

Kirti Bihade
Published in Analytics Vidhya
6 min read · Aug 23, 2019
Photo by Lukas Blazek on Unsplash

It is human nature to look for relationships and dependencies between the different parameters around us.

Here are some examples where we may want to know whether a dependency exists:

We always want some sort of proof before drawing conclusions or inferences. In such scenarios, inferential statistics can help: statistical analysis lets us reach conclusions on questions like the ones above.

Let me present Chi-Square tests for independence: statistical tests used to find out whether there is a relationship between categorical variables.

Let’s quickly understand what categorical variables are:

Imagine data related to houses. A house may be let for free as charity, rented out, or owned and lived in by its owner. The three values FREE | OWNED | RENTED are categories, so HousingType is a categorical variable.

Imagine data related to gender. Gender is a categorical variable with the categories MALE | FEMALE.

However, the math score is a numeric variable: its value is a number rather than a category.
When we want to find a dependency between variables using Chi-Square tests, all the variables involved must be categorical.

This article is in sync with my Kaggle kernel; please refer to the kernel as and when required:

https://www.kaggle.com/kirtibihade/chi-square-on-student-performance

Problem Statement and Solution
The idea is to find out whether there is a dependency between gender and math score.

How do we convert the math score (a numeric variable) into a categorical one?

We can divide the scores into 4 categories using grades.

Let us consider the following:

Can we make an inference from a handful of data? No.

Maybe the females in one class are more competent than the males, and in another class it may be the other way round. So, to stay unbiased, let’s imagine we have a dataset of 1000 students picked at random from the population.

It makes sense to apply statistical analysis to such data. So let’s start by applying one statistical test of independence, the Chi-square test.

I have taken a dataset of student performance from Kaggle. The dataset has 1000 entries with different parameters. For this article, we consider only gender and math score.

Let’s get started.

1 — Chi-square test is used for:

  • Testing the goodness of fit
  • Testing the significance of the association between two attributes (test of independence)
  • Testing homogeneity, or the significance of population variance

We will focus on testing the significance of the association between two attributes, that is, the test of independence. Consider the student performance data used in my Kaggle notebook.

2 — Categorical variables in that data set are

2.1 — Gender

  • Male
  • Female

2.2 — Race /ethnicity

  • Group A
  • Group B
  • Group C
  • Group D
  • Group E

2.3 — Parental level of education

  • Bachelor’s degree
  • Some college
  • Master’s degree
  • Associate’s degree
  • High school
  • Some high school

2.4 — Lunch

  • Free/reduced
  • Standard

2.5 — Test preparation course

  • None
  • Completed

3 — Numerical variables in the dataset are

  • Math score
  • Writing score
  • Reading score

We can convert a numerical variable into a categorical variable by separating the data into bins such as:

  • Low
  • Medium
  • High
  • Excellent

The math score can now be treated as categorical. Similarly, we can convert the writing score, reading score and so on.
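The binning step can be sketched in a few lines of plain Python. The cut points below (40, 60, 80) and the helper name `score_to_category` are assumptions made for illustration; the article does not state the actual grade boundaries used in the Kaggle kernel.

```python
from bisect import bisect_right

# Hypothetical cut points: the article does not give the real grade boundaries.
# Resulting bins: score < 40, 40-59, 60-79, score >= 80.
BIN_EDGES = [40, 60, 80]
BIN_LABELS = ["Low", "Medium", "High", "Excellent"]


def score_to_category(score):
    """Map a numeric score to one of the four category labels."""
    # bisect_right counts how many edges are <= score, which is the bin index.
    return BIN_LABELS[bisect_right(BIN_EDGES, score)]


scores = [35, 40, 59, 72, 80, 100]
categories = [score_to_category(s) for s in scores]
print(categories)
```

In a notebook, pandas `cut` is a common way to do the same thing on a whole column at once.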

4 — How a Chi-Square test helps us to find dependency between Gender and Math score:
Before solving the problem, we have to state our hypotheses.

There are two possibilities:

1. Gender and math score are dependent, or
2. Gender and math score are independent

Terminology for this in Chi-Square tests

Null hypothesis H0 — Math score and gender are independent

Alternate hypothesis HA — Math score and gender are dependent

4.1 — Chi-Square test will help us to reject or accept the null hypothesis.
How does the chi-square test decide whether to reject the null hypothesis or fail to reject it?

The following are the real (observed) counts from the dataset, arranged in a contingency table.

4.2.1 — Observed Frequency Table:
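A contingency table is simply a count of records for every (row category, column category) pair. Here is a minimal sketch of that counting step, using a handful of made-up records rather than the Kaggle data; the function name `contingency_table` is hypothetical.

```python
from collections import Counter

# Made-up records for illustration only; these are NOT rows from the Kaggle dataset.
records = [
    ("female", "High"), ("female", "Low"), ("male", "High"),
    ("male", "Medium"), ("female", "High"), ("male", "Low"),
]


def contingency_table(pairs):
    """Count records per (row category, column category) pair."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # A Counter returns 0 for combinations that never occur.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


rows, cols, table = contingency_table(records)
for name, row in zip(rows, table):
    print(name, dict(zip(cols, row)))
```

With a pandas DataFrame, `pandas.crosstab` produces the same kind of table directly from two columns.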

To calculate the chi-square value, sum (O - E)^2 / E over all cells of the table, where:

O — Observed frequency
E — Expected frequency

To find the expected frequency for each observed frequency:
Expected frequency for Row1 and Column1 = (Row1 total)*(Col1 Total)/Grand Total

4.2.2 — Calculations are as follows for given contingency table

  • Row1 total = 518
  • Row2 total = 482
  • Grand Total = 1000
  • Col1 Total = 274
  • Col2 Total = 576
  • Col3 Total = 7
  • Col4 Total = 143
  • Grand Total = 1000

The grand total of the rows and the grand total of the columns are the same.
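These totals can be re-derived from the observed counts. The counts below are read off the first number in each term of the calculation in 4.2.4; the rows are labelled only Row 1 and Row 2, as in the text.

```python
# Observed counts, taken from the terms of the chi-square calculation in 4.2.4.
observed = [
    [112, 309, 7, 90],   # Row 1
    [162, 267, 0, 53],   # Row 2
]

row_totals = [sum(row) for row in observed]
# zip(*observed) transposes the table so each tuple is one column.
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

print(row_totals, col_totals, grand_total)
```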

4.2.3 — Expected frequency Table

Example: For the first observed value = 112 which is in Row 1 and Column 1

Expected value is

  • Row 1 and Column 1 = (R1 total)(C1 Total)/Grand Total = (518)(274)/1000 = 141.932
  • Row 1 and Column 2 = (R1 total)(C2 Total)/Grand Total = (518)(576)/1000 = 298.368, and so on.
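The same rule fills in the whole expected frequency table. A short sketch using the row and column totals listed in 4.2.2:

```python
# Totals as listed in 4.2.2.
row_totals = [518, 482]
col_totals = [274, 576, 7, 143]
grand_total = 1000

# Expected count for each cell = (row total) * (column total) / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for row in expected:
    print([round(e, 3) for e in row])
```

The first row reproduces 141.932 and 298.368 from the worked example; the remaining cells are the expected values that appear in the chi-square calculation.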

4.2.4 — Calculated chi-square value =
[(112 - 141.932)^2/141.932] + [(309 - 298.368)^2/298.368] + [(7 - 3.626)^2/3.626] + [(90 - 74.074)^2/74.074]
+ [(162 - 132.068)^2/132.068] + [(267 - 277.632)^2/277.632] + [(0 - 3.374)^2/3.374] + [(53 - 68.926)^2/68.926]
= 6.31 + 0.37 + 3.13 + 3.42 + 6.78 + 0.407 + 3.374 + 3.67 = 27.461
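The sum can be checked programmatically. This sketch recomputes the expected counts at full precision, so the result (roughly 27.5) can differ in the second decimal place from a hand calculation that rounds each term first.

```python
observed = [
    [112, 309, 7, 90],   # Row 1
    [162, 267, 0, 53],   # Row 2
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count
        chi_square += (o - e) ** 2 / e                   # this cell's contribution

print(round(chi_square, 3))
```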

4.2.5 — Calculated Chi square value (27.461) > table value (16.26) for 0.001 significance level and 3 Degrees of freedom
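Instead of a printed table, the comparison can also be made through the upper-tail probability (p-value). For exactly 3 degrees of freedom the chi-square tail has a closed form that needs only the standard library. The helper name `chi2_sf_df3` is hypothetical, and the formula is valid for 3 degrees of freedom only; for other cases a library routine such as `scipy.stats.chi2.sf` is the usual choice.

```python
import math


def chi2_sf_df3(x):
    """P(X > x) for a chi-square variable with 3 degrees of freedom.

    Closed form for df = 3 only:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


table_value = 16.26   # critical value quoted in the article for alpha = 0.001
calculated = 27.461   # chi-square value from the hand calculation

# Tail probability at the table value should be close to 0.001.
print(chi2_sf_df3(table_value))
# Tail probability at the calculated value is far smaller than 0.001.
print(chi2_sf_df3(calculated))
```

A calculated value above the table value is equivalent to a p-value below the chosen significance level.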

5 — How to decide significance level or alpha value?

  • Researchers can choose the significance level, for example 0.05, 0.01 or 0.001. You can refer to a chi-square distribution table for the matching critical value.

Basically, the significance level is a measure of how certain we want to be about our results: a low significance level corresponds to a low probability that the experimental results happened by chance, and vice versa.

  • Degrees of freedom are calculated as (number of rows - 1)(number of columns - 1) = (2 - 1)(4 - 1) = 3
  • Degrees of freedom = 3 for the given contingency table.
  • A 0.01 significance level corresponds to a 99% confidence level; the 0.001 level used in this article corresponds to 99.9%.
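Both quantities are one-line computations. A small sketch, with the function name `degrees_of_freedom` chosen for illustration:

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a test of independence on an n_rows x n_cols table."""
    return (n_rows - 1) * (n_cols - 1)


dof = degrees_of_freedom(2, 4)          # 2 rows, 4 columns

alpha = 0.001                           # significance level used in 4.2.5
confidence_percent = (1 - alpha) * 100  # confidence level = 1 - alpha

print(dof, confidence_percent)
```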

6 — Check for test

According to the Chi-Square test of independence, if the calculated value is greater than the table value, we reject the null hypothesis.

Our null hypothesis was that math score and gender are independent. Since the calculated Chi-square value exceeds the table value, the null hypothesis is rejected and we can say that:

Math score and gender are dependent.
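Putting the steps together, the whole check fits in one small function. This is a sketch under the article's setup, not the code from the Kaggle kernel: the name `chi_square_statistic` is hypothetical, and the table value 16.26 is the one quoted in 4.2.5.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e

    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, dof


observed = [
    [112, 309, 7, 90],   # Row 1
    [162, 267, 0, 53],   # Row 2
]
TABLE_VALUE = 16.26      # from 4.2.5: alpha = 0.001, 3 degrees of freedom

statistic, dof = chi_square_statistic(observed)
reject_null = statistic > TABLE_VALUE

print(f"chi-square = {statistic:.3f}, dof = {dof}, reject H0: {reject_null}")
```

If SciPy is available, `scipy.stats.chi2_contingency` performs the same computation and also returns the p-value and the expected frequency table.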

This is how we can check dependence or independence between different categorical variables using Chi-square tests.

Please check the Kaggle kernel for the rest of the variables. You can also try this on another dataset.

Kirti Parag Bihade

ML /DL Engineer @Infogen Labs
