Chi-Square & Fisher's Exact Test in R

Rahardito Dio Prastowo
Jan 31, 2022

While descriptive statistics summarize the characteristics of a data set, inferential statistics let you draw conclusions and make predictions based on your data.

Two examples of inferential methods are the chi-square test and Fisher's exact test. The chi-square test tells you how much difference exists between your observed counts and the counts you would expect if the two variables were unrelated. Fisher's exact test answers the same question by computing the exact probability of a table at least as extreme as the one observed. In short, both are methods to decide whether two categorical variables are related or not.
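To make "observed versus expected counts" concrete, here is a minimal sketch that computes the chi-square statistic by hand. The counts are made up for illustration and are not taken from the employee dataset used later.

```r
# Hypothetical 2x2 table of counts (illustrative numbers only)
observed <- matrix(c(30, 20,
                     10, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Status = c("Resign", "Stay"),
                                   Gender = c("Female", "Male")))

# Expected count per cell if the variables were unrelated:
# row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# The built-in test, without continuity correction, for comparison
builtin <- chisq.test(observed, correct = FALSE)

chi_sq
unname(builtin$statistic)
```

The hand-computed statistic and the one from `chisq.test()` should agree, which shows that the test is just a summary of how far the observed table is from the expected one.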

For example:
Is there any difference in income category based on education level?

As a rule of thumb, the chi-square test is used when the expected frequency in every cell is at least 5. If any expected frequency is below 5, use Fisher's exact test instead.
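The rule of thumb above can be sketched as a small helper. Note that `choose_test()` and its `min_expected` argument are hypothetical names introduced here for illustration, and the two tables contain made-up counts.

```r
# Hypothetical helper (not part of the original script):
# picks a test based on the expected counts of a contingency table
choose_test <- function(tab, min_expected = 5) {
  # suppressWarnings() hides R's warning that the chi-square
  # approximation may be incorrect when expected counts are small
  expected <- suppressWarnings(chisq.test(tab)$expected)
  if (all(expected >= min_expected)) {
    list(method = "chi-square", result = chisq.test(tab))
  } else {
    list(method = "fisher", result = fisher.test(tab))
  }
}

# Illustrative tables with made-up counts
large_counts <- matrix(c(30, 20, 10, 40), nrow = 2)
small_counts <- matrix(c(3, 1, 2, 6), nrow = 2)

choose_test(large_counts)$method
choose_test(small_counts)$method
```

In practice you may prefer to inspect the expected counts yourself, as the case study does, rather than automate the choice.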

Hypothesis Testing:
Ho: There is no relationship between the categorical variables. Knowing the value of one variable does not help you predict the value of the other variable.

Ha: There is a relationship between the categorical variables. Knowing the value of one variable does help you predict the value of the other variable.

When your p-value is less than or equal to your significance level (commonly 0.05), you reject the null hypothesis. Otherwise, you fail to reject it.
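As a minimal sketch, the decision rule can be written directly in R. The table holds made-up counts, and `alpha` and `decision` are names chosen here for illustration.

```r
alpha <- 0.05  # significance level, chosen before looking at the data

# Made-up counts for illustration only
tab <- matrix(c(30, 20, 10, 40), nrow = 2)
test <- chisq.test(tab)

# Compare the p-value with the significance level
decision <- if (test$p.value <= alpha) "reject Ho" else "fail to reject Ho"
decision
```

The wording "fail to reject" is deliberate: a large p-value does not prove that the variables are unrelated, only that the data do not show enough evidence of a relationship.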

Case Study

In this case study, we will use an employee dataset to demonstrate both the chi-square test and Fisher's exact test.

Suppose you work in the HR department of a company with about 700 employees, and your supervisor asks you to analyze the factors that cause employees to leave. Here, your team wants to know whether resigning is related to gender, using the chi-square test.

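# load the dataset (semicolon-separated CSV), reading text columns as factors
dataset <- read.csv("Employee.csv", stringsAsFactors = TRUE, sep = ";")
# open the data in a spreadsheet-style viewer
View(dataset)
# show descriptive statistics for every column
summary(dataset)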
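If you do not have Employee.csv at hand, a synthetic stand-in lets you follow along. This is only a sketch: the column names Status, Gender and Work_Period come from the script in this article, but the category labels and the randomly generated values are assumptions, so the results will not match the ones reported here.

```r
# Synthetic stand-in for Employee.csv (random values, assumed category labels)
set.seed(42)
n <- 700
demo <- data.frame(
  Status = factor(sample(c("Resign", "Stay"), n, replace = TRUE)),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Work_Period = factor(sample(c("<1 year", "1-3 years", "3-5 years", ">5 years"),
                              n, replace = TRUE))
)

str(demo)             # confirm the test variables are factors
colSums(is.na(demo))  # count missing values per column
```

Checking that the variables are factors and have no missing values is a quick sanity check before building contingency tables.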

There are 5 variables in the dataset, but we will only use the status and gender variables for the chi-square test.

# CHI SQUARE TEST
# create table of test variable
dataset2 <- table(dataset$Status,dataset$Gender)
dataset2
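Raw counts can be hard to compare when the groups differ in size, so it may help to look at proportions as well. The sketch below uses made-up counts; with the real data you would pass `dataset2` instead of `tab`.

```r
# Made-up counts for illustration; replace tab with dataset2 for the real data
tab <- as.table(matrix(c(30, 10, 20, 40), nrow = 2,
                       dimnames = list(Status = c("Resign", "Stay"),
                                       Gender = c("Female", "Male"))))

# margin = 2 makes each column (each gender) sum to 1
col_props <- prop.table(tab, margin = 2)
round(col_props, 2)

# add row and column totals to the table of counts
addmargins(tab)
```

Reading down each column then gives the share of employees who resigned or stayed within each gender.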

Based on the table, more female employees resigned than stayed, and more male employees stayed than resigned.
This suggests that gender is related to whether an employee resigns or not. However, to check whether this pattern is statistically significant, we can use the chi-square test.

# check expected frequency
chisq.test(dataset2)$expected
# Chi Square Test
chisq.test(dataset$Status,dataset$Gender)
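One detail worth knowing: for 2x2 tables, `chisq.test()` applies Yates' continuity correction by default (`correct = TRUE`), which makes the statistic slightly smaller. The sketch below, on made-up counts, compares both settings and shows how to pull individual values out of the result.

```r
# Made-up counts for illustration only
tab <- as.table(matrix(c(30, 10, 20, 40), nrow = 2,
                       dimnames = list(Status = c("Resign", "Stay"),
                                       Gender = c("Female", "Male"))))

with_correction    <- chisq.test(tab)                   # default for 2x2 tables
without_correction <- chisq.test(tab, correct = FALSE)  # plain Pearson statistic

with_correction$statistic     # X-squared with Yates' correction
without_correction$statistic  # X-squared without correction
with_correction$parameter     # degrees of freedom
with_correction$p.value       # p-value as a plain number
```

Extracting `$p.value` is handy when the printed output only shows something like `p-value < 2.2e-16`.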

Result Interpretation:
The expected frequency in every cell is > 5, so the chi-square test can be used.
p-value < 2.2e-16
Because the p-value < 0.05, we reject the null hypothesis (Ho) and conclude that resignation status differs by gender.

Ho: there is no difference in resignation status based on gender category
Ha: there is a difference in resignation status based on gender category

Now we will use the status and work period variables for Fisher's exact test.

# FISHER EXACT TEST
# create table of test variable
dataset3 <- table(dataset$Status,dataset$Work_Period)
dataset3

With many categories in the work period variable, it is difficult to spot a pattern just by looking at the table. That is what the chi-square test and Fisher's exact test are for.

# check expected frequency
chisq.test(dataset3)$expected
# Fisher Exact Test
fisher.test(dataset$Status,dataset$Work_Period)
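`fisher.test()` also works on tables larger than 2x2, but the exact computation can become slow or run out of workspace when the table is big. In that case the `simulate.p.value` argument gives a Monte Carlo estimate instead. Here is a sketch on a made-up 2x4 table; the category labels A to D are placeholders, not the real work period categories.

```r
# Made-up 2x4 table with small counts (illustrative only)
tab <- matrix(c(2, 6, 3, 5, 7, 1, 8, 2), nrow = 2,
              dimnames = list(Status = c("Resign", "Stay"),
                              Work_Period = c("A", "B", "C", "D")))

# Exact p-value
exact <- fisher.test(tab)

# Monte Carlo estimate; B is the number of simulated tables
set.seed(1)
simulated <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)

exact$p.value
simulated$p.value
```

The two p-values should be close to each other. The simulated one varies slightly with the random seed.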

Result Interpretation:
Some cells have an expected frequency < 5, so we use Fisher's exact test.
p-value = 0.0004514
Because the p-value < 0.05, we reject the null hypothesis (Ho) and conclude that resignation status differs by work period.

Ho: there is no difference in resignation status based on work period category
Ha: there is a difference in resignation status based on work period category

That's how to apply the chi-square test and Fisher's exact test in R using a simple dataset. Hopefully it is easy to follow for everyone who needs this explanation. The R script is available at the link below.

https://github.com/RaharditoDio/Hypothesis-Testing/blob/main/Chi-Square%20_%20Fisher%20Exact%20Medium.R
