Examining fairness in CS assessments

Matt J. Davidson
Published in Bits and Behavior · 14 min read · Mar 12, 2021

Fairness in assessments is important not only for its own sake, but also because fair assessments provide more valid information about test-takers. In this post, I describe a statistical method for investigating item fairness called Differential Item Functioning (DIF). I explain the method generally, share our analysis of a final exam in an introductory CS course, and show how to perform DIF analysis using R.

We know that both students and teachers use exam scores to determine how well learning is happening. Researchers might also use exam scores to see how changing instruction or using a new learning tool influenced students’ learning. That’s why it’s important that we use high-quality exams when assessing learning: we have to be able to trust that exams are giving us valid and reliable information about students.

When talking about validity for an exam, we mean that the exam measures (only, or mostly) what we want it to measure. To use an obvious example, if we want to measure programming skills, the results of a test of fraction operations would not be valid. The latest thinking in the psychometric community, led by Michael Kane and others, is that a test’s validity should be thought of as an argument: that is, we should set out claims about the validity of a test, investigate those claims, and examine whether or not they are supported by empirical evidence.

While many things might threaten the validity of exam results, we focus on the fairness of an exam. By “fair” we mean that an exam gives similar results for students who have similar knowledge, regardless of other factors. There are two reasons that fairness is important for exams. First, both students and teachers generally expect that exams are fair, and that their score on an exam is only a reflection of their understanding of the material being tested. Second, an unfair exam has a very weak argument for its validity. If some factor besides a student’s knowledge has a large influence on their score, then the result is not a valid measure of their knowledge.

Fairness in testing is a basic expectation of students and teachers, and an important component of a test’s validity.

But how can we know if test items are fair? And if we find out items are unfair, what should we do with them? In this post and the linked paper, we describe a statistical method to investigate item fairness called Differential Item Functioning, or DIF. At a high level, DIF methods do two things: they match students on their underlying knowledge, and then compare whether students with similar knowledge but a different group membership (like gender, age, or year in school) performed differently on an item. We also briefly discuss what you can do with unfair items.

We’ll focus on DIF methods for dichotomous responses, meaning all items can be marked as correct or incorrect. When comparing two groups, we call the larger group the reference group, while the smaller (usually the group we’re interested in) is called the focal group. If an item seems to be unfair, we call that a DIF item, and we say that it favors one group. DIF methods can find two types of DIF: uniform, meaning that an item favors one group no matter the knowledge level, and non-uniform, meaning that the favored group changes at some point along the knowledge levels.

DIF Methods

All DIF methods must use a matching criterion (how we match test-takers with similar knowledge) and a statistical test (how we determine whether there is DIF). I’ll focus on the logistic regression and likelihood ratio test methods, since we used those in the paper.

The logistic regression (LR) method tries to find DIF items by fitting logistic regression models to the responses. Examinees are matched on their total test score, and three different logistic regression models are fit to the data. The first model assumes no DIF, and only uses an examinee’s total score to model the probability of answering correctly or incorrectly. The second model looks for uniform DIF by adding a term for group membership, and the third model looks for non-uniform DIF by adding a term for the interaction of group membership and total score.

To see if there is any DIF on an item, those three models are compared using a likelihood ratio test. This is a statistical test that compares the fit of two different models to the same data by comparing their likelihoods. For the LR method, the fit of the uniform DIF and non-uniform DIF models is compared to the no DIF model. If either model fits the data better than the no DIF model, then there is evidence of DIF. The magnitude of the DIF can vary from item to item. For LR DIF, this is examined through an effect size. One nice thing about LR DIF is that it can be used with small samples, possibly even as small as 25 examinees per group.
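
To make those three models concrete, here is a minimal sketch of the idea for a single made-up item, using simulated data and base R’s glm() and anova(). The difLogistic() function used in the demonstration below does this for every item at once, and adds purification and effect sizes.

#a minimal sketch of LR DIF for one hypothetical item (simulated data, for illustration only)
set.seed(1)
n <- 200
sim <- data.frame(
  total_score = rpois(n, 12),  #matching criterion: total test score
  group = rbinom(n, 1, 0.5)    #0 = reference group, 1 = focal group
)
sim$item1 <- rbinom(n, 1, plogis(-3 + 0.3 * sim$total_score))  #simulated item response
#model 1: no DIF; model 2: adds group (uniform DIF); model 3: adds the interaction (non-uniform DIF)
no_dif  <- glm(item1 ~ total_score, data = sim, family = binomial)
uniform <- glm(item1 ~ total_score + group, data = sim, family = binomial)
nonunif <- glm(item1 ~ total_score * group, data = sim, family = binomial)
#likelihood ratio tests: does either DIF model fit better than the no DIF model?
anova(no_dif, uniform, test = "LRT")
anova(no_dif, nonunif, test = "LRT")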

Where LR DIF is based on using a logistic regression model to find DIF items, likelihood ratio test (LRT) DIF uses an item response theory (IRT) model. Another important difference is that LRT DIF matches examinees based on an IRT-estimated knowledge level, rather than total test score. Otherwise, the methods are fairly similar.

The LRT DIF method works by fitting two IRT models to the responses: one model assumes that there is no DIF, and another assumes there is. IRT models estimate the difficulty and discrimination (how sharply the item distinguishes examinee knowledge) for each item. The no DIF model assumes the difficulty and discrimination are the same for both groups, while in the DIF model they are allowed to vary for each group. Just like in LR DIF, a likelihood ratio test is used to compare the fit of the DIF and no DIF models: if the DIF model fits better, then there is evidence of DIF.
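
As a sketch in the 2PL form we use later (this notation is mine, not from the paper): with θ for examinee knowledge, and a_j and b_j for item j’s discrimination and difficulty, the no DIF model fits one curve per item, while the DIF model lets the parameters differ by group g.

$$P(\text{correct} \mid \theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}} \quad \text{(no DIF model)}$$
$$P(\text{correct} \mid \theta, g) = \frac{1}{1 + e^{-a_{jg}(\theta - b_{jg})}} \quad \text{(DIF model)}$$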

Once we find evidence of DIF, we can see what type of DIF is present by looking at whether the difficulty or discrimination varies between the groups. If the difficulty is different we have evidence of uniform DIF, while differences in discrimination are evidence of non-uniform DIF. We can examine the magnitude of DIF for the LRT method by looking at the change in expected scores for an item.

All DIF methods work by analyzing responses to each item. Because most exams have more than one item, any statistical test is used multiple times, so it’s important to adjust p-values for multiple comparisons. Also, when the matching criterion is total score, most DIF analysis includes purification. If there are any DIF items on the test, then that total score includes results from biased items. Purification is an iterative process: items flagged for DIF are removed from the total score, the analysis is re-run with the purified score, and this repeats until the set of flagged items stabilizes. Both p-value adjustments and purification are covered in the demonstration code below.

General Procedure for DIF Analysis

  • Check basic psychometric properties of items
  • Evaluate IRT assumptions (if using an IRT DIF method)
  • Apply chosen DIF method(s)
  • Remove or revise any DIF items with large magnitude

I’ll demonstrate this procedure as I walk through the analysis we conducted of a CS1 final exam.

DIF Analysis of a CS1 Final Exam

We applied both LR and LRT DIF methods to see whether the items on a final exam for a CS1 course were fair. I’ll focus on the results of our analysis in this section, and the next section will show how to perform a similar analysis using R.

The final exam we analyzed was similar to many end-of-course exams in CS1: students had 1 hour and 50 minutes to complete the exam, there were 3 items that measured code tracing skills, and 7 that measured code writing skills. Each item was graded on a rubric, meaning that students could get partial credit. We asked the course instructor to decide how many points on an item indicated understanding of the concepts being tested. If a student’s score was above that cutoff, we marked the response correct, and incorrect otherwise. We did this for a few reasons, but the main one was that partial credit models require a much larger sample size to get stable results.
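
As a sketch of that dichotomization step (the object names and cutoff values here are hypothetical, not from our grading data):

#hypothetical rubric scores for two items, and the instructor's cutoff for each item
rubric_scores <- data.frame(item1 = c(0, 2, 4, 5), item2 = c(1, 3, 3, 6))
cutoffs <- c(item1 = 3, item2 = 4)
#mark a response correct (1) if it is above the cutoff, incorrect (0) otherwise
dichotomized <- as.data.frame(mapply(function(scores, cutoff) as.integer(scores > cutoff),
                                     rubric_scores, cutoffs))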

We decided to run DIF on two different groupings of students. The first was on the binary gender students had on file with the CS department, either male or female. The second grouping was on year in school, either first-year or beyond first-year. We thought the binary gender grouping would be interesting because of demonstrated issues in CS programs with retaining non-male students. We chose year in school to see whether more experienced students might perform differently on this final exam.

Basic psychometric properties

We began by examining the difficulty, discrimination, and reliability of our items. Overall reliability, using Cronbach’s alpha, was .77, which is acceptable for a course exam. Difficulty was calculated as the proportion of examinees who got the item correct. Discrimination was the correlation between responses to an item and overall exam score (an item-total correlation). The change in reliability shows whether removing an item would improve the reliability of the test. Ideally you would see a range of difficulties, discrimination values all larger than .30, and no large increases in reliability if an item were dropped, which is what we found for this exam.

Evaluate IRT Assumptions

We also tested IRT assumptions, since we used an IRT-based DIF method. These are local independence, unidimensionality, and functional form. Local independence assumes that each response is due only to the examinee’s ability and the item. This can be verified by making sure that no items depend on the same block of code, for example. Next, we need to consider if the test is unidimensional, meaning that responses are primarily due to a single underlying knowledge or skill. This can be explored through factor analysis. Finally, we need to make sure that the functional form of our chosen IRT model is a reasonable fit to the response data. We can do this by fitting different IRT models and seeing which fits best.
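
As one example of that exploration, here is a sketch of a parallel analysis on tetrachoric correlations with the psych package (written against the itemsonly object built in the code section below); the full factor analysis we used is in the GitHub code linked below.

#parallel analysis on tetrachoric correlations: how many factors do the dichotomous responses support?
library(psych)
fa.parallel(itemsonly, cor = "tet", fa = "fa")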

There are more details about this in the paper, and I provide code below to test it on a sample dataset. For this CS1 final exam, we found evidence that local independence should hold, that the test was unidimensional, and that a 2 parameter IRT model was a good fit to the data.

Apply DIF Methods & Remove or Revise DIF Items

Now that we have confidence our exam did a decent job of measuring students, we can turn to the DIF analysis. You can see full results, including DIF estimates for each item, in the paper. For binary gender, we found no evidence of DIF on this exam. For year in school, we found one item that had significant non-uniform DIF, but the magnitude of the DIF was low, meaning the item did not need to be removed. Here is an item trace showing DIF for that item.

Item trace for the item showing non-uniform DIF in the CS1 final exam.

Item traces show, for all knowledge levels, the probability of a correct answer. The horizontal axis is examinee knowledge (with “average” as 0), and the vertical axis is the probability of a correct answer. Item traces to examine DIF include two lines, one each for the focal and reference groups. This is an easy way to see, in the case of non-uniform DIF above, where the favored group switches. For this item in the CS1 final exam, we can see that beyond first-year students are favored for knowledge roughly below average, while first-year students are favored above that knowledge level.

So why did this item show DIF? Based on the analysis we did here, we don’t know. That is one drawback of DIF methods — they can detect DIF, but they cannot provide much information about why DIF was observed. It can be tempting to speculate, based on the groups being compared, why we might have seen DIF on an item. Speculation like this can be helpful for generating hypotheses that additional research could examine. Such speculation should be done very carefully, because it is often informed (explicitly or implicitly) by our assumptions or stereotypes about the examinee groups. If it’s important to understand why DIF occurred, I recommend conducting cognitive interviews with examinees from various groups, or designing an experiment to control for various possible factors. For example, in our analysis, we might contact examinees to ask how long they studied, and see whether that is related to their year in school.

Running DIF Analysis using R

Now that we’ve covered some basics about DIF methods, I’ll turn to showing how to apply DIF methods using R. All of this analysis will be done using a dataset from the R package difR. The full code can be found here, which also includes code for conducting factor analysis to examine whether the responses are unidimensional.

We begin by loading required packages, and the data from the difR package.

library(tidyverse)
library(mirt)
library(psych)
library(lavaan)
library(difR)
#load the verbal aggression data and drop the Anger covariate, keeping the 24 items plus Gender
data(verbal)
dat <- verbal %>% dplyr::select(-Anger)
items <- colnames(dat[1:24])
#create object with just item responses
itemsonly <- dat %>% dplyr::select(-Gender)

The sample data are responses to a questionnaire about verbal aggression. Examinees either endorsed or did not endorse each item, meaning they are “scored” dichotomously. You can read more about the data by typing ?verbal in the R console.

Basic Psychometric Properties

The function alpha() returns a number of reliability measures, including Cronbach’s alpha.

alpha(itemsonly)$total

The output gives reliability measures for the test as a whole. Cronbach’s alpha (as raw_alpha) is most commonly used. The code on GitHub shows some other ways of estimating reliability. For this example data, it shows that alpha was .87, which means the test has good reliability. If alpha is below .70, the test has serious issues with reliability that should be investigated before the test is used for any purpose. Possible causes could be that the test is too difficult for the examinees, or that the test is measuring multiple things (alpha assumes unidimensionality).

#item difficulty
item_descriptives <- describe(itemsonly) %>% as.data.frame()
itemstats <- tibble(.rows=length(items))
itemstats$item <- items
itemstats$difficulty <- item_descriptives$mean
#adding discrimination and reliability
items_alpha <- alpha(itemsonly)
itemstats$discrimination <- items_alpha$item.stats$r.cor
itemstats$alpha_drop <- items_alpha$alpha.drop$raw_alpha
print(itemstats, n=24)

The code above will generate a table with the item name on the left, then the item’s difficulty, discrimination, and change in reliability. If any item has discrimination less than .30 or would increase reliability if removed, consider removing the item! These statistics indicate that the item is not functioning well.
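
As a quick screen (a sketch using the objects created above), you can filter that table for items showing either warning sign:

#flag items with low discrimination, or items whose removal would raise overall alpha
itemstats %>%
  filter(discrimination < .30 | alpha_drop > items_alpha$total$raw_alpha)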

Evaluate IRT Assumptions

If we want to check for uniform and non-uniform DIF, and our item responses are dichotomous, there are only two IRT models that we should check: the two- and three-parameter models. The two-parameter model (or 2PL) estimates a difficulty and discrimination for each item. The three-parameter model (or 3PL) is a 2PL but with an extra parameter for guessing.
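
In the usual IRT notation (with θ for examinee knowledge, a_j and b_j for item j’s discrimination and difficulty, and c_j for its lower asymptote), the 3PL simply adds a guessing floor to the 2PL curve:

$$\text{2PL:} \quad P(\text{correct} \mid \theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}}$$
$$\text{3PL:} \quad P(\text{correct} \mid \theta) = c_j + (1 - c_j)\,\frac{1}{1 + e^{-a_j(\theta - b_j)}}$$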

#fit the IRT models using the mirt package
twopl_fit <- mirt(itemsonly,model=1,itemtype="2PL",SE=T)
threepl_fit <- mirt(itemsonly,model=1,itemtype="3PL",SE=T)

#comparing the model fits
anova(twopl_fit,threepl_fit)

The output from this compares these models using a likelihood ratio test. If the p-value is < .05, the second model is a significantly better fit to the data. In this case it seems that the 3PL is a better fit to this data than the 2PL. Despite that result, this demonstration code uses the 2PL in the DIF method. This is because the 2PL is a good fit for the vast majority of exams, and we wanted the code to have wide applicability.

Apply DIF Methods

Now we can turn to the chosen DIF methods. Just as we did in the paper, I’ll demonstrate using logistic regression and likelihood ratio test DIF.

### Logistic Regression DIF ###
#the line of code below is only set due to a bug in the difR package with the verbal dataset.
#when running this analysis on another dataset, this value likely will not need to be set.
PVAL <- NA
#this will test for uniform DIF
difLogistic(dat, group="Gender", focal.name=1, purify=T, p.adjust.method = "BH", type="udif")
#this will test for non-uniform DIF
difLogistic(dat, group="Gender", focal.name=1, purify=T, p.adjust.method = "BH", type="nudif")

As mentioned above, this method uses total score as the matching criterion, so we set purify = TRUE so that the total score is not influenced by any DIF items. In addition, we adjust the p-value by setting p.adjust.method = “BH”, which uses the Benjamini-Hochberg p-value adjustment for multiple comparisons.

The first block of the output shows the chi-square statistic (Stat.), p-value (P-value), and adjusted p-value (Adj. P) for each item. If any of the adjusted p-values are below .05, that item is a DIF item. The output provides a handy list of those items after showing the statistics. Below that, we see the effect size for each item, which lets us know the magnitude of the DIF. There are two commonly used effect sizes for LR DIF, which generally give similar results. A guide is included with the output.

Magnitude of DIF

The effect sizes are really helpful for knowing what to do with any DIF items. If the effect size is “negligible” we don’t need to do anything with that item. It is a good idea to review the item to see if there are any obvious aberrations or explanations for the DIF, but otherwise it can be left in the test. If the effect size is “moderate” or “large”, however, we need to remove the item from the test and either revise it or replace it with another item.

### Likelihood ratio test DIF ###
#run using the mirt package
#it requires fitting a `multipleGroup` IRT model to the data, which needs a character vector of group membership
gender_groups <- as.character(dat$Gender)
#fitting the model
dif_gender_model <- multipleGroup(itemsonly, model=1, gender_groups, SE=T)
#this will show results for uniform DIF
DIF(dif_gender_model,which.par = "d",p.adjust = "fdr")
#this will show results for non-uniform DIF
DIF(dif_gender_model,which.par = "a1",p.adjust = "fdr")

To use the likelihood ratio test method, we first have to fit a multipleGroup model to the data. Then we pass that model to the DIF function, the parameters we want to test ("d" for difficulty/uniform DIF, and "a1" for discrimination/non-uniform DIF), and the p-value adjustment method (using Benjamini-Hochberg, which is the default when p.adjust = "fdr").

This provides similar output to the logistic regression. We want to focus on the last three columns, which show the chi-square statistic (X2), p-value (p), and adjusted p-value (adj_pvals). If any of the adjusted p-values are below .05, then the item is a DIF item.
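
If an item is flagged, you can draw an item trace like the one shown earlier for the CS1 exam. Here is a sketch using mirt’s itemplot() on the fitted multipleGroup model; substitute the number or name of whichever item was flagged in your output.

#trace lines for both groups for a single item (item 6 here is just an example)
itemplot(dif_gender_model, item = 6, type = "trace")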

To see the magnitude of DIF, we run the following.

empirical_ES(dif_gender_model)

This displays a lot of different measures for each item, based on expected scores. Unfortunately, there’s no standard effect size for LRT DIF — I’ll discuss the first two, and how we might interpret them, but you can learn about the others by typing ?empirical_ES into your R console.

The first two measures are SIDS and UIDS. Both are the average change in expected scores for focal group examinees as a result of the DIF. Because our items are scored dichotomously (i.e. 1 or 0), this means that the effect size is the expected change in the probability of answering correctly. The difference between SIDS and UIDS is that SIDS averages the change in expected scores, while UIDS shows the change assuming that the focal group was favored across all knowledge levels. This means that SIDS is more appropriate for uniform DIF, while UIDS is more appropriate for non-uniform.

While the effect sizes for LR DIF have established criteria for negligible, moderate, and large effects, the situation is less clear for LRT DIF. However, guidelines developed by Dorans and Kulick for a different DIF technique are likely applicable for the LRT effect sizes. They found that items with an effect > .10 were generally problematic items, and that it was worth inspecting those items further to be either removed or revised.
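
As a rough screen based on that guideline, you could flag items whose SIDS or UIDS exceeds .10. This is a sketch that assumes the empirical_ES() output can be coerced to a data frame with SIDS and UIDS columns; check str() on your own output first.

#flag items whose average change in expected score exceeds the .10 guideline
es <- as.data.frame(empirical_ES(dif_gender_model))
es[abs(es$SIDS) > .10 | abs(es$UIDS) > .10, ]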

DIF Analysis Should be Standard Practice for CS Assessments

DIF analysis is a powerful statistical technique that can help us detect items that are biased. It can be used to check whether items measure students with similar knowledge in the same way, regardless of group membership. We must commit to using DIF methods to analyze assessment data to make sure that our assessments are providing trustworthy results. This is especially true for anyone designing and validating a new assessment or survey tool, but DIF can also be used by instructors to evaluate how well their exams are measuring, as we demonstrate in the paper.

If, as a community, we are committed to the idea of fairness, then we must use DIF as a standard part of test validation.

The full paper has much more detail about what we found in the exam we analyzed, as well as more discussion of IRT assumptions and some technical aspects of DIF methods. I also included links to the presentation we recorded for SIGCSE 2021, as well as the slides in that presentation. Finally, if you want to conduct DIF analysis yourself, the R code linked below is a great place to start! Please reach out if you have any questions about our analysis, the code, or how to apply DIF methods to your own test data!
