Item Response Theory for assessing students and questions (pt. 2)

Luca
6 min read · Jan 18, 2020


In a previous post I presented an introduction to Item Response Theory (IRT) and the possibilities it enables, such as obtaining an accurate estimation of the actual knowledge level of each student and gaining better insight into their learning process. Still, that was just scratching the surface of Item Response Theory, since I only presented the simplest IRT model: the Rasch model, often called the “one-parameter model”, which lets us estimate students’ skill levels and questions’ difficulties. This post digs deeper into IRT, presenting more complex models which overcome some of the limitations of the one-parameter model. If you are new to IRT, I’d suggest you have a look at my other post before reading this one.

Overcoming the Rasch model

The Rasch model is called the one-parameter model because it estimates only one parameter for each assessment item: its difficulty. Although difficulty is a fundamental characteristic of each question, it can be empirically observed that there are other aspects that cannot be modeled by difficulty alone. For instance, there are some questions for which the probability of a correct answer is almost unrelated to the skill level of the student answering: basically, all students have the same probability of getting the answer right, regardless of their skill level. Another example is the case of questions that can be correctly answered “by chance” (think about multiple choice questions); even students with an extremely low skill level have a chance of guessing the correct answer. These characteristics can be modeled by adding two parameters, the discrimination and the guess factor, to our IRT model, obtaining the two-parameter and three-parameter IRT models.

Discrimination: assessing the quality of a question

The discrimination, usually denoted a, indicates how rapidly the probability of a correct answer changes with the skill level of the student answering the question. Basically, it represents the steepness of the item response function. The following image shows the item response functions of two questions that have the same difficulty but different discrimination. In the image, the horizontal axis represents the skill level of the students answering, while the vertical axis represents the probability of a correct answer.

Why is it important?

As shown in the figure above, if a question has low discrimination it is not capable of discriminating between skilled and unskilled students, thus it provides very little value towards estimating them. Basically, once a student has answered a question with low discrimination, we know almost nothing about her skill level, regardless of the difficulty of the question. Also, some questions might have a negative discrimination: this means that they are often answered correctly by bad students, while the best students usually miss the correct answer! These are obviously anomalous cases, and the cause often resides in questions that contain errors in the text or in the possible choices (in the case of multiple choice questions). For these reasons, the discrimination is often used to assess the quality of each question, and the ones with low discrimination are either discarded or fixed before being used to assess students. This is similar to what happens with questions that are either too easy or too difficult: indeed, if a question is answered correctly by all the students it is given to, it does not provide any information for evaluating them.

The math behind the discrimination

From a mathematical perspective, adding the discrimination to an IRT model is fairly easy and it can be done by adding the coefficient a to the formula of the item response function:
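$$P(\text{correct} \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}}$$

where θ denotes the student’s skill level and b the question’s difficulty, as in the Rasch model; setting a = 1 gives back the one-parameter model.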

As for the estimation of the parameters, almost nothing changes with respect to the one-parameter model: it is still done via likelihood maximization. The only difference is that now two parameters are being estimated for each question, thus more answers are needed to reach an accurate estimation than the Rasch model requires.
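To make likelihood maximization more concrete, here is a minimal sketch in Python (not taken from the original post or from any specific IRT library) that estimates the discrimination and difficulty of a single item from simulated answers. For simplicity it assumes the students’ skill levels are already known; in a real setting, skills and item parameters are estimated jointly.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def irf_2pl(theta, a, b):
    """Two-parameter item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Simulated data: 500 students with known skill levels answer one item
# whose true (hidden) parameters are a = 1.5 and b = 0.3.
skills = rng.normal(0.0, 1.0, size=500)
answers = rng.random(500) < irf_2pl(skills, a=1.5, b=0.3)  # True = correct answer

def neg_log_likelihood(params):
    a, b = params
    p = np.clip(irf_2pl(skills, a, b), 1e-9, 1 - 1e-9)  # avoid log(0)
    return -np.sum(answers * np.log(p) + (1 - answers) * np.log(1 - p))

# Maximize the likelihood (i.e. minimize its negative) over a and b.
result = minimize(neg_log_likelihood, x0=[1.0, 0.0], method="Nelder-Mead")
print("estimated discrimination and difficulty:", result.x)
```

With enough answers, the estimates should land close to the true values used in the simulation; with only a handful of answers they become unreliable, which is exactly why the two-parameter model needs more data than the Rasch model.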

Guess factor: modeling “correctness by guessing”

Another important parameter is the guess factor, which represents the probability that a student without any knowledge correctly answers the question “by chance”. The clearest example is the case of multiple choice questions but, from a theoretical perspective, all questions might have a guess factor greater than zero. The following figure presents the item response functions of two questions with the same difficulty, the same discrimination and different guess factors.

Having an accurate estimation of the guess factor is very useful, mainly for two reasons. First of all, without an accurate estimation of the guess factor we cannot obtain an accurate estimation of the difficulty and the discrimination of the question, nor of the skill level of the student, thus affecting the effectiveness of the whole IRT model. Secondly, questions that are very easy to guess are, by definition, unable to distinguish between good and bad students (similarly to non-discriminative questions), thus they should be removed from the pool of assessment items.

Guess factor: the math

In order to consider the guess factor c, the formula of the item response function is changed into:
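$$P(\text{correct} \mid \theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}$$

The guess factor c acts as a lower asymptote: even students with an extremely low skill level answer correctly with probability at least c.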

In this formula there is one additional parameter to estimate (hence the name three-parameter model), thus it requires even more samples to reach an accurate estimation; for this reason, a “fixed” guess factor is sometimes added to the model (for instance, c = 0.25 for a multiple choice question with four options), thus including it in the model without having to estimate an additional parameter. From a theoretical point of view, it is also possible to use an IRT model with the guess factor and without the discrimination, but it is very uncommon in practice.

Slip factor: modeling human inconsistency

There is still one parameter that is sometimes added to IRT models, although it is much less frequent than the discrimination and the guess factor: the “slip factor” (s). It is analogous to the guess factor but in the opposite direction: indeed, it represents the probability that a student with perfect knowledge of a topic fails to answer the question correctly. Basically, it can be seen as a parameter capable of modeling the lack of consistency that intrinsically affects every human. Adding the slip factor leads to the four-parameter model, whose item response function becomes:
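$$P(\text{correct} \mid \theta) = c + \frac{1 - s - c}{1 + e^{-a(\theta - b)}}$$

The curve now ranges between c and 1 - s: even students with an extremely high skill level answer correctly with probability at most 1 - s.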

Similarly to what happens for the guess factor, estimating the slip factor (which can be associated either with each question or with each student) requires a lot of interaction between students and questions, because four parameters have to be estimated. Thus, it is sometimes treated as a fixed parameter of the model, without performing an actual estimation.

Conclusion

In this post I completed this quick introduction to Item Response Theory by presenting three additional parameters which can be used to build more complex models:

  • discrimination, which can be considered a measure of the quality of each assessment item and represents how well the item can discriminate between students whose skill level is above or below a certain threshold;
  • guess factor, which represents the probability of correctly answering “by chance”;
  • slip factor, which is the probability that even a student with the highest skill level fails the question.

The following image presents two examples of item response functions, varying the three parameters mentioned above, thus showcasing the effects that these parameters have on the IRT models.
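If you want to play with these effects yourself, the following Python sketch (with purely illustrative parameter values, not taken from the post) evaluates the four-parameter item response function for two hypothetical items: a well-behaved one and a problematic one that is poorly discriminating, easy to guess and prone to slips.

```python
import numpy as np

def irf_4pl(theta, a, b, c, s):
    """Four-parameter item response function: lower asymptote c (guess
    factor), upper asymptote 1 - s (slip factor)."""
    return c + (1.0 - s - c) / (1.0 + np.exp(-a * (theta - b)))

# Two hypothetical items with the same difficulty (b = 0) but very
# different discrimination, guess factor and slip factor.
print("skill   well-behaved item   problematic item")
for theta in np.linspace(-3, 3, 7):
    good = irf_4pl(theta, a=2.0, b=0.0, c=0.0, s=0.0)
    bad = irf_4pl(theta, a=0.5, b=0.0, c=0.25, s=0.15)
    print(f"{theta:5.1f}   {good:17.2f}   {bad:16.2f}")
```

The first item goes from nearly 0 to nearly 1 over a narrow skill range, while the second one stays squeezed between 0.25 and 0.85 and changes slowly, telling us much less about the student who answered it.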
