Item Response Theory for assessing students and questions (pt. 1)

Luca
Jan 11, 2020


One of the most challenging tasks in the educational domain is accurately assessing students’ knowledge level, which means quantitatively understanding “how good” a student really is. At first sight, this might seem a trivial problem: we all attended school and took exams, every time receiving a mark supposed to measure our skill level. Sometimes these marks are even used for rankings, awards, etc. But what is their actual meaning? Can we be sure that students with the same mark have equal skill levels in a certain topic? No, we cannot: their exams might have contained different questions, or they might have had the same exam but failed different questions. Exam marks only measure how well each student performed on the exam he/she was given, providing only a hint of his/her skill level, which is a non-observable value (technically called a “latent trait”).

In many cases, it is reasonable to use exam results as approximations of the students’ actual skill level: they are easy to explain and don’t require much effort to compute. Still, something better could and should be done. Educational institutions often consider final grades as a metric for evaluating and comparing courses and academic years, but they should instead focus on the knowledge level of the students. Indeed, the objective of educational institutions should be maximizing the knowledge level of the students enrolled or, more precisely, maximizing their learning gain, which is the difference between the final and the initial knowledge level. That is the whole point: students are there to learn, and the only way to measure how well they are doing is to focus on the students’ skill levels, not on their marks.

How can we measure this skill in practice? The concept of skill can only exist together with the concept of difficulty of assessment items (i.e. questions, problems, exams, etc.). Item Response Theory (IRT) [1] is a statistical technique that can be used to estimate latent traits of students and questions. In its simplest implementation, the Rasch model [2], it leverages the answers given by a set of students to a set of assessment items to estimate the skill level of each student and the difficulty of each item; other models estimate additional parameters as well, which we will see later. Once these latent traits are known, they can be used in place of exam grades, since they provide a much more accurate estimation.

The math behind IRT

In order to understand how IRT works, it is necessary to take a look at the math it is built upon. Given the answers provided by a set of students to a set of assessment items, the knowledge levels and the difficulties are estimated so as to maximize the likelihood of the observed results. To do this, for each assessment item a function, named the item response function (i.r.f.), is defined. Given the difficulty of the item and the knowledge level of a student, this function provides the probability of that student answering the item correctly. Several families of item response functions are used in the literature, but the logistic function is the most common one:
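A standard way to write this logistic item response function, using the symbols described just below, is the two-parameter logistic form:

$$P(\text{correct} \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}}$$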

In the formula shown above, theta (θ) indicates the skill level, while b indicates the difficulty of the item. The parameter a indicates the discrimination of the assessment item, but it will not be presented in detail in this post. The item response function can be plotted by evaluating it for all possible skill levels; the i.r.f. of two items of different difficulty are shown in the following figure.
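For readers who want to reproduce such a plot, here is a minimal Python sketch (assuming NumPy and Matplotlib are installed); the difficulty values of the two items are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def irf(theta, b, a=1.0):
    """Logistic item response function: probability of a correct answer
    given skill theta, item difficulty b and discrimination a."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-7, 7, 200)  # range of skill levels to plot
plt.plot(theta, irf(theta, b=-1.0), label="easier item (b = -1)")
plt.plot(theta, irf(theta, b=2.0), label="harder item (b = 2)")
plt.xlabel("skill level (theta)")
plt.ylabel("probability of a correct answer")
plt.legend()
plt.show()
```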

It is important to notice that, even though the plot above represents only skills in the interval [-7, 7], there is no theoretical limit to the skill level of the students nor to the difficulty of the items. The plot shows that, for the same skill level, a student has a lower probability of correctly answering the more difficult item than the easier one (the one in blue). As intuition suggests, very weak students (i.e. with a low skill level) have a very low probability of answering correctly, while strong students have a high probability of answering both questions correctly. Also, it can be noted that students who are “infinitely bad” (i.e. the ones on the left-hand side of the plot) have almost no chance of answering correctly, while students who are “infinitely good” are almost certain to answer correctly.
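Formally, this limiting behaviour follows directly from the logistic form shown earlier (assuming a positive discrimination a):

$$\lim_{\theta \to -\infty} \frac{1}{1 + e^{-a(\theta - b)}} = 0, \qquad \lim_{\theta \to +\infty} \frac{1}{1 + e^{-a(\theta - b)}} = 1$$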

Using calibrated questions to estimate a student’s skill

Thanks to the concept of item response function introduced above, it is possible to leverage the answers given by a student to a set of calibrated items (i.e. items whose difficulty is known) to estimate his/her skill level θ̃. This is done by maximizing the product of the item response functions of the questions that were answered correctly and the complements of the item response functions of the questions that were answered wrongly, with the following formula:
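Reconstructed from the description below, with P_i(θ) denoting the item response function of item i, the estimate can be written as:

$$\tilde{\theta} = \operatorname*{arg\,max}_{\theta} \; \prod_{i \in Q_c} P_i(\theta) \prod_{j \in Q_w} \big(1 - P_j(\theta)\big)$$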

In the equation above, Q_c is the set of questions answered correctly and Q_w is the set of questions answered wrongly.

Let’s assume that a student was shown three different assessment items, whose item response functions are shown in the following image:

Let’s assume that the student answered question A correctly and question B wrongly. Then, the estimated skill level is obtained as shown in the following image.

Lastly, let’s assume that she also answered question C correctly. Thus, the previous estimation is updated and the final estimation is obtained as follows.
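To make the procedure concrete, here is a minimal Python sketch of the maximum-likelihood estimation over a grid of candidate skill levels; items A, B and C and their difficulties are hypothetical, and a unit discrimination is assumed for all items:

```python
import numpy as np

def irf(theta, b, a=1.0):
    """Logistic item response function: probability of a correct answer."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def likelihood(theta, correct_b, wrong_b):
    """Likelihood of the observed answers for each candidate skill in theta:
    product of the i.r.f. for correct answers and of its complement for
    wrong answers."""
    lik = np.ones_like(theta)
    for b in correct_b:
        lik *= irf(theta, b)
    for b in wrong_b:
        lik *= 1.0 - irf(theta, b)
    return lik

grid = np.linspace(-7, 7, 1401)  # candidate skill levels

# Hypothetical calibrated difficulties of items A, B and C.
b_A, b_B, b_C = -1.0, 1.5, 0.5

# Correct answer to A, wrong answer to B.
lik_ab = likelihood(grid, correct_b=[b_A], wrong_b=[b_B])
print("estimate after A and B:", grid[np.argmax(lik_ab)])

# Adding a correct answer to C updates the estimate.
lik_abc = likelihood(grid, correct_b=[b_A, b_C], wrong_b=[b_B])
print("estimate after A, B and C:", grid[np.argmax(lik_abc)])
```

The shape of the likelihood curve around its maximum also gives a feel for the uncertainty of the estimate: the flatter the curve, the less certain the estimation, which ties into the point below.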

An important advantage of this approach is that it provides not only an estimation of the skill level, but also the uncertainty of that estimation: the wider the final probability distribution, the more uncertain the estimation is. This information can be leveraged to understand how many questions must be given to a student before obtaining an estimation whose uncertainty is below a desired threshold.

What we can do with IRT

As I mentioned in the introduction of this post, the grade obtained with the standard approach is well motivated in many cases: indeed, it can easily be explained and is fast to compute, way faster than an IRT estimation. Also, it does not come with an uncertainty value, which, on the one hand, is a limitation but, on the other, frees the instructors from having to deal with it. However, IRT could still provide huge advantages from other points of view, which are not directly linked to the grade each student is given.

Indeed, it can be used as a measure of the effectiveness of learning material and, in general, of whole courses: by measuring the skill level of the students before and after consuming the material, it is possible to see the impact of each piece of content and therefore which ones have to be improved and which ones are already effective.

Moreover, IRT could be used to check whether the exams given for the same course in different terms/academic years have similar difficulties or are, on the contrary, unbalanced.

Also, having an estimation of the evolution of each student’s skill level throughout the course enables targeted support which is not available otherwise: indeed, it would be possible to understand which students are struggling and possibly at risk of dropping out (which is one of the biggest problems of online learning), and the instructor could focus on them in order to help them overcome their difficulties and successfully complete their path.

Item Response Theory has many more implications and prospective advantages, but they are not introduced here, as they require the concepts of “discrimination”, “guess” and “slip”, which are beyond the scope of this post.

Lastly, it is important to remark that, although IRT was introduced for the educational domain, it is a very general technique that has proved effective in other domains as well. Indeed, it is sufficient to have two sets of items whose elements compete with each other (possibly even a single set whose elements compete with other elements of the same set) to apply IRT, and this has been done in many domains other than education, for instance:

• assessing machine learning models
• matchmaking in video games
• recommender systems
