What makes a good standardised test question?

My brother and I run www.bmat.ninja, a test prep platform for UK medical school applicants sitting the BMAT. The product offers, amongst other things, ~1300 practice questions for students to do. We’d like our questions to be a good representation of real BMAT questions, and so we thought it would be useful to analyse the ~250,000 questions completed on our site in the last admissions cycle to determine how “good” our practice questions are.

What makes a good question?

To figure out how “good” a question is, we first need to define “good”. For those reading without an intimate knowledge of the UK medical school admissions process, the BMAT is a test that students have to sit if they’re applying to study Medicine at certain universities. As an admissions test, the goal of the BMAT is to help universities distinguish between different applicants. If a question is so easy that every applicant answers it correctly, it’s a “bad” question, since it doesn’t help you tell applicants apart. The same goes for a question so hard that nobody gets it right.

A “good” question, then, is one which “strong” applicants perform well on but “weak” applicants perform poorly on. Thankfully, we can now use the term good question without the quotation marks around the good. Regretfully though, we now have quotation marks around strong and weak so we need to define “strong” and “weak” applicants before we can do away with the obscure punctuation.

What makes a strong applicant?

Since we have no prior knowledge about how well a student using our site performs academically outside of the site, we have to be a bit incestuous and use the same data to classify students as we will to classify questions. This might be slightly dodgy statistical practice, but I think our sample size of ~250,000 is large enough for this not to make too much of a difference.

The BMAT is used by a handful of UK universities, which we can group as follows. These groups are based on information provided by these universities about BMAT scores for typical applicants and anecdotal data from people that we know at these universities. The charts in the BMAT 2015 Examiners’ Report were used to determine percentiles for each group.

Instead of simply classifying students as “strong” or “weak”, it makes more sense to put them into the 3 different groups above, plus an extra group for students in the 0–29th percentile.

To actually implement this using our data, we ranked the students based on the proportion of questions that they answered correctly on our site. We then used the percentiles above to classify them, putting the top 10% of students together, the next 37% together, and so on. We are, of course, assuming that the students using our site are representative of the wider pool of applicants.
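As a rough illustration of that ranking step, here is a minimal Python sketch. The function name `assign_groups` and the input shape (a dict of student id to proportion correct) are made up for this example, not taken from our actual codebase. The 90th and 29th percentile cut-offs come from the text; the 53rd percentile cut-off is inferred from "the top 10%, then the next 37%".

```python
from collections import Counter


def assign_groups(accuracy_by_student):
    """Map each student id to 'A', 'B', 'C' or 'D' based on accuracy rank.

    accuracy_by_student: dict of student id -> proportion answered correctly.
    Cut-offs: A = 90-100th percentile, B = 53-90th, C = 29-53rd, D = 0-29th.
    The 53 is an inferred value (top 10%, then the next 37%).
    """
    # Rank from weakest to strongest; ties keep their insertion order.
    ranked = sorted(accuracy_by_student, key=accuracy_by_student.get)
    n = len(ranked)
    groups = {}
    for rank, student in enumerate(ranked):
        percentile = 100 * rank / n  # 0 for the weakest student
        if percentile >= 90:
            groups[student] = "A"
        elif percentile >= 53:
            groups[student] = "B"
        elif percentile >= 29:
            groups[student] = "C"
        else:
            groups[student] = "D"
    return groups


# Toy data: 100 hypothetical students with accuracies 0.00, 0.01, ..., 0.99.
toy_accuracy = {f"s{i}": i / 100 for i in range(100)}
toy_groups = assign_groups(toy_accuracy)
print(Counter(toy_groups.values()))
```

With 100 students the group sizes come out as 10, 37, 24 and 29, matching the percentile bands.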


How do we figure out which questions are good?

Let’s label our 4 groups A, B, C, and D in descending percentile order (so A is the 90–100th percentile group and D is the 0–29th percentile group).

By our earlier definition, a good question is one that lets us distinguish between the groups: Group A performs better than Group B, which performs better than Group C, which (you guessed it) performs better than Group D.

To actually implement this using our data, we considered each question individually, and worked out the proportion of students from each group that answered the question correctly. What we then needed to do was to quantify the extent to which the question allows us to distinguish between the groups, by assigning each question some kind of ‘score’ — a single numerical value. Luckily, this just requires some basic arithmetic:

Let pA, pB, pC, and pD be the success rates of each group on a particular question, e.g. if pA=0.4, then 40% of students in Group A answered the question correctly. Let

d1 = pA - pB, d2 = pB - pC, d3 = pC - pD

These ‘deltas’ are the differences in the success rates of adjacent pairs of groups, so if pA=0.4 and pB=0.2, then d1=0.2, meaning that Group A’s success rate on the question was 20 percentage points higher than Group B’s.
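To make the bookkeeping concrete, here is a small Python sketch of how the per-group success rates and the deltas could be computed for a single question. The data layout (a list of `(student_id, is_correct)` pairs plus a student-to-group dict) and the function names are assumptions for illustration only.

```python
from collections import defaultdict


def success_rates(attempts, groups):
    """Return {'A': pA, 'B': pB, 'C': pC, 'D': pD} for ONE question.

    attempts: iterable of (student_id, is_correct) pairs (hypothetical layout).
    groups:   dict of student_id -> 'A' | 'B' | 'C' | 'D'.
    A group with no attempts is left out of the result.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for student, is_correct in attempts:
        group = groups[student]
        total[group] += 1
        correct[group] += int(is_correct)
    return {g: correct[g] / total[g] for g in "ABCD" if total[g]}


def deltas(p):
    """d1, d2, d3 as defined above; raises KeyError if a group is missing."""
    return (p["A"] - p["B"], p["B"] - p["C"], p["C"] - p["D"])


# Toy example: two students per group.
toy_groups = {"a1": "A", "a2": "A", "b1": "B", "b2": "B",
              "c1": "C", "c2": "C", "d1": "D", "d2": "D"}
toy_attempts = [("a1", True), ("a2", True), ("b1", True), ("b2", False),
                ("c1", False), ("c2", True), ("d1", False), ("d2", False)]

rates = success_rates(toy_attempts, toy_groups)
print(rates)
print(deltas(rates))
```

In the toy data Group A gets everything right, B and C get half right, and D gets nothing right, so the middle delta is zero while the outer two are 0.5.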

Finally, let

Score = (1+d1)(1+d2)(1+d3)

This formula gives the question’s score, where a higher score denotes a better question.
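In code, the formula is only a few lines. The sketch below is illustrative (the name `question_score` is made up here), but the arithmetic is exactly the formula above.

```python
def question_score(p_a, p_b, p_c, p_d):
    """Score = (1 + d1)(1 + d2)(1 + d3), with deltas between adjacent groups."""
    d1 = p_a - p_b
    d2 = p_b - p_c
    d3 = p_c - p_d
    return (1 + d1) * (1 + d2) * (1 + d3)


# Each group beats the next by 20 percentage points: about 1.2 ** 3.
print(question_score(0.9, 0.7, 0.5, 0.3))

# Every group does equally well: all deltas are 0, so the score is 1.
print(question_score(0.5, 0.5, 0.5, 0.5))

# Weaker groups do better than stronger ones: about 0.8 ** 3.
print(question_score(0.3, 0.5, 0.7, 0.9))
```

A handy reference point: a question that doesn't discriminate at all scores exactly 1, so scores above 1 lean "good" and scores below 1 lean "bad".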

“But where does this score formula come from?”, I hear you say, and rightly so. There are obviously many different possible formulas that we could have used, plenty of which would probably have worked quite well. Our initial thought was to let

Score = d1 + d2 + d3

This kinda made sense because it meant that questions with high deltas would have high scores, just like we want, but actually when we expand this expression out, we find that

Score = d1 + d2 + d3 = pA - pB + pB - pC + pC - pD = pA - pD

This means that we actually only end up with the success rate difference between Group A and Group D, and we ignore groups B and C entirely. We then thought “Okay, why not multiply the deltas?”, to give us

Score = d1*d2*d3

While this did take all the groups into account, there was a slight problem: if we had 2 large negative deltas alongside a positive one, the negatives would cancel and we’d end up with a positive score. This is bad because it means that our formula thinks that 2 wrongs make a right, and mother wouldn’t stand for that.
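Both problems are easy to see with a couple of made-up questions. The Python sketch below uses hypothetical success rates and helper names purely for illustration.

```python
import math


def deltas(p_a, p_b, p_c, p_d):
    return p_a - p_b, p_b - p_c, p_c - p_d


def sum_score(*rates):
    """First attempt: d1 + d2 + d3."""
    return sum(deltas(*rates))


def product_score(*rates):
    """Second attempt: d1 * d2 * d3."""
    d1, d2, d3 = deltas(*rates)
    return d1 * d2 * d3


# Problem 1: the sum only sees pA - pD, so it cannot tell these two apart,
# even though Group C beats Group B in the second one.
well_ordered = (0.8, 0.6, 0.2, 0.1)
scrambled = (0.8, 0.2, 0.6, 0.1)
print(math.isclose(sum_score(*well_ordered), sum_score(*scrambled)))

# Problem 2: two negative deltas multiply to a positive number, so a question
# where A < B < C can outscore one where every group is properly ordered.
ordered = (0.9, 0.6, 0.3, 0.0)    # deltas: 0.3, 0.3, 0.3
inverted = (0.1, 0.5, 0.9, 0.0)   # deltas: -0.4, -0.4, 0.9
print(product_score(*inverted) > product_score(*ordered))
```

Both lines print `True`, which is exactly the behaviour we don't want from a score.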

To solve this problem, we needed to make sure that we were only multiplying non-negative values together to get the score. Since each delta ranges between -1 and 1, if we add 1 to each delta then they’d range between 0 and 2, which is non-negative. Thus, we ended up with

Score = (1+d1)(1+d2)(1+d3)

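Score reaches its maximum value of 64/27 ≈ 2.37 when d1 = d2 = d3 = 1/3, i.e. when pA = 1, pB = 2/3, pC = 1/3 and pD = 0. The deltas can’t all be any bigger than that, because they add up to pA - pD, which is at most 1.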
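If you'd rather check the maximum by brute force than take it on trust, here is a quick sketch. It is a sanity check rather than a proof: it tries every combination of success rates on a coarse grid (steps of 1/12, an arbitrary choice that happens to include 1/3 and 2/3) and uses exact fractions to avoid rounding noise.

```python
from fractions import Fraction
from itertools import product

STEPS = 12  # arbitrary grid resolution, chosen so that 1/3 and 2/3 are on the grid
grid = [Fraction(k, STEPS) for k in range(STEPS + 1)]


def question_score(p_a, p_b, p_c, p_d):
    return (1 + p_a - p_b) * (1 + p_b - p_c) * (1 + p_c - p_d)


# Try all 13**4 combinations of (pA, pB, pC, pD) and keep the best one.
best_rates = max(product(grid, repeat=4), key=lambda p: question_score(*p))
best_score = question_score(*best_rates)

print(best_rates)                     # success rates at the maximum
print(best_score, float(best_score))  # exact fraction and decimal value
```

The search lands on pA = 1, pB = 2/3, pC = 1/3, pD = 0, with a score of 64/27 (about 2.37).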

This formula isn’t quite as glamorous as what you might see scrawled on a nerd’s window in a Hollywood blockbuster, but then again, I’m not a nerd in a Hollywood blockbuster (unfortunately).

So there you have it! We ended up with our list of questions in ranked order from best to worst. Here’s one of our best questions, with a score of 1.8:

Correct Answer: C

Why do we care?

This is useful to us because we can now use this framework to continually improve our bank of questions by retiring questions that have low scores and keeping questions that have high scores.
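A sketch of what that workflow might look like in Python is below. The question ids, the success rates and the `retire_below=1.0` cut-off are all hypothetical; 1.0 is used only because it is the score of a question on which every group performs equally well.

```python
def question_score(p_a, p_b, p_c, p_d):
    return (1 + p_a - p_b) * (1 + p_b - p_c) * (1 + p_c - p_d)


def rank_questions(rates_by_question):
    """Return [(question_id, score), ...] sorted from best to worst."""
    scored = {qid: question_score(*rates)
              for qid, rates in rates_by_question.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


def split_bank(ranked, retire_below=1.0):
    """Split ranked questions into (keep, retire) lists of ids.

    retire_below is an illustrative threshold, not a recommended value.
    """
    keep = [qid for qid, score in ranked if score >= retire_below]
    retire = [qid for qid, score in ranked if score < retire_below]
    return keep, retire


# Made-up (pA, pB, pC, pD) success rates for three questions.
toy_rates = {
    "q1": (0.9, 0.7, 0.5, 0.3),
    "q2": (0.6, 0.5, 0.5, 0.4),
    "q3": (0.3, 0.5, 0.7, 0.9),
}

ranked = rank_questions(toy_rates)
keep, retire = split_bank(ranked)
print(ranked)
print("keep:", keep, "retire:", retire)
```

Here `q3`, where the weaker groups outperform the stronger ones, is the only question flagged for retirement.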

I’m not sure it’s particularly useful to anyone not running some kind of online preparation platform for standardised tests, but I hope you found it interesting regardless.

As always, I’d love to hear your thoughts on any of this or anything else: I’m @TaimurAbdaal. Also as always, please check out some other stuff that I’ve made at www.taimur.me. Finally, if you know anyone applying for Medicine in 2016, let them know about www.bmat.ninja!