Efficient Learning — the BBC’s Bayesian quiz engine
In this article, I’m going to give an overview of Bayesian statistics and talk about how we utilise user data to personalise and improve BBC Bitesize’s Efficient Learning GCSE quiz.
Let’s start with an example for context.
Suppose you flip a coin 3 times and get heads each time. Is the coin biased?
For most people the answer will be no, but why?
When determining the answer to this question, people use two pieces of information:
- the outcomes of the 3 coin flips
- their presumed knowledge about coins being fair
A normal (legal tender) coin is unbiased and has an equal chance of producing heads or tails; that's why we use it to make decisions. It's also pretty unusual to get a non-normal coin (The Royal Mint estimates that just over 1 in 40 coins is counterfeit). So when making the above decision you are weighing up how likely you are to get the observed 3 heads from a normal coin versus how likely it is that you have a coin that biases towards heads.
Although we’ve observed more heads than tails, we haven’t seen sufficient bias to outweigh our previous understanding that coins have a 50:50 chance of being heads or tails. What you’re doing is applying Bayes’ Theorem, possibly without knowing it.
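The weighing-up described above can be made concrete. As a sketch (the 1-in-40 counterfeit rate comes from the Royal Mint estimate above, but the 0.75 heads probability for a counterfeit coin is purely an assumption for illustration):

```python
# Weighing up "fair coin" vs "biased coin" after seeing 3 heads.
# Assumption for illustration: a counterfeit coin lands heads
# with probability 0.75; 1 in 40 coins is counterfeit.
p_fake = 1 / 40
p_heads_fair = 0.5
p_heads_fake = 0.75

# How likely is "3 heads in a row" under each hypothesis?
like_fair = p_heads_fair ** 3   # 0.125
like_fake = p_heads_fake ** 3   # ~0.422

# Posterior probability that the coin is counterfeit
posterior_fake = (p_fake * like_fake) / (
    p_fake * like_fake + (1 - p_fake) * like_fair
)
print(posterior_fake)  # ~0.08, so the coin is still very likely fair
```

Even though 3 heads is over three times more likely from the biased coin, the rarity of counterfeit coins dominates, so we still conclude the coin is almost certainly fair.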
Let’s start with some background of the Bayesian framework that underpins the efficient learning algorithm…
The binomial distribution is a discrete probability distribution. It is used in situations where you have a repeatable experiment that has 2 outcomes: success or failure. If we know the probability of success is p then the binomial distribution can tell us how likely we are to observe k successes out of n trials.
The formula is given by:

P(X = k) = C(n, k) × p^k × (1 − p)^(n − k)

where C(n, k) = n! / (k! (n − k)!) is the binomial coefficient, the number of ways of choosing k successes from n trials.
Consider a coin flip example where a head is considered a success. The probability of heads (p) is 0.5. The above formula can be used to work out how likely we are to observe, for example, 3 heads (k=3) out of 5 flips of the coin (n=5). The result is 0.3125.
In our efficient learning project, the probability that a student will get a question correct is given by their mastery score. If we know the user’s mastery score then a binomial distribution can be used to work out how likely a student is to get 2 out of 3 questions correct. Here we would use k=2 and n=3.
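Both calculations can be sketched in a few lines of Python using only the standard library (the mastery score of 0.7 below is a hypothetical value, not one from our data):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Coin example: 3 heads (k=3) in 5 flips (n=5) of a fair coin (p=0.5)
print(binomial_pmf(3, 5, 0.5))  # 0.3125

# Quiz example: a hypothetical student with mastery score 0.7
# answering 2 out of 3 questions correctly
print(binomial_pmf(2, 3, 0.7))  # ~0.441
```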
Beta distributions are continuous probability distributions and can be thought of as answering the reverse question to the binomial distribution: if we observe k successes in n trials, the beta distribution can tell us how likely different values of the success probability p are.
Beta distributions are commonly used in cases where you have very little data available. For example, you could flip a coin 1,000 times to accurately work out the probability of flipping a head, but that might not be practical. Instead, you might flip a coin a few times and use a beta distribution to infer how likely different biases are, given the outcomes of those flips.
In the efficient learning quiz, we can’t just keep asking hundreds of questions to get a clear idea of a user’s mastery score — they would likely get bored and we have only a limited number of questions!
Typically, the beta distribution is expressed using 2 parameters, α and β:

f(p; α, β) = p^(α − 1) × (1 − p)^(β − 1) / B(α, β)

where B(α, β) is a normalising constant that ensures the total probability is equal to 1.
Below are some examples of beta distributions for different values of α and β.
In the case where we know nothing about the success probability, the parameters α and β can be expressed in terms of the number of successes and failures through the simple relationship*,
- α = 1 + number of successes
- β = 1 + number of failures
or, in terms of k and n,
- α = 1 + k
- β = 1 + n - k
Consequently, the purple line above would be the probability distribution associated with 1 success and 1 failure. As you can see, the most likely value of p is 0.5, which is what you might expect from 1 success and 1 failure. Equally, the orange line is the result of 1 success and 4 failures, and the most likely value for p is 0.2.
In the case where α = β = 1, the beta distribution looks like a uniform distribution. This would occur if we had 0 successes and 0 failures — i.e. we don’t have any information about what the value of p might be.
In the efficient learning algorithm, we use beta distributions to represent a user's mastery score. As we ask more questions we can get a more accurate view of a user's ability. The purple distribution in the above plot would indicate that a user's mastery score is somewhere around 0.5, but we're not really sure about it because the distribution is quite wide — it could plausibly be anywhere between 0.2 and 0.8. The orange distribution indicates a much more confident mastery score of around 0.2 because it is a much narrower distribution with a peak at 0.2.
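Taking the purple curve to be α = β = 2 (1 success, 1 failure) and the orange curve to be α = 2, β = 5 (1 success, 4 failures), as described above, a minimal pure-Python sketch of the density and its peak:

```python
from math import gamma

def beta_pdf(p: float, a: float, b: float) -> float:
    """Density of a Beta(a, b) distribution at p."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

def beta_mode(a: float, b: float) -> float:
    """Most likely value of p (valid for a, b > 1)."""
    return (a - 1) / (a + b - 2)

# 1 success, 1 failure -> Beta(2, 2): a wide belief peaked at p = 0.5
print(beta_mode(2, 2))  # 0.5
# 1 success, 4 failures -> Beta(2, 5): a narrower belief peaked at p = 0.2
print(beta_mode(2, 5))  # 0.2
```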
The beta distribution is often cited as being a conjugate prior of the binomial distribution.
I’ve actually covered what that means above without giving too much technical detail, so let’s delve a little deeper in relation to our binomial example.
If you’ve looked into Bayesian statistics before you’ll likely have come across Bayes’ Theorem:

P(A|B) = P(B|A) × P(A) / P(B)
On the right we have 3 terms:
- P(A) is our prior belief — what we believe to be true about our success probability before observing any data
- P(B|A) is our likelihood — how likely we are to observe the data given our initial beliefs. In the context of this article, this is the binomial distribution
- P(B) is the marginal likelihood — it is essentially just a normalising factor that ensures that the total probability is equal to 1
These three combine to produce the posterior on the left. The posterior tells us what our new opinion should be, given our initial opinion and the observed data. Bayes’ Theorem provides a mathematical formula for combining our prior belief with any observed data.
In the efficient learning algorithm, we use Bayes’ Theorem to combine an initial expectation about a user’s mastery score (prior) with the results from their test (likelihood). The posterior is our prediction of their mastery score given their test results.
Beta Conjugate Prior
If we use a beta distribution for our prior then our posterior is also a beta distribution — this is what is meant by a conjugate prior.
The advantage of using a conjugate prior is that there is a nice analytic relationship between the parameters α and β in our prior and posterior.
Since we’re using a binomial distribution for the likelihood function, the α and β parameters of our posterior can be obtained by simply adding the number of successes and failures to the values of α and β in our prior:
- α_posterior = α_prior + k
- β_posterior = β_prior + n - k
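The update rule above is simple enough to express directly in code. A minimal sketch (the 7-out-of-10 result is a made-up example):

```python
def update_mastery(alpha: float, beta: float, k: int, n: int):
    """Conjugate beta-binomial update: add the successes to alpha
    and the failures to beta."""
    return alpha + k, beta + (n - k)

# Start from a uniform prior (alpha = beta = 1) and observe a
# hypothetical student answering 7 out of 10 questions correctly.
a_post, b_post = update_mastery(1, 1, k=7, n=10)
print(a_post, b_post)              # 8 4
print(a_post / (a_post + b_post))  # posterior mean, ~0.667
```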
In our efficient learning quiz, n is the number of questions answered and k is the number of correct answers. The above relations tell us, mathematically, how to combine our prior belief about a user’s mastery score with the results from their test. The beta distribution with parameters α_posterior and β_posterior is a probability distribution that represents a user’s mastery score. As we ask the user more and more questions the posterior distribution becomes narrower and more confident about their ability.
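The narrowing of the posterior can be seen in its standard deviation, which shrinks as n grows even when the proportion of correct answers stays the same. A sketch (question counts are illustrative):

```python
from math import sqrt

def beta_std(a: float, b: float) -> float:
    """Standard deviation of a Beta(a, b) distribution."""
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Same proportion correct (2 out of 3 vs 20 out of 30),
# starting from a uniform prior (alpha = beta = 1):
print(beta_std(1 + 2, 1 + 1))    # after  3 questions: 0.2
print(beta_std(1 + 20, 1 + 10))  # after 30 questions: much narrower
```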
I previously stated that if you don’t know anything about the success probability then α and β parameters can be expressed as:
- α = 1 + k
- β = 1 + n - k
What we’re actually doing here is calculating the α and β parameters of our posterior, with a prior distribution given by α_prior = β_prior = 1. As I stated earlier, a beta distribution with α and β equal to 1 is in fact a uniform distribution.
When we first launched our efficient learning quiz, we didn’t have any knowledge about how our users were going to score in the quiz. As a result, we had to assume that all mastery scores were equally likely to occur across all students — we used a uniform distribution as our prior by using α_prior = β_prior = 1.
Now that the quizzes are live and we’re collecting data, does it make sense to initially assume all mastery scores are equally likely? In general, the answer is no.
The plot below shows the breakdown of scores across users of the quiz. It is clear that not all scores are equally represented; far fewer students got 3/20 than achieved 13/20. As a result, it is suboptimal to assume that the probability of a student having a mastery score of 0.15 (i.e. 3/20) is the same as 0.65 (i.e. 13/20).
We can convert the above frequencies into probabilities; instead of saying 1.2% of students get 5/20, we can say there is a probability of 0.012 that a student chosen at random will get 5/20. In the plot below I have converted the raw scores out of 20 into a mastery score by simply dividing by 20. The orange curve shows a beta distribution with parameters α = 4.71 and β = 2.75. This beta distribution is a very accurate approximation to the distribution of mastery scores across all users.
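One simple way to fit a beta distribution to observed mastery scores is the method of moments, matching the distribution's mean and variance to the data's. The scores below are invented for illustration and are not the real Bitesize data, which gave α = 4.71 and β = 2.75:

```python
def fit_beta_moments(scores):
    """Method-of-moments fit of a Beta(a, b) distribution to
    observed mastery scores (values strictly between 0 and 1)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# Illustrative mastery scores (raw quiz scores divided by 20);
# NOT the real data behind the alpha=4.71, beta=2.75 fit above.
scores = [0.35, 0.5, 0.55, 0.6, 0.65, 0.65, 0.7, 0.75, 0.8, 0.9]
a, b = fit_beta_moments(scores)
print(round(a, 2), round(b, 2))
```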
If, instead of using α_prior = β_prior = 1, we use the fitted values above as our prior, the algorithm starts off from a much more likely scenario. We can therefore easily incorporate this data into our algorithm by using the following relations for the mastery scores in our posteriors:
- α_posterior = 4.71 + k
- β_posterior = 2.75 + n - k
These values allow us to more accurately calculate whether a student gets a question wrong because they have a low level of mastery or whether they were just unlucky to get a question they didn't know the answer to.
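The effect of the fitted prior can be seen by comparing posterior means for the same test result (the 2-out-of-3 result is a made-up example):

```python
def posterior_mean(a_prior: float, b_prior: float, k: int, n: int) -> float:
    """Mean of the beta posterior after k correct answers out of n."""
    a, b = a_prior + k, b_prior + (n - k)
    return a / (a + b)

# A hypothetical student answers 2 out of 3 questions correctly.
print(posterior_mean(1, 1, 2, 3))        # uniform prior: 0.6
print(posterior_mean(4.71, 2.75, 2, 3))  # fitted prior: ~0.64
```

With only 3 answers observed, the fitted prior pulls the estimate towards the population-typical mastery score, which is a more plausible starting point than the uniform assumption.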
While using an aggregate prior for all users provides a significant increase in the performance of the efficient learning algorithm, it doesn’t provide a very personalised experience for the user.
Here are a few ways in which we break down the data to generate the most accurate and relevant experience for our users.
We can segment users in many different ways to find groups of students who behave similarly to one another but differently from students in other segments. For example, students who have previously visited the study guide sections of Bitesize are more likely to achieve higher scores on questions related to that content.
Anecdotally, some exam boards have a reputation for being harder than others, and we derive different priors for different exam boards. We also do the same for different subjects — physics tends to be more challenging than biology for a lot of students.
Time of year is also an important factor in calculating relevant priors.
Around exam times students’ knowledge reaches a peak due to their intense revision. We can derive priors based on data collected only during exam seasons and apply these only during subsequent exam periods.
We also see a difference between term-time and holidays — it tends to be the more engaged users who study regularly during school holidays. We’re able to trigger different priors based on whether it is a school holiday or not, and we’re even able to do this in different regions of the UK at different times (shameless plug to another one of my articles).
In 2020 we launched a brand new quiz to help GCSE students with their home learning during the Covid-19 pandemic. It is a highly adaptive experience, but it was built before we had any user data. The underlying algorithm was designed using Bayesian statistics to allow us to make data-driven improvements once the quiz launched. We’re now adapting and improving the algorithm in response to the data we collect from students taking the quiz.