Quizaic — A Generative AI Case Study

Part 4 — Assessing Quiz Accuracy

Marc Cohen
Google Cloud - Community
11 min read · Jul 5, 2024


This is the fourth in a series of articles about a demo application I created called Quizaic (rhymes with mosaic), which uses generative AI to create and play high quality trivia quizzes.

Here’s a table of contents for the articles in this series:

In the previous article in this series, we covered the generative AI component of our app, and how we use prompting to generate quizzes and images. In this article, we’ll explore the question of how best to assess the accuracy of our AI-generated quizzes. This is important because people will not enjoy using this app if the quizzes have wrong answers.

So how do we do this? Anyone can look at a quiz and assess its content subjectively; however, that’s neither repeatable nor automated. We need a mechanism by which we can apply solid engineering practices to automatically assess quiz quality and accuracy. That’s the focus of this article.

Dimensions of Assessment

Let’s first consider how we might assess something as fuzzy as a trivia quiz. Two dimensions come to mind:

  • accuracy — Is the correct answer actually correct, and are all the incorrect answers actually incorrect? Does the quiz contain one and only one correct answer?
  • quality — Does the quiz make sense? Are questions related to the requested topic? Are questions non-obvious (e.g. by not giving away the correct answer in the question)? Do questions or answers repeat? Does the quiz match the desired difficulty level?

In this article we will focus on accuracy because that seems like the first order of importance — if a quiz is inaccurate, its quality is of little value.

How would we solve this problem before LLMs?

Let’s imagine, in the pre-LLM world, we had access to an unlimited but unreliable collection of trivia quizzes and we wanted to estimate our confidence in the accuracy of those quizzes. We would be reduced to having humans check them, which is tedious and costly.

But maybe, in the post-LLM world, there’s a way to automate this process…

Can we solve this problem in the post-LLM world?

We start with an unorthodox proposition:

Let’s use a large language model to assess the accuracy of its own output.

At face value, this sounds ridiculous because it seems akin to allowing students to grade their own homework. We know that LLMs are prone to hallucinations, so wouldn’t their assessments be as unreliable as their primary outputs? Surprisingly, if done carefully, this counter-intuitive approach works quite well and gives us a statistically sound measure of quiz accuracy. If you’re still skeptical, that’s OK; I was too. Suspend your disbelief for a moment, and allow me to explain how this works.

Start with a baseline

The first step is to test whether an LLM is any good at assessing quizzes with known accuracy. If we had a corpus of quiz questions whose answers are known to be correct, we could use it to test the assessor, i.e. to judge how well an LLM performs at grading quizzes. Such a gold mine of data exists in the form of the Open Trivia Database, an open-source repository of human-curated trivia questions with nearly 100% accuracy.

With a bit of Python programming, we can download a copy of the opentdb dataset, and transform it into a JSON document. The result looks something like this:

[
  {
    "question": "Where was Kanye West born?",
    "correct": "Atlanta, Georgia",
    "responses": [
      "Chicago, Illinois",
      "Los Angeles, California",
      "Atlanta, Georgia",
      "Detroit, Michigan"
    ]
  },
  {
    "question": "What was the name of Marilyn Monroe's first husband?",
    "correct": "James Dougherty",
    "responses": [
      "Kirk Douglas",
      "Joe Dimaggio",
      "James Dougherty",
      "Arthur Miller"
    ]
  },
  {
    "question": "In which of these TV shows did the chef Gordon Ramsay not appear?",
    "correct": "Auction Hunters",
    "responses": [
      "Hell's Kitchen",
      "Auction Hunters",
      "Hotel Hell",
      "Ramsay's Kitchen Nightmares"
    ]
  }
]
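For reference, here’s a minimal sketch of that download-and-transform step. It assumes the public opentdb.com API and its standard response fields (question, correct_answer, incorrect_answers); the helper names are mine, for illustration, not the app’s actual code.

import html
import json
import random
import urllib.request

def fetch_questions(amount=50):
    # Pull multiple-choice questions from the Open Trivia Database API.
    url = f"https://opentdb.com/api.php?amount={amount}&type=multiple"
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp)["results"]
    questions = []
    for item in results:
        # Answers come back HTML-entity-encoded, so unescape them.
        correct = html.unescape(item["correct_answer"])
        responses = [html.unescape(a) for a in item["incorrect_answers"]] + [correct]
        random.shuffle(responses)
        questions.append({
            "question": html.unescape(item["question"]),
            "correct": correct,
            "responses": responses,
        })
    return questions

print(json.dumps(fetch_questions(10), indent=2))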

Transform each question into four assertions

We could feed these questions and associated answers directly into an LLM, but a better strategy is to split each question into four separate assertions, which we can then shuffle to avoid contextual bias. That gives us a more objective way to test the veracity of each question/response pair.

What do I mean by transforming questions into assertions? You know the game “two truths and a lie”? Well, every multiple-choice question is essentially one truth and three lies. We can decompose any question into four assertions, one of which should be correct and three of which should be incorrect, as illustrated below:

Using this idea, we can decompose 1,000 trivia questions from the Open Trivia Database into 4,000 assertions, of which 1,000 are known to be true and 3,000 are known to be false. We can then submit those assertions in batches to a large language model and ask it to return the truth value of each, as illustrated below:
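In code, the decomposition and shuffling steps might look roughly like this. This is a sketch using hypothetical helper names and the JSON format shown earlier, not Quizaic’s actual implementation:

import random

def decompose(question):
    # Turn one multiple-choice question into four (assertion, claimed_truth) pairs:
    # one assertion that should be true and three that should be false.
    assertions = []
    for response in question["responses"]:
        text = f'The answer to "{question["question"]}" is "{response}".'
        assertions.append((text, response == question["correct"]))
    return assertions

def build_assertion_set(questions):
    # 1,000 questions -> 4,000 assertions, shuffled to avoid locality bias.
    assertions = [a for q in questions for a in decompose(q)]
    random.shuffle(assertions)
    return assertions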

Assessing the assessor

Note that the four assertions in the above example are all related, since they come from one particular trivia question. In practice, we shuffle the 4,000 assertions to avoid any locality bias due to related assertions being in close proximity.

The result of this process gives us an estimate of a given LLM’s assertion grading accuracy. At the time of this writing, Google’s Gemini 1.5 Pro model grades 4,000 randomly chosen Open Trivia assertions with roughly 90% accuracy. This tells us that we can use an LLM to assess our generated quiz with a reasonably high degree of accuracy. Now let’s use this assessor to judge our generated quizzes.
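Measuring the assessor’s grading accuracy then reduces to comparing its verdicts with the known labels. Here’s a minimal sketch, where assess_assertions is a hypothetical wrapper around the LLM call that returns one True/False verdict per assertion in a batch:

def grading_accuracy(assertions, assess_assertions, batch_size=100):
    # assertions: list of (text, known_truth) pairs from the decomposition step.
    correct = 0
    for i in range(0, len(assertions), batch_size):
        batch = assertions[i:i + batch_size]
        verdicts = assess_assertions([text for text, _ in batch])
        # Count verdicts that match the known truth values.
        correct += sum(v == truth for v, (_, truth) in zip(verdicts, batch))
    return correct / len(assertions)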

Assessing our quizzes

Next, we generate a bunch of quizzes using a given model under test. Then we convert those quizzes into assertions, in the same way we decomposed the Open Trivia questions. Finally, we use Gemini 1.5 Pro to assess the accuracy of those quizzes.

This process gives us an assessment of individual assertions, but how do we generalize the result to aggregate quizzes, which contain some number of questions, each of which consists of a set of four assertions? The answer to that question requires a brief detour into a magical bit of probability theory called Bayes Theorem.

Detour — Bayes Theorem

Let’s imagine your doctor gives you the generally unwelcome news that you have tested positive for a certain undesirable disease. While this may sound unpleasant, before you lose too much sleep, it’s possible that your test result was a false positive, i.e. that you don’t really have the condition. In order to determine your actual risk, it’s worth asking your doctor two important questions:

  1. What is the accuracy of the test?
  2. What is the prevalence of the disease in the general population?

Let’s use some real numbers to make this concrete. Assume the following:

  • The test is 99% accurate. In other words, 99% of people who are sick test positive and 99% of healthy people test negative.
  • The prevalence of the disease in the general population is 1%.

Given those numbers, how nervous should you be about your test result? Take a guess before reading on.

To answer this question, let’s define two random variables:

  • have_disease represents the event that you have the disease.
  • tested_positive represents the event that you took a test for the disease and received a positive result.

We define the conditional probability of an event, written P(a | b), as the probability that a is true, given that b is true.

Given that definition of conditional probability and the numbers provided above, we can now define some probability values:

  • The probability that any given person has the disease, P(have_disease) = .01
  • The probability that you will test positive, given you have the disease, is P(tested_positive | have_disease) = .99
  • The probability that any given person will test positive is P(tested_positive) = .0198

Here’s how we calculate that last value: in a population of 10,000 people, 1%, or 100 people, will have the disease, and 99% of those, or 99 people, will test positive. The remaining 9,900 people don’t have the disease, and 1% of those, or 99 people, will (falsely) test positive. This gives us a total of (99+99)/10,000 = 198/10,000 = .0198.

Now here’s what we really want to know:

What is the probability we have the disease, given we received a positive test result? Mathematically, what is P(have_disease | tested_positive)?

Here’s the magic of Bayes Theorem: this conditional probability can be calculated using the following formula:
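P(a | b) = P(b | a) * P(a) / P(b)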

Substituting our random variables and numerical values into the formula above, we obtain:
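P(have_disease | tested_positive)
  = P(tested_positive | have_disease) * P(have_disease) / P(tested_positive)
  = (.99 * .01) / .0198
  = .5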

So your risk of actually having the disease, given you received a positive test result, is only .5 or 50%. Hopefully this is less concerning than your initial reaction to the test result.

Now let’s apply Bayes Theorem to our quiz assessment problem.

Bayes Theorem applied to assertion accuracy

Let’s define some new random variables:

  • assertion_accurate represents the event that a given generated assertion is accurate.
  • tested_accurate represents the event that a given assertion is deemed accurate by our assessor.

Hopefully, you’re starting to recognize that this problem is closely related to the one highlighted in the detour above, where an assertion being accurate is analogous to having a disease, and our assessor judging an assertion as accurate is analogous to receiving a positive test result.

Just as we did above, let’s now define some probability values:

  • The probability that any given assertion is accurate is generally unknown and depends on the model used. Since we can’t know this value explicitly, we’ll parameterize it using the variable X. In other words, P(assertion_accurate) = X
  • Based on our tests with Gemini 1.5 Pro, the probability that an assertion will be assessed as accurate, given it is accurate, is P(tested_accurate | assertion_accurate) = .9
  • The probability that any given assertion will be assessed as accurate can be calculated as follows: in a population of 10 assertions, we will have 9*X true positives and 1*(1-X) false positives, so
    P(tested_accurate) = (9*X + 1*(1-X)) / 10 = (9X + 1 - X) / 10 = (8X + 1) / 10 = .8X + .1

Similarly to the example above, here’s what we really want to know:

What is the probability an assertion is accurate given it was assessed as such? Mathematically, what is P(assertion_accurate | tested_accurate)?

Using Bayes Theorem, we can calculate that conditional probability as a function of one unknown variable, P(assertion_accurate), which we’re calling X:
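P(assertion_accurate | tested_accurate)
  = P(tested_accurate | assertion_accurate) * P(assertion_accurate) / P(tested_accurate)
  = .9X / (.8X + .1)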

Now we can vary P(assertion_accurate), the unknown accuracy level for any given model, and calculate P(assertion_accurate | tested_accurate) for each value. The results are shown in the following table:
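(The values below are computed from the formula above and rounded to two decimal places.)

X = P(assertion_accurate)    P(assertion_accurate | tested_accurate)
.1                           .50
.2                           .69
.3                           .79
.4                           .86
.5                           .90
.6                           .93
.7                           .95
.8                           .97
.9                           .99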

Here’s a graph of that data:

As you can see, the results are very favorable. Even if the underlying population of generated assertions is only 10% accurate (i.e. if X = .1), the probability that a positively judged assertion is actually correct is 50/50. If our underlying population is 50% accurate (i.e. if X = .5), there is a 90% probability that a positively judged assertion is actually correct.

A per-assertion accuracy of 80% seems like a reasonable estimate for Gemini 1.5 Pro based on empirical observations. Given that assumption, any assertion passing our test has about a 97% chance of being accurate! For the rest of this article, we’ll conservatively assume that a given assertion is accurate with probability .95, though the actual value is likely to be considerably higher for a high-quality model like Gemini 1.5 Pro.

How can we use this information?

This methodology gives us a way to test assertions, but a multiple choice quiz is made up of questions, each consisting of a group of four assertions. We’d like to be able to say something about our confidence in the accuracy of a given question, i.e. a given group of four assertions. What we can do is run the four assertions through our test and check the results.

Let’s define two new random variables:

  • k is the number of assertions in a given question assessed to be accurate.
  • question_accurate is the event that a given question with four assertions is accurate at an aggregate level, i.e. the question contains one correct response and three incorrect responses.

For any given question, k can take on any of 5 values: {0, 1, 2, 3, 4}. Now we want to know the conditional probability that a question is accurate, given its k score. Mathematically, we want to calculate P(question_accurate | k = n), for n = 0, …, 4.

We now know that the probability of a given assertion being correct is on the order of .95. With that information, we can calculate the probability that a given question will have i correct assertions, P(k=i). This is a sequence of four independent Bernoulli trials (i.e. a binomial distribution with N = 4), where each trial has P(assertion_accurate) = .95 (as determined above). The general formula for the probability of exactly k successes out of N trials is as follows:
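P(k successes in N trials) = C(N, k) * p^k * (1 - p)^(N - k)

where C(N, k) = N! / (k! * (N - k)!) counts the ways to choose which k of the N trials succeed, and p is the per-trial probability of success (here, .95).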

We can use this formula to calculate the various values of P(k = i):

  • P(k=0) = .95⁰ * .05⁴ = 0.00000625
  • P(k=1) = 4 * .95¹ * .05³ = 0.000475
  • P(k=2) = 6 * .95² * .05² = 0.0135375
  • P(k=3) = 4 * .95³ * .05¹ = 0.171475
  • P(k=4) = .95⁴ * .05⁰ = 0.81450625
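These values are easy to sanity-check with a few lines of Python (a quick check, not part of the app):

from math import comb

p = 0.95  # assumed per-assertion accuracy
for k in range(5):
    # Binomial probability of exactly k accurate assertions out of 4.
    print(k, comb(4, k) * p**k * (1 - p)**(4 - k))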

This gives us a confidence level for a given score. If a question yields a score of k=4, we can be about 81% confident in the accuracy of the overall question. If we get a score of k=3, we have only a 17% confidence level. Any score lower than k=3 strongly suggests that the question is inaccurate. These values will improve over time as the underlying model accuracy improves. For example, a per-assertion accuracy of .99 yields a k=4 confidence score of 96%.

Finally, we have what we’ve been looking for. We can take any generated question, decompose it into a set of four assertions, and run those assertions through our assessor (in bulk, so only one round trip is needed per quiz) to generate a k-score in the range 0–4 for each question. Then we simply choose the matching k value from the list above to get a probability estimate for the overall question’s accuracy.

We can then assess the overall accuracy of a question according to the following rules:

  • if k = 4, mark the question as likely accurate, i.e. “good”
  • if k = 3, mark the question as likely inaccurate, i.e. “questionable”
  • if k < 3, mark the question as very likely inaccurate, i.e. “poor”
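Here’s a minimal sketch of how those rules might be applied per question, building on the hypothetical decompose and assess_assertions helpers sketched earlier (in practice, all of a quiz’s assertions could be batched into a single call, as noted above):

def classify_question(question, assess_assertions):
    # Decompose the question into (assertion, claimed_truth) pairs, assess all
    # four assertions, and count how many verdicts agree with the quiz's own
    # claims -- that count is the question's k-score.
    assertions = decompose(question)
    verdicts = assess_assertions([text for text, _ in assertions])
    k = sum(v == claimed for v, (_, claimed) in zip(verdicts, assertions))
    if k == 4:
        return "good"          # likely accurate
    if k == 3:
        return "questionable"  # likely inaccurate
    return "poor"              # very likely inaccurate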

We could apply this assessment interactively, or as a background function, to give any generated quiz question a confidence score. Every question should then be tagged with that score in the user interface.

An Alternative Approach

The method above decomposes questions into quartets of assertions, so that assertions can be shuffled and judged independently without locality bias. However, it’s possible we could do our assessments on a per-question basis, which would entail less complexity.

My colleague Mete Atamel tried such an approach in this GitHub repo and achieved good results, suggesting that it might be more efficient to simply test questions at an aggregate level without doing any decomposition. I’ll try that in a subsequent revision of Quizaic.

In the next article in this series, we’ll explore some of the lessons we learned building Quizaic and working with generative AI in a real world application.

Next Article: Part 5 — Lessons Learned
