Selecting Question Formats to Maximize the Testing Effect

Shane Mooney
Published in Tech @ Quizlet
9 min read · Dec 20, 2017


Quizlet aims to make studying better for students, and it’s important that our tools are not only fun and engaging, but also effective. One big part of that is helping students focus on the terms that need the most work, but another important challenge is choosing the right question format for studying those terms. Most of our study systems use a single question format, while Test uses a random mix. But Learn is our first study system that uses multiple question formats adaptively, attempting to choose the best format for each term as it’s studied. So how do we decide which question formats to use?

Cognitive science research can inform this decision. The testing effect says that active retrieval practice is an effective way of studying, so we want to choose question formats that promote retrieval practice and discourage passive re-reading. Additionally, the retrieval effort hypothesis states that difficult questions are better than easier questions, but only if they can still be answered correctly. This suggests that there’s an optimal level of difficulty we should aim for, in line with the idea of desirable difficulty. We already order terms to target this optimal difficulty, but we can also optimize the format of the questions we ask on those terms.

The new Quizlet Learn launched in Spring 2017, and that summer we tested out some changes to how we use the different question types. After de-identifying and aggregating the resulting study sessions, we did some analysis to help us improve how we select types within Learn. Read on to learn more!

Question Formats

When Learn was launched, it supported five question formats:

  • Typed Response: Given a prompt, type the answer. We require the input answer to more or less exactly match the correct answer, but have some rules in place to be lenient about things like capitalization and punctuation when appropriate.
  • Self-Graded Flashcards: A virtual flashcard with a front and back. Additionally, we ask the student to honestly evaluate whether they were able to recall the back side before flipping the card with “Know” and “Don’t Know” buttons.
  • True/False: Given a pair of items, answer: Do they go together? True or false.
  • Multiple Choice: Choose which of the 4 options goes with the prompt.
  • Multiple Choice with None of the Above: Like multiple choice, but with an additional “None of the Above” option.

Examples of Self-Graded Flashcard, Multiple Choice, and Typing questions

These formats can be organized into two categories: cued recall and recognition questions. A cued recall question has a “cue” (or “prompt”) that corresponds to an expected answer that must be recalled, purely from memory. Both typed response and self-graded flashcards are examples of cued recall questions.

We consider the other question formats to be recognition questions. The student is given a prompt and options that may correspond to the prompt. The student must either identify the correct answer among the incorrect alternatives, or determine that the correct answer is not present. In either case, this task is easier than recalling the answer purely from memory.

Differentiating Factors

What distinguishes these question formats from each other? What about them might make one a better choice than another? There are several differentiating factors that we looked into to gauge the overall effectiveness of each format.

Guessability

For questions that have a fixed number of options it’s possible to guess the correct answer purely by chance. While this might lift a student’s test score, it’s not ideal for studying. If someone can guess the correct answer without actually knowing it, they don’t gain the learning benefits of actively recalling it. Additionally, we’re not getting as good a signal about which terms the student actually knows, and it’ll be harder to help them focus on the terms they don’t know.

Our multiple choice questions have a 1 in 4 chance of being guessed correctly, multiple choice with none of the above questions have a 1 in 5 chance, and true/false questions have a 1 in 2 chance. Cued recall questions can’t be guessed, at least not purely by chance. If a student is able to narrow the answer down to a few options and guess correctly among those, they’ve still recalled the correct answer, even though they also recalled an incorrect one. Self-graded flashcards can’t really be guessed, but there is a possibility that someone will be too easy on themselves and mark a card as “known” even if they didn’t fully know the answer.

Probability of correct answer by pure-chance guess
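The chance-level figures above follow directly from the number of options each format presents. A minimal sketch in Python, where the format names and the `chance_of_guessing` helper are illustrative and not taken from Quizlet's code:

```python
from fractions import Fraction

# Number of selectable options per recognition format, as described above.
# The dictionary keys are hypothetical names chosen for this example.
OPTIONS = {
    "true_false": 2,
    "multiple_choice": 4,
    "multiple_choice_none_of_the_above": 5,
}


def chance_of_guessing(question_format: str) -> Fraction:
    """Probability of a correct answer from a uniformly random guess.

    Formats without a fixed option list (typed response, self-graded
    flashcards) are treated as not guessable by pure chance.
    """
    n_options = OPTIONS.get(question_format)
    return Fraction(1, n_options) if n_options else Fraction(0)


for name in ["true_false", "multiple_choice",
             "multiple_choice_none_of_the_above", "typed_response"]:
    print(name, chance_of_guessing(name))
```

This only models a blind guess; it says nothing about partial knowledge or process of elimination, which are covered under Difficulty.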

Difficulty

Difficulty is one of the biggest factors determining what makes a good question, and there’s more to it than just guessability. Both multiple choice and true/false questions allow for process of elimination. For example, in a multiple choice question, if you know three of the options are incorrect, the fourth must be correct. This allows you to choose the correct response without actually knowing what the real answer is. Like guessing, we expect this to reduce the learning benefit.

A more subtle difference affecting difficulty is recall vs. recognition. These are two fundamentally different types of memory tasks, and recall is usually more difficult than recognition. We can get a relatively unbiased view of how difficult a question format is by comparing how often each format is answered correctly for terms the user has not studied before. The formats that allow for recognition are answered correctly much more frequently than those that require pure recall. Even the hardest question format is answered correctly on the first try over 60% of the time, and we would only expect students to guess on questions they don’t know the answer to, so guessing can explain only part of the gap: the difference is larger than we would expect from guessing alone.

Percentage of correct answers on the first question students have ever answered on a term
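To make the "larger than guessing alone" argument concrete, here is a rough back-of-the-envelope model. It assumes that the first-try accuracy on the hardest format (taken here as exactly 60%, the lower bound quoted above) approximates the share of terms a student already knows, and that students guess uniformly at random on the rest. Both are simplifying assumptions for illustration, not part of the original analysis.

```python
def accuracy_if_only_guessing(p_known: float, n_options: int) -> float:
    """Expected accuracy when known terms are answered correctly and
    every other term is answered by a uniform random guess."""
    return p_known + (1.0 - p_known) / n_options


# Lower bound from the "over 60%" first-try figure; treating it as the
# share of already-known terms is an assumption of this sketch.
P_KNOWN = 0.60

for name, n_options in [
    ("true/false", 2),
    ("multiple choice", 4),
    ("multiple choice with none of the above", 5),
]:
    print(f"{name}: {accuracy_if_only_guessing(P_KNOWN, n_options):.2f}")
```

Under these assumptions, guessing alone would lift accuracy to about 0.80 for true/false, 0.70 for multiple choice, and 0.68 for multiple choice with none of the above. Observed recognition accuracy above those levels points to recognition itself being an easier memory task, not just a more guessable one.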

Learning Benefit

The main thing we care about is how much students learn from each question. After all, that’s the whole point of studying. Learning benefit is closely related to difficulty; more difficult questions tend to have higher learning benefit. If you want to learn as effectively as possible, you should stay just out of your comfort zone and challenge yourself with more difficult questions. This idea is known as desirable difficulty.

This suggests that recall questions, being harder, should be more beneficial than recognition questions. While this seems to be true for the most part, there may be a point at which a question is too difficult. Correctly recalled answers have more learning benefit than incorrectly recalled answers (Pavlik & Anderson, 2005). We want to aim for a sweet spot where it’s difficult to get the right answers, but not so difficult that the student gets most of them incorrect. This implies that we should start out with easier question formats when the material is unfamiliar, and progress to more difficult question formats as the student starts to learn it, a technique known as guidance fading.

Time

Learning is the main benefit of study, but we also have to consider the costs. Studying takes time that could be spent in other ways. Because we know our users have a limited amount of time for studying, what we really want to maximize is the amount of learning per minute. It might be better to ask two fast questions than one slower but more effective question.

Time spent per question, including feedback
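The trade-off between benefit and time can be expressed as a simple rate. The sketch below uses made-up benefit scores and timings purely to illustrate the arithmetic; they are not measurements from Learn, and `learning_per_minute` is a hypothetical helper.

```python
def learning_per_minute(benefit_per_question: float,
                        seconds_per_question: float) -> float:
    """Learning benefit accumulated per minute of study, in whatever
    unit the benefit is measured in."""
    return benefit_per_question * 60.0 / seconds_per_question


# Hypothetical numbers for illustration only.
slow_but_effective = learning_per_minute(benefit_per_question=1.0,
                                         seconds_per_question=20.0)
fast_but_weaker = learning_per_minute(benefit_per_question=0.6,
                                      seconds_per_question=8.0)

print(slow_but_effective)  # 3.0 units of benefit per minute
print(fast_but_weaker)     # 4.5 units of benefit per minute
```

With these invented values, the faster format wins on learning per minute even though each individual question teaches less, which is the situation described above.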

Fun

It’s unavoidable that studying needs to be challenging, but that doesn’t mean it can’t be fun and motivating. If a question is so frustrating it makes students want to give up and stop studying, it’s not doing a great job of helping them learn no matter how effective it might have been. We need to pick question formats that are challenging, but that students actually want to use.

One way we can estimate how enjoyable the questions are is to compare how often our users stop studying early when using each format. To measure this, we ran a test to evaluate the three recognition question types. By default, when our algorithm called for a recognition question, we’d randomly choose between True/False, Multiple Choice, and Multiple Choice with None of the Above. We compared that behavior against three variants that used each of those question types exclusively. We then measured how much progress each group made within Learn before either completing or abandoning study. When we saw certain question types correspond to completing more of the set, we interpreted that to mean that students found that question type more fun and motivating.

Difference in proportion of terms in set answered at least once when each question format was used, as compared to a random mix of the three.
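The progress comparison described above could be computed roughly as follows. This is a sketch under assumptions: the session tuples, variant names, and the use of a plain mean per group are illustrative choices, and the sample data is invented.

```python
from collections import defaultdict
from statistics import mean


def difference_from_control(sessions, control="random_mix"):
    """Mean proportion of terms answered at least once per variant,
    expressed as a difference from the control group.

    `sessions` is an iterable of (variant, terms_answered, terms_in_set).
    """
    progress_by_variant = defaultdict(list)
    for variant, terms_answered, terms_in_set in sessions:
        progress_by_variant[variant].append(terms_answered / terms_in_set)

    baseline = mean(progress_by_variant[control])
    return {
        variant: mean(progress) - baseline
        for variant, progress in progress_by_variant.items()
        if variant != control
    }


# Invented sample sessions, for illustration only.
sample = [
    ("random_mix", 8, 10),
    ("random_mix", 6, 10),
    ("multiple_choice", 9, 10),
    ("true_false", 5, 10),
]
print(difference_from_control(sample))
```

A positive difference means that group got through more of the set than the random mix did before completing or abandoning study. A real analysis would also need enough sessions per group to tell a true difference from noise.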

Which question formats are the best?

It’s not clear that there’s a single best question type for all cases, but within the cued recall and recognition categories, there are clear advantages and disadvantages to the different formats.

Cued Recall

In order to ensure students have fully learned the material they’re studying, we want to make sure they’re able to answer recall questions correctly. Learn requires each term to be answered correctly with a recall question, either typing or self-graded flashcards, before finishing. Typing questions require you to practice typing the term, which may aid learning. However, they can take longer, especially if the answer is very long. Self-graded flashcards require a little more discipline to use effectively, but can be faster, and work with any study material, even if it’s impractical or impossible to type. With that in mind, we updated Learn to use typing questions as the main cued recall question format, but fall back to self-graded flashcards if it looks like the answer is either too long to practically type, or if the answer has no text at all (in the case of images or diagram locations).
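The fallback rule described above can be sketched as a small function. The length threshold and the function and format names are assumptions made for this example; the post does not say what cutoff Learn uses.

```python
from typing import Optional

# Hypothetical cutoff; the actual threshold is not given in this post.
MAX_TYPED_ANSWER_LENGTH = 80


def choose_cued_recall_format(answer_text: Optional[str]) -> str:
    """Prefer typed response, falling back to self-graded flashcards when
    the answer has no text or is too long to type comfortably."""
    if answer_text is None or not answer_text.strip():
        # Image-only answers and diagram locations have nothing to type.
        return "self_graded_flashcard"
    if len(answer_text) > MAX_TYPED_ANSWER_LENGTH:
        return "self_graded_flashcard"
    return "typed_response"


print(choose_cued_recall_format("mitochondria"))  # typed_response
print(choose_cued_recall_format(None))            # self_graded_flashcard
print(choose_cued_recall_format("x" * 200))       # self_graded_flashcard
```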

Recognition

We start out on recognition questions as a way of easing the student into the more difficult cued recall questions. Originally, Learn would randomly choose between multiple choice, multiple choice with none of the above, and true/false questions. However, after doing some testing and evaluation, we identified some disadvantages to true/false and the “none of the above” option, which led us to remove them in favor of standard multiple choice.

Multiple Choice with None of the Above: The inclusion of a “none of the above” option makes multiple choice questions more difficult, and after answering these questions correctly, our users were more likely to be able to answer cued recall questions correctly. However, there are a number of drawbacks. If the correct answer is “none of the above”, it’s possible to choose the correct response without actually knowing what the correct answer is, and like guessing, we don’t think this is very helpful for learning. Additionally, they take longer to answer, and when we used them more extensively, students were more likely to stop studying early, presumably because they found the format frustrating. Overall, the costs outweigh the benefits, so we chose to drop the “none of the above” option in our multiple choice questions.

True/False: The main advantage of true/false questions is that they are fast to answer. However, they’re a poor question format in other ways. With only two options, the guessability is extremely high, and if the correct answer is ‘false’, it’s possible to get the question correct without knowing the actual answer, essentially the same problem as the “none of the above” option. It can also be difficult to give clear feedback in this case, because we would need to communicate not only that the correct answer was ‘false’, but also what the information is that correctly corresponds to the prompt. Finally, the true/false format led to higher study abandonment rates, perhaps because it is less rich and interesting than multiple choice questions.

Conclusion

Choosing the right question type for a given scenario is a balancing act. We have to consider the difficulty of the question type, how well we think the student knows the term at a given time, how fun and engaging the question type is, and how much time it will take to get through. Based on the research and evaluation detailed above, we’ve decided on starting with multiple choice questions and progressing to cued recall questions (either typing or self-graded flashcards) — but this may change in the future.
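The fixed progression described above can be pictured as a tiny per-term state machine. This is a sketch only: the stage names are invented, and the choice to keep a term at its current stage after an incorrect answer is an assumption, since the post does not describe how Learn handles misses.

```python
from enum import Enum


class Stage(Enum):
    RECOGNITION = "multiple_choice"
    CUED_RECALL = "cued_recall"  # typing or self-graded flashcard
    DONE = "done"


def next_stage(stage: Stage, answered_correctly: bool) -> Stage:
    """Advance a term one stage per correct answer.

    Assumption: an incorrect answer leaves the term at its current stage.
    """
    if not answered_correctly or stage is Stage.DONE:
        return stage
    if stage is Stage.RECOGNITION:
        return Stage.CUED_RECALL
    return Stage.DONE


stage = Stage.RECOGNITION
for correct in [False, True, True]:
    stage = next_stage(stage, correct)
print(stage)  # Stage.DONE
```

The sketch keeps the one property the post does state: a term only finishes after a cued recall question has been answered correctly.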

Because the ideal question format may depend on how well a student knows a term at a given time, we’d like to next experiment with choosing between multiple choice and cued recall questions more adaptively, based on our estimate of the student’s knowledge of the set. Our hope is that this more personalized behavior would help our users learn even more efficiently than our current fixed multiple choice to cued recall progression.

If you’re interested in helping us continue to make learning better, Quizlet is hiring!

Acknowledgements: Thanks to Theresa Pittappilly, Jeff Chan, Jen Liu, Eric Bomgardner, Turadg Aleahmad, Karen Sun, Amalia Nelson, and Dan Crowley for editing this post.
