How Quizlet does smarter grading: Using ML and NLP to grade millions of answers

Ling · Published in Tech @ Quizlet · 6 min read · Sep 17, 2021

By Anna Khazenzon and Ling Cheng

One of the most effective ways to study is to practice recalling the material from memory, according to decades of learning science research. Based on this principle, Quizlet’s Learn study mode enables students to study through a variety of question types, like flashcards, multiple choice questions, and graded written (typed) answers.

Question prompt for a written answer

Answering written questions is an effective way to build knowledge, but grading these free response answers accurately can be challenging. Historically, Quizlet required the student’s answer to be a near exact match to the expected text (in the study set) in order to mark it as correct. When students answered questions correctly, but made typing errors or reworded the right answer, they were marked incorrect. This could be frustrating, and slowed down progress. In fact, until we made our grading smarter, one of our top pieces of feedback from students was that our written answer grading was too strict.

We also received a signal from a feature that allows students to override our grading directly, e.g. by marking “I was correct”. Because of this strictness, students were overriding 13% of answer grades, and we wanted to keep students learning with minimal friction.

So what was our previous grader missing? When we looked at the data, we found that for most of the overrides, the answer was just 1 or 2 characters off from the expected answer — e.g. for “outdoor” a user might type “outdoors” — which students considered correct.

In other cases, particularly when the expected answer was longer than a few words, we saw that students were overriding to mark correct even when the text was very different. The student often used synonymous words, or phrased the answer in a different way, and there could be many acceptable right answers. These longer (3+ word) answers had a much higher override rate of 34%.

Other ways of wording the answer would be marked incorrect

We realized there were two distinct problems, the solutions of which became two new features:

  • Typo Help to allow for minor text differences
  • Smart Grading to grade semantically, or based on meaning

Our Approaches

Typo Help

Over the years, Quizlet’s grading had become incrementally smarter through the addition of specific rules. Surface-level fixes, like ignoring capitalization, were already in place. Specialized rules, like accepting any one of multiple possible answers, were added to support desirable study experiences. The majority of the remaining issues stemmed from minor text differences that we weren’t yet handling. We evaluated the extent to which lightweight linguistic features could add desirable leniency, based on their impact on historical override rates. We decided to implement two features that distinguished the “incorrect” answers users chose to override from those they considered truly incorrect:

Levenshtein distance (or edit distance), which measures the character similarity between two strings, was the most impactful feature. If the expected and actual responses for a question differ by at most 1 character, or by at most 15% of their characters, then we mark the answer correct.
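This rule can be sketched in a few lines. The exact production thresholds and normalization are not public, so the 1-character / 15% cutoff below is an assumption based on the description above:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def within_typo_tolerance(expected: str, actual: str) -> bool:
    # Assumed rule: correct if at most 1 edit apart, or at most
    # 15% of the expected answer's characters apart.
    dist = levenshtein(expected.lower(), actual.lower())
    return dist <= max(1, int(0.15 * len(expected)))
```

With this sketch, `within_typo_tolerance("outdoor", "outdoors")` is true, since the two strings are a single edit apart.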

Removing articles (a, an, the) from grading input accounted for considerable overrides, and was easy to implement. If the expected and actual responses differ only because of the presence or absence of articles, then we mark the answer correct.
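The article rule is even simpler. A minimal sketch, assuming whitespace tokenization and case-insensitive comparison (the production normalization may differ):

```python
ARTICLES = {"a", "an", "the"}

def strip_articles(text: str) -> str:
    # Drop English articles before comparing answers.
    return " ".join(w for w in text.lower().split() if w not in ARTICLES)

def matches_ignoring_articles(expected: str, actual: str) -> bool:
    return strip_articles(expected) == strip_articles(actual)
```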

Although helping with typos usually improves the experience, there are cases when learners need to practice getting answers correct, letter-for-letter. For example, based on the override patterns, we decided to disable Typo Help for language learning content.

Smart (Semantic) Grading

Semantic Grading required a much more sophisticated solution, utilizing machine learning (ML). This problem closely resembles Semantic Textual Similarity (STS), a well-studied research area in the field of Natural Language Processing (NLP) or Understanding (NLU), involving understanding the meaning of sentences.

More similar sentences have more similar embeddings. Image from TechViz

The state-of-the-art approaches use sentence embeddings: vector (a sequence of numbers) representations of sentences that encode their meaning. These allow us to compare two sentences by measuring the similarity of their embeddings. If the similarity score is above a certain threshold, we mark the answer correct.
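The comparison step is typically cosine similarity between the two embedding vectors. A minimal sketch — the `embed()` model and the 0.8 cutoff are placeholders, not Quizlet's actual model or threshold:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors, in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical threshold; the production cutoff is tuned on labeled data.
SIMILARITY_THRESHOLD = 0.8

def grade_semantically(expected_emb: np.ndarray, actual_emb: np.ndarray) -> bool:
    return cosine_similarity(expected_emb, actual_emb) >= SIMILARITY_THRESHOLD
```

In practice `expected_emb` and `actual_emb` would come from an encoder such as a fine-tuned BERT-based model; identical sentences score 1.0 and unrelated ones score near 0.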

To generate sentence embeddings, we use BERT-based models fine-tuned on our unique answers data.

There was a major internal concern that if we reduced strictness in grading, we might accidentally mark students correct when they were actually wrong, leading them to false confidence that they knew their material. We labeled hundreds of examples to create ground truth data, ran the grading model on them, and found that the model had a very low rate of these “false correct” grades. This gave us confidence that our algorithm was good enough to test in the real world. Also, the model performed well on public benchmark STS datasets.

All the above would be marked incorrect without Smart Grading

No machine learning model that solves a complex problem will be 100% accurate, but we were essentially trading 100% “false incorrects” (when the answer was not an exact match) for a few percent of “false corrects”. Also, there will always be a small fraction of answers whose correctness is genuinely subjective, and students can still override our grading based on their own judgement.

As we looked through the data, we found areas where semantic grading wouldn’t make sense, and disabled it for those, like mathematical formulas and numbers.

In order to scale the grading service, we tested and used model compression techniques to be able to run the model quickly and cost effectively, while maintaining quality. We host the service on Google Kubernetes Engine, and are able to grade within a median latency of 50ms.

Results

Both Typo Help and Smart Grading decreased override rates by 50% or more, and we saw more students persevere through studying all their material. We heard great feedback from students, validating that the grading worked well and helped them test themselves beyond rote word-for-word memorization.

These results gave the company continued confidence that we can use both data-informed rules and computationally intensive ML models to make a big impact on student learning as we continue to innovate.

This example came from a user’s tweet praising Smart Grading.

Future Directions

On the experience side, we’re refining when answers should be typed. Better experiences could include breaking down long text with fill-in-the-blank questions that focus on key words (another NLP feature we’ve built).

An important consideration when building tools for studying user-generated content is learner control. A key learning for us was to include a prominent option to turn the Typo Help feature off. In the future, we hope to improve default grading settings that work best for the material and user.

On the Smart Grading data science side, we are working on providing better feedback about a wrong answer, i.e. identifying the most important incorrect or missed words. We would like to continue to fine-tune the quality, and eventually enable this for languages other than English.

Note: Typo Help is available for free to all students. Smart grading is accessible with a Quizlet Plus subscription.

Acknowledgements: Thanks to Murali Kilari from our Data Science team who worked on the semantic grading problem end-to-end from the research phase to production service. Thanks to the Quizlet study team who brought this to life for our users.
