Chicken Soup for the AI Brain: Labeling

Songju Han
9 min read · Dec 11, 2023


Image 1: Data Labeling for Quizium AI

Data Labeling: The Absolutely Needed “Hassle”

Amidst changing business decisions and task priorities, I had my eyes on one particular calling for Quizium, the ultimate one, in my opinion: to serve top-notch questions that truly assist learning. Our team had promised our users to be the best companion for video learning, and we were determined to keep that promise.

Yes, we were definitely lucky to be living in the current era of large language models (LLMs) and artificial intelligence (AI). This technological boon enabled us to analyze and understand YouTube video content almost instantly, and even transform it into interactive quiz experiences. But meticulously tailoring these questions for maximum educational impact and engagement was the next-level challenge. The real task was to creatively integrate AI not just in interpreting content, but in ensuring that the questions aligned with the content’s learning objectives, factual context, and user needs.

In short, we needed our AI model “brain” to have that sophisticated blend of AI insights and human educational acumen. This set us on a mission to create an extensive labeled dataset and carefully refine our prompting mechanisms.

The Truth Hurts

In the beginning, Quizium was already making “okay” questions, which was, unironically, not okay. For example, the generated questions stayed true to the video content without much problem, but not necessarily to the educational context. Our engine was smart enough to ask, “What is the main role of mitochondria in a eukaryotic cell?”, but not quite sophisticated enough to rule out “Who was this video sponsored by?”. It sometimes even attempted to torture students by asking “How many times was the word ‘polypeptide’ mentioned in the video?”. We also noticed a few other problems with the answer and distractor options. One of the most common tendencies of our early engine was making the answer option way too obvious compared to the distractors.

Image 2: Questions created by the early question generation engine

It became imperative that our question generation AI have its own quality check model (or, as we creatively call it, the QC model), in addition to the prompt refinement. The idea was for it to have its own elaborate scoring and filtering system to rule out “bad” questions from being served. And to create that model, we first had to build a labeled dataset: we would train the QC model on it, making it possible for the Quizium AI model to assess and filter questions on its own.

Measuring What’s Good or Bad

The first step was data collection and labeling. We needed a clear guideline for our human annotators to follow when assigning a label or score based on predefined criteria.

Okay, so we had a natural sense of what makes a… shitty question. But everyone had a different idea of what makes a “good” question. A literal flashback to Theory of Knowledge class. Did we want our questions to be precise? But what if we wanted to deliberately puzzle the user, to check their attentiveness? Did we want to challenge learners by posing difficult questions? But what if that was going to diminish their engagement and motivation to study?

To narrow down our approach, we shifted our focus to prioritizing the very essence of what makes a “good” question on Quizium, specifically. Quizium’s primary goals were to:

(1) Encourage users to complete watching the video they chose
(2) Help watchers remember key points of the video, and
(3) Engage them in critically reviewing the important aspects of the video instead of watching passively.

Then, we began by dissecting what each of these goals entailed in terms of question design. For one, to ensure content completion, we wanted to make sure that no element of the question was a dead giveaway. This meant our answer option could no longer play “guess the main character”. For memory retention and critical engagement, we explored techniques in question framing so that question stems would strictly echo and reinforce the educational context of the video.

As we delved deeper into the nuances of question creation, it became apparent that each segment of a question — the stem, answer, and distractors — indeed played a unique role in determining its overall effectiveness. Each of them, therefore, had to be evaluated as a whole but also individually. In creating the evaluator list, we also noticed some issues that were not as apparent. For example, no matter how well-written, the overall diversity of distractor options also contributed to the obviousness of a question, both in terms of sentence structure and the actual content.

And to turn these discoveries into measurable evaluators, we worked together with the AI researchers and our Data Manager to refine the wording and exact definitions. The conversation went something like this:

“Number 21 says ‘The answer is too easy to guess’, but that’s a relative measurement. Easy, in what way? Too easy for the user? How will we score that?”

“Right, it is in relation to what the answer option looks like compared to the distractor options.”

“We also have to think about cases where the length of the answer and distractor options are nearly the same but the answer is still too obvious.”

“So let’s break it down further to clarify what it is exactly that makes it so easy to guess.”

“It may not even be the answer that’s the problem. It could be that the distractors are way too irrelevant to the whole thing.”

By the end of those long discussions, we had developed around 30 distinct evaluation categories. Depending on the number of multiple-choice options, one question could have up to 49 labels, and these labels were going to be the first chicken soup for our AI brain.
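To make this concrete, here is a rough sketch of what a single labeled question record might look like. The category names below are invented for illustration only; they are not the actual evaluation categories we settled on.

```python
# A minimal, hypothetical example of one labeled question record.
# Category names are illustrative -- the real ~30 categories are not listed here.
question_record = {
    "question_id": "q_000123",
    "video_id": "yt_abc123",
    "stem": "What is the main role of mitochondria in a eukaryotic cell?",
    "answer": "Producing ATP through cellular respiration",
    "distractors": [
        "Storing the cell's genetic material",
        "Synthesizing proteins from mRNA",
        "Breaking down waste with digestive enzymes",
    ],
    "labels": {
        # whole-question judgments
        "stem_reflects_learning_objective": 1,
        "stem_is_trivia_about_the_video_itself": 0,
        "answer_too_easy_to_guess": 0,
        "distractor_set_is_diverse": 1,
        # per-option judgments, one entry per distractor
        "distractor_relevance": [1, 1, 0],
        "distractor_length_matches_answer": [1, 0, 1],
    },
}
```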

Birth of R.MarkAI

Once we had a better idea of what exactly our labels were going to be, we were quite happy. We finally had the chicken for the chicken soup. That is, until this one question came up:

“We are not going to have people Google Sheet the crap out of this, are we?”

The thought of having people manually sift through thousands of questions in a Google Sheet, assigning quality scores one by one, sounded rather primitive. Something had to be done to make the process more efficient and manageable. Given the limited time for the entire labeling project, we had around a week to build a specialized but lightweight platform that would streamline the annotation process. We couldn’t build an entire kitchen, but we at least needed a pot and fire.

Fortunately, creating a list of features and then crossing out the non-essentials wasn’t too challenging, as what we absolutely needed was quite clear. The goal was to help the human annotators evaluate and score questions with ease, while also maintaining consistency and accuracy in their assessments. Supporting different label types? Absolutely needed. Admin control? Not quite. Grouping related evaluation categories? Helpful. Export and import? Sure, but that could be covered through simple coding.
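For a sense of what that “simple coding” might look like, here is a minimal sketch that flattens labeled records into a CSV export. The file names and fields are placeholders I made up for illustration, not R.MarkAI’s actual schema.

```python
# A rough sketch of a label export, assuming records are stored as JSON.
# Paths and field names are placeholders, not the tool's real schema.
import csv
import json

def export_labels_to_csv(json_path: str, csv_path: str) -> None:
    """Flatten labeled question records into one CSV row per question."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)

    # Collect every label key that appears across records so the CSV header
    # stays stable even if some questions skip a category.
    label_keys = sorted({k for r in records for k in r["labels"]})

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question_id", "stem"] + label_keys)
        for r in records:
            writer.writerow(
                [r["question_id"], r["stem"]]
                + [r["labels"].get(k, "") for k in label_keys]
            )

export_labels_to_csv("labeled_questions.json", "labels_export.csv")
```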

On top of the features, the developer put much effort in making sure the overall interface was intuitive to minimize the learning curve. Within a few days, our data labeling platform was up and running. In addition to speeding up the labeling process, R.MarkAI also allowed the team to easily track the progress and review the data in real time.

Image 3: A look at R.MarkAI, our tool for the data labeling project

In the beginning, we estimated the team would finish evaluating 1,000 questions, giving us approximately 40,000 labels for Phase I. Now we have a new projection and expect to end up with double that amount.

What Actually Happens With the Chicken Soup

Even with the labeling process sailing somewhat smoothly, creating the actual QC model still wasn’t a simple “abracadabra, here’s the verdict for the question.” For one, we needed to wait until the data labeling and reviewing process was completed. And there are multiple layers involved in creating the QC model after the data acquisition:

Training a Question “Estimation” Model

With the labeled dataset in hand, we can first train a quality estimation model. It would be trained on the reviewed label annotations, and given a new question that was not in the training dataset, it would output the predicted labels.
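To give a flavor of this step, here is a minimal sketch using off-the-shelf scikit-learn components. It is a stand-in, not our actual model, and it assumes each label category has been reduced to a binary value per question.

```python
# A hypothetical quality estimation model: TF-IDF features plus one
# logistic regression per label category. A stand-in, not the real model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

def question_to_text(record: dict) -> str:
    """Concatenate stem, answer, and distractors into one string."""
    return " [SEP] ".join(
        [record["stem"], record["answer"], *record["distractors"]]
    )

def train_quality_estimator(records: list[dict], label_keys: list[str]):
    """Fit a model that predicts each binary label category for a question."""
    texts = [question_to_text(r) for r in records]
    targets = [[r["labels"][k] for k in label_keys] for r in records]
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        MultiOutputClassifier(LogisticRegression(max_iter=1000)),
    )
    model.fit(texts, targets)
    return model
```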

Model Evaluation

Once the quality estimation model is trained, the team will evaluate its performance using a separate validation dataset. The assessment will be on how well the quality estimation model’s predictions align with human judgment. This step helps ensure that the quality estimation model is accurate and reliable.
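As a rough illustration of that check, one could compute a per-category F1 score against the human labels on a held-out validation set, reusing the question_to_text helper from the sketch above. The choice of metric here is my own assumption; the key point is comparing predictions with human judgment.

```python
# Hypothetical validation step: per-category agreement with human labels.
from sklearn.metrics import f1_score

def evaluate_quality_estimator(model, val_records: list[dict],
                               label_keys: list[str]) -> dict[str, float]:
    """Return an F1 score per label category on a held-out set."""
    texts = [question_to_text(r) for r in val_records]  # helper defined earlier
    y_true = [[r["labels"][k] for k in label_keys] for r in val_records]
    y_pred = model.predict(texts)

    scores = {}
    for i, key in enumerate(label_keys):
        true_col = [row[i] for row in y_true]
        pred_col = [row[i] for row in y_pred]
        scores[key] = f1_score(true_col, pred_col, zero_division=0)
    return scores
```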

Filtering and Ranking

After validation, the quality estimation model will be further developed to assign a specific “score” based on the label data. Then it will filter out questions generated by the AI model that would be deemed “poor quality”; this is where the model gets closer to becoming the QC model. Questions that receive low quality scores can be discarded or flagged for further review.
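To sketch how that scoring and filtering could work, the per-category predictions can be collapsed into a single weighted score and compared against thresholds. The weights and cutoffs below are invented for illustration, not the values we actually use.

```python
# Hypothetical scoring and triage step; weights and thresholds are made up.
SERVE_THRESHOLD = 0.75   # below this, the question is not served as-is
REVIEW_THRESHOLD = 0.50  # below this, the question is discarded outright

def quality_score(predicted_labels: dict[str, int],
                  weights: dict[str, float]) -> float:
    """Weighted share of passed checks; 1.0 means every check passed."""
    total = sum(weights.values())
    passed = sum(w for k, w in weights.items() if predicted_labels.get(k))
    return passed / total if total else 0.0

def triage(predicted_labels: dict[str, int],
           weights: dict[str, float]) -> str:
    """Decide whether to serve, flag, or discard a generated question."""
    score = quality_score(predicted_labels, weights)
    if score >= SERVE_THRESHOLD:
        return "serve"
    if score >= REVIEW_THRESHOLD:
        return "flag_for_review"
    return "discard"
```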

Iterative Improvement

As we continue to use the AI model to generate questions and collect new data, we can periodically update the labeled dataset and retrain the quality check model. Over time, the quality of questions should improve, as the generation pipeline learns from the feedback provided by the quality estimation model.

Feedback Loop

We will also have additional human annotators and subject matter experts who will play a crucial role by regularly reviewing and refining the labeling criteria. Human expertise will ensure that the quality check model aligns with the desired standards for question quality.

The good news, though, was that as the team annotated and reviewed so much question data, we were able to spot additional recurring, subtle patterns in question generation that could be fixed through prompt refinement. That path did not require a quality check model; we could work with our prompt engineer to enhance the core ability of the question generation AI model directly. As a result, we now have a two-track question quality improvement project, tackling both short-term and long-term fixes.

Where We Are, and Where We Want To Be

We still have a few more weeks left to complete Phase I of the labeling and question quality project. But I am thrilled to confidently say that Quizium’s question quality has already improved significantly. A case in point:

Image 4: Question generated from the same passage, before and after the question generation AI improvements

With our ongoing efforts in data labeling, model training, and quality control, I am personally hoping to eventually reduce our dependence on extensive human labor. Our ultimate goal is to empower and evolve the AI model to a stage where it can self-assess while consistently delivering high-quality questions.

And speaking of high-quality questions, we recognize the diversity in learning domains and their unique needs, such as in mathematics and language learning. Each field requires a tailored approach in question design. We, therefore, are already keenly eyeing the future milestone, which is to optimize the model even further so that it adapts specifically to different academic fields.

A huge shoutout to all members of the team for their commitment and desire to make Quizium better and to provide our users with a more efficient, ever-evolving learning experience. We hope you will be a part of this journey and delve into what Quizium stands for!
