Studying? Let an AI Generate Q&As to Quiz You!

Generate customizable questions from just context with the mT5 model

Parinthapat Pengpun
10 min read · Jun 21, 2022

Backstory (The Motivation)

Photo by Tim Gouw on Unsplash

That was me about a month ago (figuratively speaking), cramming at 1:00 A.M. for a history exam that would take place in about 9 hours. Most of the friends I usually study with were already asleep, so I was on my own.

Normally, the technique that I use when trying to study a subject quickly is to:

  1. Very quickly skim through a section.
  2. Try to answer the questions at the end of the section.
  3. Slowly reread the parts of the section I had answered wrong.
  4. Answer those questions again.
  5. Move to different sections and then repeat the steps.

Sadly, this does not work for history, since the class was given in a lecture format (no textbooks, and we weren’t allowed access to the presentation). So instead of my ultra-fast technique, I had to slowly read my notes and hope for the best.

At one point I tried to come up with my own questions, but I quickly realized that I knew the answer to every question I came up with. 🥲

Planning the AI

Fast forward a few months, and I was accepted into the AI-Builders program! After hearing that I needed to create a final project, I thought back to my time during exams and texted my friends to ask whether they had the same problem. (Spoiler: they did.)

So for my final project, I decided that I would tackle exactly the problem described above!

To be precise, what I had in mind was a system like this:

Input text: สร้าง 2 คำถาม: เฟซบุ๊ก (อังกฤษ: Facebook) เป็นบริการเครือข่ายสังคมสัญชาติอเมริกัน สำนักงานใหญ่อยู่ที่ เมนโลพาร์ก รัฐแคลิฟอร์เนีย เฟซบุ๊กก่อตั้งเมื่อวันพุธที่ 4 กุมภาพันธ์ ค.ศ. 2004 (“Create 2 questions: Facebook is an American social networking service headquartered in Menlo Park, California. Facebook was founded on Wednesday, February 4, 2004.”)

Output text: 1. เฟซบุ๊กคืออะไร A: บริการเครือข่ายสังคมสัญชาติอเมริกัน 2. เฟซบุ๊กก่อตั้งเมื่อไร A: วันที่ 4 กุมภาพันธ์ ค.ศ. 2004 (“1. What is Facebook? A: An American social networking service 2. When was Facebook founded? A: February 4, 2004”)

And so, I quickly got to work, searching for a question generation dataset… NONE?! There were only QA datasets.

Image: a list of the QA datasets available

At that moment, though, I realized that a QA task is quite similar to question generation. And so I picked my poison: XQuAD, Thai QA (SQuAD version), and iapp-wiki-qa-dataset. I chose these datasets because their formats are easy to adapt to a question-generation style.

From this:

To this:

And then I had to do that for every single dataset. Furthermore, there were some entries within the datasets that had no answers, or even no questions (I didn’t type that wrong). I also had to remove HTML markup, double spaces, and empty parentheses.

Well… that was easier said than done: in total, I spent 5+ hours transforming the datasets. For training, the data was split into train, validation, and test sets (80%, 10%, 10%).
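To make that transformation concrete, here is a minimal sketch of how a single QA record could be turned into a question-generation training pair. The field names, the clean() helper, and the exact rules are illustrative assumptions; the real preprocessing differs per dataset.

```python
# A sketch of the QA -> question-generation conversion, with the cleanup
# steps mentioned above. Field names and helpers are hypothetical.
import re

def clean(text):
    text = re.sub(r"<[^>]+>", "", text)   # strip leftover HTML markup
    text = re.sub(r"\(\s*\)", "", text)   # drop empty parentheses
    text = re.sub(r"\s{2,}", " ", text)   # collapse double spaces
    return text.strip()

def to_qg_example(context, qa_pairs):
    # Drop pairs with a missing question or answer
    qa_pairs = [(q, a) for q, a in qa_pairs if q and a]
    if not qa_pairs:
        return None
    source = f"สร้าง {len(qa_pairs)} คำถาม: {clean(context)}"  # "create N questions: <context>"
    target = " ".join(
        f"{i}. {clean(q)} A: {clean(a)}" for i, (q, a) in enumerate(qa_pairs, start=1)
    )
    return {"source_text": source, "target_text": target}
```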

Exploratory Data Analysis

Now that the hard part is over, it’s time for some fun data analysis. Since this style of text does not give us many features, there was only so much analysis I could do. Regardless, this is what I came up with!

According to our dataset and the two charts above, most of the paragraphs have 1–5 questions. This means that our model would likely perform well on tasks where it has to generate 1–5 questions.

What we can infer from the chart above is that a longer word count in the context (source text) does not mean that the questions (target texts) will be any longer. Statistically, the target will probably be shorter. This may be beneficial to the model because it would learn to generate concise text.
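For reference, the two analyses above boil down to something like the following pandas sketch. The DataFrame, file path, and the source_text/target_text column names are assumptions, and whitespace word counts are only a rough proxy for Thai text.

```python
# A sketch of the EDA above, assuming a DataFrame with hypothetical
# "source_text" and "target_text" columns.
import pandas as pd

df = pd.read_csv("qg_dataset.csv")  # hypothetical path

# 1) Questions per paragraph: count the "A:" answer markers in each target
df["num_questions"] = df["target_text"].str.count("A:")
print(df["num_questions"].value_counts().sort_index())

# 2) Does a longer context imply longer questions? (rough whitespace word counts)
df["source_len"] = df["source_text"].str.split().str.len()
df["target_len"] = df["target_text"].str.split().str.len()
print(df[["source_len", "target_len"]].corr())
```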

From this word cloud of the most common words, we can gather that a good chunk of our dataset has to do with Thailand. Furthermore, we can also infer that a lot of the data relates to history/events, because the most frequent words include (a sketch of how the word cloud could be built follows the list):

  • ปี (year)
  • ที่ (that/at)
  • วันที่ (date)
  • สร้าง (create/build)
  • เมื่อ (when)
  • ใน (in)
  • โดย (by)
  • พระ (phra, a royal/monastic honorific)
  • ประเทศ (country)

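For completeness, a word cloud like the one above could be built roughly as follows, reusing the hypothetical df from the earlier sketch. The PyThaiNLP tokenizer engine and the font path are assumptions.

```python
# A sketch of building the word cloud, assuming the pythainlp and wordcloud packages.
from collections import Counter

from pythainlp.tokenize import word_tokenize
from wordcloud import WordCloud

counts = Counter()
for text in df["source_text"]:
    tokens = word_tokenize(text, engine="newmm")  # PyThaiNLP's default dictionary tokenizer
    counts.update(t for t in tokens if t.strip())

# WordCloud needs a font that can render Thai; this path is only an example
wc = WordCloud(font_path="THSarabunNew.ttf", width=800, height=400, background_color="white")
wc.generate_from_frequencies(counts)
wc.to_file("wordcloud.png")
```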
All in all, this exploratory data analysis gave us quite a bit of insightful information.

Metrics & Baselines

So I needed a way to get an empirical measure of how well my model is doing. How do I do that? I’m glad you asked: using metrics. There are a lot of metrics to choose from, but I chose the following (a sketch of how they can be computed follows the list):

  • METEOR: A metric created to address problems with BLEU. It is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
  • GLEU (Google-BLEU): A modified version of BLEU that correlates better with human judgement by taking the minimum of recall and precision.
  • BLEU-4: Measures n-gram overlap (up to 4-grams) between a reference and a candidate.
  • CHRF: Character n-gram F-score (computes matches at the character level).
  • ROUGE-L: Measures overlap using the longest common subsequence (LCS).
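As a rough sketch, all five metrics can be computed with the Hugging Face evaluate library. The predictions and references below are placeholders, and for Thai it usually makes sense to word-segment both sides first, since several of these metrics assume whitespace-tokenized text.

```python
# A sketch of scoring predictions against references with the `evaluate` package.
import evaluate

predictions = ["1. เฟซบุ๊กคืออะไร A: บริการเครือข่ายสังคมสัญชาติอเมริกัน"]  # model outputs (placeholder)
references = ["1. เฟซบุ๊กคืออะไร A: บริการเครือข่ายสังคมสัญชาติอเมริกัน"]   # gold targets (placeholder)

meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
gleu = evaluate.load("google_bleu").compute(predictions=predictions, references=[[r] for r in references])
bleu = evaluate.load("sacrebleu").compute(predictions=predictions, references=[[r] for r in references])
chrf = evaluate.load("chrf").compute(predictions=predictions, references=[[r] for r in references])
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)

print(meteor["meteor"], gleu["google_bleu"], bleu["score"], chrf["score"], rouge["rougeL"])
```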

I arrived at this conclusion by looking at other projects with similar ideas, for example https://github.com/patil-suraj/question_generation, and also at machine translation metrics, since the tasks are similar.

At this point it is important to mention that I’m doing this project in a data-centric way. Therefore, the baseline I will be using is a model trained on only the XQuAD dataset, so we will be able to visualize the improvements from the additional data we’ve gathered.

Building the AI

And now it’s time to start building the model! After researching for a long while, I settled on the mT5 model. Why? Well, its most useful features were:

  1. It was multilingual (meaning support for texts that are mostly Thai but include other languages)
  2. Its text-to-text architecture, which allows specifying a task prefix.

These features allowed me to implement a model that can create the exact number of questions the user needs. And so, after getting the data, I went to https://github.com/Shivanandroy/simpleT5 and quickly started prototyping. After I saw that the initial results went well, I rewrote the simpleT5 code in pytorch-lightning! That was quite painful, but well worth it, because now I had more fine-grained control over the model training.
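To show what the task prefix looks like at inference time, here is a minimal sketch using the transformers library. The checkpoint name is a placeholder (not the actual model from this project), and the generation parameters are just reasonable defaults.

```python
# A sketch of generating questions with a fine-tuned mT5 checkpoint.
# "your-username/mt5-thai-qg" is a placeholder model name.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("your-username/mt5-thai-qg")
model = AutoModelForSeq2SeqLM.from_pretrained("your-username/mt5-thai-qg")

context = "เฟซบุ๊ก (อังกฤษ: Facebook) เป็นบริการเครือข่ายสังคมสัญชาติอเมริกัน ..."
source = f"สร้าง 2 คำถาม: {context}"  # the task prefix asks for exactly 2 questions

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=256, num_beams=5, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```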

Model Variants

The original variant of the model was the default variant. However, after some testing (manual, not metrics), I realized that the model messes up when it tries to create a question involving decimal numbers.

This possibly occurs because the default model uses “1.” or “2.” as the separator. So the model could be confused about whether the “2.” in a generated “2.3 ล้านคน” (“2.3 million people”) marks question number two or not.

Separated Variant

This led to the creation of the separated variant. This model was trained like this:

  • separated: “What is the meaning of life and the universe? A: 42<sep> What is 42? A: The meaning of life and the universe.”
  • default: “1. What is the meaning of life and the universe? A: 42 2. What is 42? A: The meaning of life and the universe.”

But… the problem is, now the model has no idea how many questions to generate because the numbers are gone! (Please bear with me)

Number_Separated Variant

This then led to the creation of the number_separated model, which was supposed to fix the problems of the separated variant. I tried to solve this by changing the separator tokens back to numbers (see the sketch after the comparison below):

  • number_separated: “<1> What is the meaning of life and the universe? A: 42 <2> What is 42? A: The meaning of life and the universe.”
  • separated: “What is the meaning of life and the universe? A: 42<sep> What is 42? A: The meaning of life and the universe.”
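To make the difference between the variants concrete, a small helper like the hypothetical one below could build the target string for each format from the same question-answer pairs. (In practice, separator tokens like <sep> or <1> usually also need to be handled by the tokenizer.)

```python
# A sketch of building the three target formats; names are illustrative.
def format_target(qa_pairs, variant="default"):
    if variant == "default":
        return " ".join(f"{i}. {q} A: {a}" for i, (q, a) in enumerate(qa_pairs, 1))
    if variant == "separated":
        return "<sep> ".join(f"{q} A: {a}" for q, a in qa_pairs)
    if variant == "number_separated":
        return " ".join(f"<{i}> {q} A: {a}" for i, (q, a) in enumerate(qa_pairs, 1))
    raise ValueError(f"unknown variant: {variant}")

pairs = [("What is the meaning of life and the universe?", "42"),
         ("What is 42?", "The meaning of life and the universe.")]
print(format_target(pairs, "number_separated"))
# <1> What is the meaning of life and the universe? A: 42 <2> What is 42? A: The meaning of life and the universe.
```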

This model was quite promising, yet it was still a little confused about how many questions to generate. The only way to solve this was with more data. But how do you increase data without really increasing data?

Augmented_Number_Separated Variant

The answer: data augmentation. A single row of data which says สร้าง 10 คำถาม (“create 10 questions”) can be split into 10 rows, each saying:

  • สร้าง 1 คำถาม…
  • สร้าง 2 คำถาม…
  • สร้าง 3 คำถาม…

And so on. Using this technique of data augmentation (sketched below), I was able to increase the number of rows in our dataset from ~4,500 to ~14,000!! The formatting for this model is the same as the number_separated variant, but it was trained on the augmented dataset.
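One way this augmentation could be implemented is sketched below, assuming each example stores its context together with a list of question-answer pairs and reusing the hypothetical format_target helper from the variant sketch above.

```python
# A sketch of the augmentation: one example with N questions becomes N rows
# asking for 1..N questions. The "dataset" structure is hypothetical.
def augment_example(context, qa_pairs):
    rows = []
    for k in range(1, len(qa_pairs) + 1):
        rows.append({
            "source_text": f"สร้าง {k} คำถาม: {context}",  # "create k questions: <context>"
            "target_text": format_target(qa_pairs[:k], "number_separated"),
        })
    return rows

augmented = [row for ex in dataset for row in augment_example(ex["context"], ex["qa_pairs"])]
```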

This was by far the most promising of the models. It could reliably handle text that had numbers in it and had a lower chance of generating an invalid answer than the default variant.

The limitation of this model is that it isn’t as reliable in general when compared to the default variant. This may be because the dataset is a little too small for this style of training to work properly. So for the most part, if the data does not include decimal numbers, the default variant is the best for the task.

Model Validation

To validate the model, I ran it across the metrics. These are the results:

So at first glance, it would seem that aug-numsep (augmented_number_separated) is the best model. But after manually scrolling through some of the predictions and labels, it became clear that the aug-numsep variant was good at generating questions but had trouble generating the exact number of questions specified.

And so, this is the reason I decided to do error analysis on the default variant, since it is, overall, the best variant.

Error Analysis

After looking at the model’s predictions versus the labels, I grouped the various errors into these types:

  • Incomprehensible Generated Text (grammar so wrong that the question is impossible to understand)
  • No Answers in Generated Text
  • Wrong Amount of Questions
  • Repeating Questions
  • Wrong Answers/Content

Since I couldn’t come up with any automated way, I manually went through the 449 predictions & labels and classified the errors. That was very tiring, but I learned a lot of history in the process, haha.

Out of the 449 elements in the test set, only 30% produced errors. Although that might seem like a lot to some people, I’m extremely happy seeing such results as a beginner!

These were the types of errors that happened in that 30%.

Deployment

This. CSS was one of the hardest things I had to work with. I don’t remember it being this hard? Anyway, here’s the rundown of the architecture:

Frontend

I was initially tempted to try streamlit, but I couldn’t get the huge mT5 model to work on Streamlit Cloud. And if I was going to simply write an API wrapper in Python anyway, I figured I should sharpen my webdev skills and build something cool & impressive with React at the same time.

Isn’t that just gorgeous? I spent about 12 hours on it in order to add:

  • Animations
  • Dark Mode
  • Responsiveness
  • Beautiful UI

And that hard work paid off! 😍

Server

Alright, this one was really fun. Since I really didn’t want to disturb the amazing people in the AI-Builders program or eat up the group’s AWS credits, I initially did not want to deploy on the AWS instance they had given me. So I tried:

  • Google App Engine 💣
  • Heroku 💣
  • Self-hosting on my RPI4 💣

Suffice to say, they all exploded. And so I was left with no choice but to ask the staff in the AI-Builders program to help provision me an AWS instance to use for inference. My conscience was eased, though, after I re-read some old messages and realized they had a lot of AWS credits left.

So I quickly put up my “fast api” 😆 and deployed it onto AWS. And I tried to call my server from my frontend…

Frontend: Hey buddy, can I get api?

Github Pages: CANNOT CALL HTTP RESOURCE FROM HTTPS!!!!

Yep. So I spent 2 more hours figuring out how to set up a reverse proxy with Caddy in order to get HTTPS working. Once that worked, seeing the text pop up on my frontend was like a dream come true. But my work wasn’t done yet.

I realized that my server was doing the inference every single time the API was called. The solution? Caching. To do so, I set up Redis on the EC2 instance and then used a package called fastapi-cache2 to cache the results of each call for 7 days.
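A minimal sketch of that caching setup with fastapi-cache2 and a Redis backend is below. The endpoint name, query parameters, and generate_questions() stub are placeholders, not the project’s actual API.

```python
# A sketch of caching inference results for 7 days with fastapi-cache2.
from fastapi import FastAPI
from fastapi_cache import FastAPICache
from fastapi_cache.backends.redis import RedisBackend
from fastapi_cache.decorator import cache
from redis import asyncio as aioredis

app = FastAPI()

def generate_questions(context: str, n: int) -> list[str]:
    ...  # placeholder for the actual mT5 inference call

@app.on_event("startup")
async def startup():
    redis = aioredis.from_url("redis://localhost:6379")
    FastAPICache.init(RedisBackend(redis), prefix="qg-cache")

@app.get("/generate")
@cache(expire=60 * 60 * 24 * 7)  # cache each unique request for 7 days
async def generate(context: str, num_questions: int = 2):
    return {"questions": generate_questions(context, num_questions)}
```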

So finally, I was left with an amazing API and an amazing frontend!

Conclusion

All thanks to the folks at the AI-Builders program, I was able to learn and build all of this. Without them, it wouldn’t have been possible. Thank you to all the respected teachers, mentors, staff members, and my colleagues.
