Week 1: FlashCards

Published in AIN311 Fall 2022 Projects, Nov 14, 2022

Hi, we are a group of two students building an ML model for our AIN311 course.

This is the first of the many blog posts we will publish regarding this project. Stay tuned for a new post every Sunday.

Introduction

AI is changing our lives for the better day by day. As two AI Engineering students, we thought we could kill two birds with one stone: create a project for our course that also helps us study better and saves us time. With these ideas in mind, here is our project:

FlashCards

The project is a Question Generation (QG) system. Question generation is a fundamental task in NLP: given a text, the model generates questions together with their answers.
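One way to picture the input and output is the small sketch below. The names `Flashcard` and `generate_flashcards` are hypothetical, chosen only for illustration, and the body is a rule-based placeholder (it blanks out the last word of each sentence), not a trained model. Only the shape of the function, text in and question-answer pairs out, reflects the task.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Flashcard:
    question: str  # text shown on the front of the card
    answer: str    # text shown on the back of the card


def generate_flashcards(text: str) -> List[Flashcard]:
    """Placeholder generator: one fill-in-the-blank card per sentence.

    A real QG model would replace this body; this is only a stand-in
    so the input/output contract can be seen end to end.
    """
    cards = []
    # Split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sentence.rstrip(".!?").split()
        if len(words) < 2:
            continue  # too short to blank out a word
        answer = words[-1]
        question = " ".join(words[:-1]) + " ____?"
        cards.append(Flashcard(question=question, answer=answer))
    return cards


cards = generate_flashcards("Ankara is the capital of Turkey. Water boils at 100 degrees.")
for card in cards:
    print(card.question, "->", card.answer)
```

Swapping the placeholder body for a learned model would leave the calling code unchanged, which is the main point of fixing the interface first.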

Now let’s jump to the more technical stuff.

Datasets

For this project, we will use two datasets with similar formats but different content. We chose multiple datasets to build a more diverse training set, so that our model can perform better in real-world scenarios.

The first dataset is HarvestingQA (Du and Cardie, 2018), a dataset of one million paragraph-level QA pairs. For more information about the dataset, please refer to the paper "Harvesting Paragraph-Level Question-Answer Pairs from Wikipedia" [1].

Since HarvestingQA is based on Wikipedia articles, we think it is one of the most relevant datasets for educational QG systems.

The second one is TriviaQA, a reading comprehension dataset containing over 650K question-answer-evidence triples.

[Figure: an example from the TriviaQA dataset]

These datasets are less popular than other QA datasets (e.g., SQuAD), but according to their papers they are also more challenging [1][2].
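As a rough illustration of how two datasets with similar formats could be combined into one training set, here is a minimal sketch. The field names (`paragraph`, `evidence`, `context`, `source`) and the sample records are assumptions made up for this example; they are not the actual file formats of HarvestingQA or TriviaQA, which would need their own loading code.

```python
import random
from typing import Dict, List

# Hypothetical raw records; the real files use their own formats and field names.
harvesting_like = [
    {"paragraph": "Paragraph A.", "question": "Question A?", "answer": "Answer A"},
]
trivia_like = [
    {"evidence": "Evidence B.", "question": "Question B?", "answer": "Answer B"},
]


def to_common(record: Dict[str, str], context_key: str, source: str) -> Dict[str, str]:
    """Map one raw record to a shared (context, question, answer, source) schema."""
    return {
        "context": record[context_key],
        "question": record["question"],
        "answer": record["answer"],
        "source": source,
    }


def merge_datasets(seed: int = 0) -> List[Dict[str, str]]:
    """Normalize both sources, concatenate them, and shuffle reproducibly."""
    merged = [to_common(r, "paragraph", "harvesting") for r in harvesting_like]
    merged += [to_common(r, "evidence", "trivia") for r in trivia_like]
    random.Random(seed).shuffle(merged)
    return merged


training_set = merge_datasets()
print(len(training_set), sorted(r["source"] for r in training_set))
```

Keeping a `source` field makes it possible to check later whether the model does better on one dataset than on the other.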

Work Plan

So far, we have decided on the goal of our project and the datasets we will use. We also examined popular papers related to our project to get an idea of its technical side and the steps ahead of us [3][4].

The next step for us is to improve our knowledge of the NLP field and research language models for NLP (e.g., BERT, GPT-2, T5) in more detail.

Next week, we will share the results of this step. See you then.

İlkim İclal Aydoğan

Görkem Kola

References:

[1] Du, X., & Cardie, C. (2018, May 15). Harvesting paragraph-level question-answer pairs from Wikipedia. arXiv.org. Retrieved November 14, 2022, from https://arxiv.org/abs/1805.05942

[2] Joshi, M., Choi, E., Weld, D. S., & Zettlemoyer, L. (2017, May 13). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv.org. Retrieved November 14, 2022, from https://arxiv.org/abs/1705.03551v2

[3] Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2383–2392). Austin, Texas: Association for Computational Linguistics.

[4] Tang, D., Duan, N., Qin, T., & Zhou, M. (2017). Question answering and question generation as dual tasks.