WEEK #1 Question Assistant Barlas

Mehmet Berat Ersarı
AIN311 Fall 2022 Projects
4 min read · Nov 9, 2022

Hey Folks,

Today is a great day because our first blog post is being published. In this post we introduce our project. Let’s start with its name: Question Assistant Barlas. We thought, why not give our question-generating machine a human name, like Alan Turing’s Christopher? Barlas is a name from Turkish history. Let’s dive into the details of the project.

Brief Overview Of The Project

As you can tell from the project name, we will create a question-generation model. Question generation is one of the fundamental problems in AI. Question generation systems need labeled paragraphs with related questions as training data, and our model will predict a related question from a given paragraph. There is plenty of training data in English (the WebQuestions dataset [1], the SimpleQuestions dataset [2]), and most prior work has been done in English. We will try to generate questions in Turkish. That will be difficult for us due to the lack of Turkish datasets and academic work on question generation in Turkish.

Dataset

As far as we have investigated, there are only two datasets in Turkish. The first one is TQuAD [3], a Turkish question-and-answer dataset on Turkish and Islamic science history created within the scope of the TEKNOFEST 2018 Artificial Intelligence competition. It is now available as an open-source dataset. It stores paragraphs with related questions and answers in a JSON file: 275 paragraphs and 892 QA pairs in the dev set, and 2232 paragraphs and 8308 QA pairs in the train set. You can see a data point from this dataset below.
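To make the structure concrete, here is a minimal sketch of a single data point, assuming the dataset follows the SQuAD-style JSON layout; the texts and the file name are placeholders for illustration, not content copied from the dataset.

```python
import json

# A minimal sketch of a TQuAD-style (SQuAD-format) data point.
# NOTE: the texts below are made-up placeholders, not actual dataset content.
sample = {
    "title": "Ali Kuşçu",
    "paragraphs": [
        {
            "context": "Ali Kuşçu, 15. yüzyılda yaşamış bir astronom ve matematikçidir...",
            "qas": [
                {
                    "id": "1",
                    "question": "Ali Kuşçu hangi yüzyılda yaşamıştır?",
                    "answers": [{"text": "15. yüzyılda", "answer_start": 11}],
                }
            ],
        }
    ],
}

# Loading the full train split would look roughly like this
# (the file name is an assumption about how the repo ships the data):
with open("train-v0.1.json", encoding="utf-8") as f:
    tquad_train = json.load(f)
print(len(tquad_train["data"]), "articles")
```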

The other one is XQuAD [4]. XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question-answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi.
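A convenient way to grab the Turkish portion is through the Hugging Face datasets library; the sketch below assumes XQuAD is published on the Hub under the name "xquad" with a "xquad.tr" configuration (alternatively, the JSON files can be downloaded directly from the GitHub repository above).

```python
from datasets import load_dataset

# Assumption: XQuAD is available on the Hugging Face Hub as "xquad",
# with one configuration per language (e.g. "xquad.tr" for Turkish).
xquad_tr = load_dataset("xquad", "xquad.tr")

# XQuAD only ships an evaluation split, so we expect a single "validation" set
# with 1190 question-answer pairs over 240 paragraphs.
print(xquad_tr)
example = xquad_tr["validation"][0]
print(example["context"][:100])
print(example["question"])
print(example["answers"])
```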

Related Work

As we said, most prior work has been done in English. We found only one academic work on question generation in Turkish: “Automated question generation and question answering from Turkish texts” [5] by Fatih Çağatay Akyön et al., published in January 2022. They also used the TQuAD and SQuAD datasets, and they argue that theirs is the first academic work that performs automated text-to-text question generation from Turkish texts. Their work covers question answering as well as question generation. They believe that generating questions together with their answers is more accurate than generating questions alone; using the answers helps avoid questions that are too short or too long.

They propose multitask fine-tuning of the mT5 model [6]. mT5 is a multilingual variant of the T5 model [7], a flexible transformer model used in seq2seq NLP problems. According to the authors, their main contributions are an adaptation of a sentence tokenization pipeline for the highlight input format, and benchmarking of the mT5 model for Turkish question generation and answering on the TQuAD (TEKNOFEST 2018) dataset in multitask and single-task settings with different input formats. A sketch of the highlight input format follows below.
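To give a feeling for the highlight input format, here is a rough sketch of how a paragraph with a highlighted answer span could be fed to an mT5-style seq2seq model for question generation; the task prefix, the <hl> marker, and the checkpoint name are our assumptions for illustration, not the exact setup from the paper.

```python
from transformers import MT5ForConditionalGeneration, MT5TokenizerFast

# Hypothetical checkpoint name; a real run would point to an actual
# Turkish question-generation model or our own fine-tuned weights.
model_name = "our-finetuned-mt5-turkish-qg"
tokenizer = MT5TokenizerFast.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Highlight-style input: the answer span is wrapped in <hl> tokens so the
# model knows which part of the paragraph the generated question should target.
# The "generate question:" prefix follows a common convention for this setup;
# the exact format used in the paper may differ.
context = (
    "generate question: Ali Kuşçu, <hl> 15. yüzyılda <hl> yaşamış "
    "bir astronom ve matematikçidir."
)

inputs = tokenizer(context, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected output (roughly): "Ali Kuşçu hangi yüzyılda yaşamıştır?"
```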

Don’t forget: if you need help, “Better Call Barlas”.

Hope to see you next week. Bye!

by Zeynep Hafsa Dilmaç & Mehmet Berat Ersarı.

References

[1] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.

[2] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

[3] TQuAD (2019). Turkish NLP Q&A Dataset. https://github.com/TQuad/turkish-nlp-qa-dataset [accessed 11.09.2022].

[4] XQuAD (Cross-lingual Question Answering Dataset). https://github.com/deepmind/xquad [accessed 11.09.2022].

[5] Akyön FÇ, Çavuşoğlu ADE, Cengiz C, Altınuç SO, Temizel A (2022). Automated question generation and question answering from Turkish texts. Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 30, No. 5, Article 17.

[6] Xue L, Constant N, Roberts A, Kale M, Al-Rfou R et al. mT5: A massively multilingual pre-trained text-to-text transformer. CoRR 2020.

[7] Raffel C, Shazeer N, Roberts A, Lee K, Narang S et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 2020.
