Week 4— FlashCards

İlkim Aydoğan
AIN311 Fall 2022 Projects
Dec 11, 2022

Hi, we are a two-student group trying to create an ML model for our AIN311 course.

This is the fourth blog post of our project. Stay tuned for a new post every Sunday.

You can find the third week's post here.

Last week we looked at our datasets and evaluated which direction we wanted to take our project. This week we were going to start implementing our model and present the full implementation diagram. Unfortunately, we couldn't start implementing this week because we couldn't decide which BERT implementation would be best for us.

So in this week's blog post we will explain the basic BERT model without going into its complicated mathematical formulas.

BERT: The Basics

At its core, BERT is a bidirectional language model transformer that was pretrained on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). The abbreviation stands for Bidirectional Encoder Representations from Transformers.

The model was first published in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018).

Now let’s look at the architecture of the model based on the paper.

BERT: The Architecture

According to the paper, "BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers" (Devlin et al., 2018). To condition on both left and right context in all layers, it is built as a multi-layer bidirectional Transformer encoder. The model's architecture is defined by three parameter settings (a small configuration sketch follows the figure below):

  • Number of layers
  • Hidden size
  • Number of self-attention heads in a transformer block
BERT pre-training model architecture. Image by Devlin et al., 2018
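To make these settings concrete, here is a minimal configuration sketch, assuming the Hugging Face transformers library, showing how the published BERT_base (L=12, H=768, A=12) and BERT_large (L=24, H=1024, A=16) settings map onto these three parameters. The field names below are Hugging Face's, not the paper's notation.

```python
# A minimal configuration sketch, assuming the Hugging Face `transformers` library.
# The paper's three architecture settings map to these config fields:
#   number of layers (L)                -> num_hidden_layers
#   hidden size (H)                     -> hidden_size
#   self-attention heads per block (A)  -> num_attention_heads
from transformers import BertConfig

# BERT_base: L=12, H=768, A=12 (roughly 110M parameters)
base_config = BertConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,  # feed-forward size, 4 * H in the paper
)

# BERT_large: L=24, H=1024, A=16 (roughly 340M parameters)
large_config = BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
)
```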

And because of this design, we can fine-tune BERT with just one additional output layer to create our model for the NLP tasks at hand.

Overall pre-training and fine-tuning procedures for BERT. Image by Devlin et al., 2018

As can be seen from the figures, apart from the output layers, the pre-training and fine-tuning phases share the same architecture. The difference is that during fine-tuning all of the parameters are updated.
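To illustrate what "one additional output layer" and "tuning all of the parameters" mean in practice, here is a minimal, hypothetical PyTorch sketch, assuming the Hugging Face transformers library. The task, the class count, and the class name are illustrative assumptions on our side, not something from the paper.

```python
# A minimal sketch of fine-tuning BERT with one extra output layer,
# assuming PyTorch and the Hugging Face `transformers` library.
# `num_labels`, the classification task, and the class name are illustrative assumptions.
import torch
from torch import nn
from transformers import BertModel

class BertWithTaskHead(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        # Pre-trained BERT encoder (same architecture as in pre-training).
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # The single additional output layer added for the downstream task.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the pooled [CLS] representation for a sentence-level prediction.
        return self.classifier(outputs.pooler_output)

model = BertWithTaskHead()
# Fine-tuning updates *all* parameters, the pre-trained encoder and the new head alike:
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```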

Now let’s look at how we can fine-tune for our specific task at hand.

Note: Since BERT is built from Transformer blocks, for more information about Transformers in general and their architecture, you can refer to the paper Attention Is All You Need (Vaswani et al., 2017).

BERT: Usage in QG and QA

Luckily for us, the people who trained BERT also fine-tuned it on the SQuAD dataset for the question answering task. Let's look at the architecture of the model for this job.

Fine-tuning BERT on Question Answering Tasks: SQuAD v1.1. Image by Devlin et al., 2018
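In this setup the extra output layer simply produces a start score and an end score for every token of the passage, and the predicted answer is the span between the best start and end positions. Below is a rough usage sketch, assuming the Hugging Face transformers library; the checkpoint name is our assumption of a publicly shared SQuAD-fine-tuned BERT, not something specified in the paper.

```python
# A rough sketch of extractive QA with a BERT model fine-tuned on SQuAD,
# assuming the Hugging Face `transformers` library; the checkpoint name is
# our assumption of a publicly available SQuAD-fine-tuned BERT.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoint = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "What does the abbreviation BERT stand for?"
context = "BERT stands for Bidirectional Encoder Representations from Transformers."

# Question and passage are packed into one sequence: [CLS] question [SEP] context [SEP]
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The output layer gives a start logit and an end logit for every token;
# the answer is the span between the highest-scoring start and end positions.
start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits))
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)  # the answer span extracted from the context
```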

The Variety of BERT Models

Up until now, we have talked about the basic, raw BERT. Now let's look at some of its variants, specifically for the question generation task.

Note: Since we still haven't decided on our BERT model, I will not explain these models in detail. After deciding, I will explain only the model we choose in detail.

  • BERT_base: Devlin et al., 2018
  • BERT_large: Devlin et al., 2018
  • DistilBERT: Sanh et al., 2019
  • BERT-QG: Chan et al., 2019
  • BERT-SQG: Chan et al., 2019
  • BERT-HLSQG: Chan et al., 2019

Work Plan

This week we looked at BERT models and explained the basic BERT model.

Next week, we will hopefully decide on our BERT model, start implementing it, and present the complete implementation diagram. See you.

İlkim İclal Aydoğan

Görkem Kola

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint: arXiv:1810.04805.

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

[3] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint: arXiv:1910.01108.

[4] Ying-Hong Chan and Yao-Chung Fan. 2019. BERT for Question Generation. In Proceedings of the 12th International Conference on Natural Language Generation, pages 173–177, Tokyo, Japan. Association for Computational Linguistics.

[5] Ying-Hong Chan and Yao-Chung Fan. 2019. A Recurrent BERT-based Model for Question Generation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 154–162, Hong Kong, China. Association for Computational Linguistics.
