BIDAF VS BERT OBSERVATIONS

PALLAVI PANNU
Jun 29, 2019 · 3 min read

ADVANCEMENTS:

Neural Networks.

A simple feed-forward NN was not good at sequential tasks.

RNNs were good for short sentences.

RNNs were not able to model long-term dependencies.

LSTMs did this to some extent, with the help of forget gates that keep only the necessary information, but still could not handle very long sequences well.

Then we used Bidirectional LSTMs to read the sequence both from left to right and from right to left.

ATTENTION!!!!

Then came the concept of attention: in translation and other sequential tasks, instead of relying only on the output of the final encoder step, we keep an output from every encoder step and pay attention to every word.

HOW TO DO THAT????

By learning attention weights for every word, which tell the decoder how much each encoder output matters at the current step.
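As a rough sketch (NumPy, toy dimensions, no training) of what those attention weights look like: score every encoder output against the current decoder state, turn the scores into weights with a softmax, and take the weighted sum as the context vector.

```python
import numpy as np

def attend(decoder_state, encoder_outputs):
    """Dot-product attention: weight every encoder output by its
    relevance to the current decoder state.

    decoder_state:   (d,)    current decoder hidden state
    encoder_outputs: (T, d)  one hidden state per source word
    returns: context vector (d,) and attention weights (T,)
    """
    scores = encoder_outputs @ decoder_state           # (T,) one score per source word
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                  # softmax -> attention weights
    context = weights @ encoder_outputs                # weighted sum of encoder outputs
    return context, weights

# toy example: 5 source words, hidden size 8 (random values just for illustration)
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))
dec = rng.normal(size=(8,))
context, weights = attend(dec, enc)
print(weights)        # how much attention each source word receives
print(context.shape)  # (8,)
```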

Problems:

RNNs are hard to parallelize, because a sentence has to be processed word by word.

CNN

CNNs help to solve this problem: each word can be processed at the same time, without depending on the previous words having been processed first.

However, CNNs do not capture dependencies between distant words well when translating sequences.

Then came the concept of Transformers!!!!

Transformers use self-attention.

So BERT uses bidirectional Transformer encoders, which inherently rely on self-attention.

ARCHITECTURE COMPARISON

BIDAF ARCHITECTURE

USES LSTMS

USES BIDIRECTIONAL LSTMS (IN THE CONTEXTUAL EMBEDDING AND MODELING LAYERS)

USES ATTENTION (BIDIRECTIONAL ATTENTION FLOW BETWEEN CONTEXT AND QUERY)
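For illustration, here is a rough NumPy sketch of the attention layer described in the BiDAF paper (similarity matrix, context-to-query attention, query-to-context attention); the LSTM layers around it are omitted and all inputs and weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(H, U, w):
    """Bidirectional attention flow between context and query.

    H: (T, d) context word representations
    U: (J, d) query word representations
    w: (3d,)  trainable weight vector for the similarity function
    """
    T, d = H.shape
    J, _ = U.shape
    # similarity S[t, j] = w . [h_t ; u_j ; h_t * u_j]
    S = np.zeros((T, J))
    for t in range(T):
        for j in range(J):
            S[t, j] = w @ np.concatenate([H[t], U[j], H[t] * U[j]])
    # context-to-query: each context word attends over the query words
    A = softmax(S, axis=1)             # (T, J)
    U_tilde = A @ U                    # (T, d) attended query per context word
    # query-to-context: attend over context words using the max similarity
    b = softmax(S.max(axis=1))         # (T,)
    h_tilde = b @ H                    # (d,) single attended context vector
    H_tilde = np.tile(h_tilde, (T, 1)) # (T, d)
    # query-aware context representation G = [h; u~; h*u~; h*h~]
    return np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=1)

rng = np.random.default_rng(0)
G = bidaf_attention(rng.normal(size=(6, 4)),   # 6 context words
                    rng.normal(size=(3, 4)),   # 3 query words
                    rng.normal(size=(12,)))
print(G.shape)  # (6, 16)
```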

BERT ARCHITECTURE

USES TRANSFORMER ENCODERS

USES SELF-ATTENTION

USES MULTI-HEAD ATTENTION
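A small PyTorch sketch of a multi-head self-attention layer using torch.nn.MultiheadAttention; the sizes here are toy values, not BERT's (BERT-base uses a hidden size of 768 and 12 heads).

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch = 16, 4, 6, 1

# one multi-head attention layer
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(batch, seq_len, embed_dim)   # a toy sequence of 6 token vectors
# self-attention: queries, keys and values all come from the same sequence
out, attn_weights = mha(x, x, x)

print(out.shape)           # torch.Size([1, 6, 16]) - one updated vector per token
print(attn_weights.shape)  # torch.Size([1, 6, 6]) - attention from each token to every token
```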

BIDAF vs BERT

LSTMs are not able to capture long-term dependencies very well.

Even though BiDAF uses bidirectional LSTMs, which capture forward and backward context, they are outperformed by the Transformer layers in BERT.

In BiDAF, we only use attention between the two sequences, i.e. how much attention each word should pay to the words of the other sequence (context vs. query).

In BERT, we use Transformers, which rely on self-attention.

The self-attention layer helps the encoder look at the other words in the input sentence as it encodes a specific word.
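A minimal NumPy sketch of that idea, i.e. scaled dot-product self-attention: queries, keys and values all come from the same sequence, so each word's new representation is a weighted mix of every word in the sentence. The projection matrices here are random placeholders for what a real model would learn.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a single sequence X.

    X: (T, d_model) one vector per input word
    Wq, Wk, Wv: (d_model, d_k) projection matrices (learned in a real model)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (T, T): every word vs. every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row t = how word t attends to all words
    return weights @ V                                # context-aware representation of each word

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4
X = rng.normal(size=(T, d_model))
out = self_attention(X, rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)))
print(out.shape)  # (5, 4)
```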

BERT

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once.

Therefore, BERT is considered bidirectional, though it would be more accurate to say that it’s non-directional.

This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

BERT is a pretrained model.

The original BERT model was pre-trained on two tasks: masked language modeling and next sentence prediction.

MASKED LM

Before feeding word sequences into BERT, 15% of the tokens in each sequence are selected and (mostly) replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence.
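A quick way to see masked language modeling in action is the fill-mask pipeline from the Hugging Face transformers library (a sketch, assuming the bert-base-uncased checkpoint can be downloaded; the example sentence is made up).

```python
from transformers import pipeline

# BERT predicts the original token behind [MASK] from both left and right context
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("The man went to the [MASK] to buy milk."):
    print(f"{pred['token_str']:>10s}  {pred['score']:.3f}")
```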

NEXT SENTENCE PREDICTION

A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.

A sentence embedding indicating Sentence A or Sentence B is added to each token.

A positional embedding is added to each token to indicate its position in the sequence.
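All three ingredients ([CLS]/[SEP] tokens, sentence A/B segment ids, positions) can be seen by encoding a sentence pair with a BERT tokenizer; a sketch with the Hugging Face transformers library, using two made-up sentences.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode a sentence pair: [CLS] sentence A [SEP] sentence B [SEP]
enc = tokenizer("The man went to the store.", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B tokens
# positional embeddings are added inside the model, based on each token's index
```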

Why BERT is better than others????

Because the Transformer encoder reads the whole sequence at once with self-attention, BERT learns the context of each word from both directions, and as a pretrained model it can be fine-tuned for many downstream tasks.

WHICH TASKS BERT CAN PERFORM??

Classification tasks such as sentiment analysis

Question Answering tasks (e.g. SQuAD v1.1)

Named Entity Recognition (NER), etc.
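For example, SQuAD-style extractive question answering (the task BiDAF was designed for) works out of the box with a BERT-family model fine-tuned on SQuAD; a sketch with the transformers library, assuming the distilbert-base-cased-distilled-squad checkpoint is available.

```python
from transformers import pipeline

# a BERT-family model fine-tuned on SQuAD for extractive question answering
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(question="Where did the man buy milk?",
            context="The man went to the store and bought a gallon of milk.")
print(result["answer"], result["score"])
```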

