The need for Bidirectional Encoder Representations from Transformers (BERT)

Henok Ademtew
3 min read · Oct 25, 2022

Bidirectional Encoder Representations from Transformers (BERT) is a free and open-source machine learning framework for natural language processing (NLP). BERT uses the surrounding text to provide context, helping computers understand the meaning of ambiguous words. After being pre-trained on text from Wikipedia, the framework can be fine-tuned with question-and-answer datasets. Its objective is to produce a language model. We could also say that BERT is a transformer neural network architecture designed for NLP.

Because it is built on deep learning techniques, BERT needs a lot of processing power to train well. Instead of training such models from scratch, it is advisable to start from publicly available pre-trained models.

Until the birth of BERT, most models could only process text in one direction. BERT changed the game by reading the context of a word from both directions, left-to-right and right-to-left. We call this bidirectional. BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text.
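To see what "bidirectional" buys you, here is a toy sketch (my own illustration, not BERT's code): when predicting a blanked-out word, a left-to-right model only sees the tokens before it, while a bidirectional model sees the tokens on both sides.

```python
def left_context(tokens, i):
    # What a left-to-right model sees when predicting position i.
    return tokens[:i]

def bidirectional_context(tokens, i):
    # What a bidirectional model sees: everything except position i.
    return tokens[:i] + tokens[i + 1:]

sentence = ["the", "[MASK]", "barked", "loudly"]
# A unidirectional model must guess from ["the"] alone;
# a bidirectional model also sees ["barked", "loudly"],
# which strongly suggests the masked word is "dog".
```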

So now you can tell that the primary technological advancement of BERT is the application of the Transformer's bidirectional training, a popular attention model, to language modeling. In contrast, earlier research looked at text sequences either left-to-right or as a combination of left-to-right and right-to-left training. The study's findings demonstrate that bidirectionally trained language models can grasp the context and flow of language more deeply than single-direction models. The authors of the paper describe a novel method called Masked LM (MLM), which makes bidirectional training possible in models where it was previously impractical. The model's architecture makes it possible to understand words in sentences effectively and contextually.

Now you might ask: what are transformers?

Transformer — a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data.

In short, a transformer includes two separate mechanisms: an encoder and a decoder. The encoder reads the text input, and the decoder produces a prediction.
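The self-attention at the heart of the encoder can be sketched in a few lines of plain Python. This is a deliberately simplified, hypothetical version: real Transformers also apply learned query/key/value projections and multiple heads, which are omitted here.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(vectors):
    """Each token vector attends to every vector in the sequence
    (both directions), weighted by scaled dot-product similarity.
    Learned Q/K/V projections and multi-head attention are omitted."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Output is a weighted average of all token vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextualized = self_attention(tokens)
```

Because every position attends to every other position, each output vector mixes in context from the whole sequence at once — this is what makes the encoder inherently bidirectional.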

BERT was trained by masking 15% of the tokens, with the goal of predicting them.

Why BERT?

The Transformer is the first transduction model to generate representations of its input and output using only self-attention, without convolutions or sequence-aligned RNNs. This makes successful sentence embeddings far more attainable than before. In practice, RNN-based designs can struggle to learn long-range dependencies within the input and output sequences and are challenging to parallelize. BERT is the result of this architectural breakthrough, combined with training the network by masking one or more words.

Masked Language Model (MLM): Randomly mask out 15% of the words in the input — replacing them with a [MASK] token — run the entire sequence through the BERT attention-based encoder and then predict only the masked words, based on the context provided by the other non-masked words in the sequence.
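The masking step described above can be sketched in pure Python. Note this is a simplified illustration: the actual BERT recipe replaces only 80% of the selected tokens with [MASK], swaps 10% for random tokens, and leaves 10% unchanged, which is omitted here.

```python
import random

MASK = "[MASK]"
MLM_PROB = 0.15  # roughly 15% of tokens are selected for masking

def mask_tokens(tokens, prob=MLM_PROB, seed=0):
    """Replace ~15% of tokens with [MASK]; return the corrupted
    sequence plus the positions and original tokens the model
    must predict from the surrounding (unmasked) context."""
    rng = random.Random(seed)
    corrupted, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < prob:
            corrupted.append(MASK)
            labels[i] = tok  # training target for this position
        else:
            corrupted.append(tok)
    return corrupted, labels

tokens = "the cat sat on the mat because it was tired".split()
corrupted, labels = mask_tokens(tokens)
```

The loss is then computed only at the masked positions, never at the ones the model can simply copy.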

BERT's training process also uses next sentence prediction (NSP). During training, the model receives pairs of sentences as input and learns to predict whether the second sentence follows the first in the original text.
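Constructing NSP training pairs can be sketched like this (a simplified, hypothetical version — real implementations draw the negative sentence from a different document, not just a different position):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples:
    half the time b genuinely follows a in the text,
    half the time b is a random sentence from elsewhere."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            j = rng.randrange(len(sentences))
            while j == i + 1:  # never pick the true next sentence
                j = rng.randrange(len(sentences))
            pairs.append((sentences[i], sentences[j], False))
    return pairs
```

The model then treats this as a binary classification problem on each pair.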

What makes BERT powerful is that the model is trained on MLM and NSP together, minimizing the combined loss function of the two strategies.
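The combined objective can be sketched as the sum of two cross-entropy losses — one averaged over the masked positions, one for the sentence-pair classification. The function names here are my own; this is an illustration of the idea, not BERT's training code.

```python
import math

def cross_entropy(probs, target):
    # Negative log-likelihood of the correct class.
    return -math.log(probs[target])

def bert_pretraining_loss(mlm_preds, mlm_targets, nsp_probs, nsp_target):
    """Combined pre-training objective: MLM cross-entropy averaged
    over the masked positions, plus NSP cross-entropy for the pair."""
    mlm_loss = sum(cross_entropy(p, t)
                   for p, t in zip(mlm_preds, mlm_targets)) / len(mlm_preds)
    nsp_loss = cross_entropy(nsp_probs, nsp_target)
    return mlm_loss + nsp_loss
```

Training on both at once pushes the model to learn token-level context (MLM) and sentence-level relationships (NSP) from the same forward pass.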
