BERT: Step by Step with Hugging Face

Your guide to the BERT model

Abdulelah Alkesaiberi
The Startup
6 min read · May 13, 2020



In this article, we will learn what BERT is and how to implement it, so let's get started.

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. It is Google's technique for pre-training language representations for NLP. This means the machine learning community can now use BERT models that have already been pre-trained on a huge amount of text (the BERT paper reports training on the English Wikipedia, about 2,500 million words) and fine-tune them for a wide variety of NLP tasks, such as question answering, named entity recognition (NER), and classification tasks like sentiment analysis.

The BERT paper presents two model sizes: BERT Base and BERT Large. Both stack a large number of encoder layers: 12 for Base and 24 for Large. If you understand the concept of transformers, you will see that BERT is built from the encoder stack of the transformer and uses the same attention mechanism. But why is it called bidirectional?

What does bidirectional mean?

Because the transformer encoder reads the entire sequence of words at once, unlike directional models that read the input sequentially, either from left to right or from right to left. This bidirectional approach helps the model learn the meaning and intent of a word from everything that surrounds it. Since we will use BERT for toxicity classification, we will explain only the BERT steps for classification tasks.

What is the input of Bert?

BERT's input is a special sequence that starts with a [CLS] token, which stands for classification. As in the transformer, BERT takes a sequence of words (as vectors) as input, which is fed upward from the first encoder layer to the last layer in the stack. Each layer applies self-attention to the sequence, passes the result through a feed-forward network, and then hands it to the next encoder layer.
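
To make this concrete, here is a minimal sketch of how a Hugging Face tokenizer builds this special input. The checkpoint name 'bert-base-uncased' is just an illustrative choice:

    from transformers import AutoTokenizer

    # Any BERT checkpoint shows the same special tokens; this one is illustrative.
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    encoded = tokenizer('the movie was great')
    print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
    # ['[CLS]', 'the', 'movie', 'was', 'great', '[SEP]']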

What is the output of Bert?

The output of the BERT model is one vector of size hidden_size per input token, and the first position in the output corresponds to the [CLS] token. This [CLS] output can be used as the input to a classifier neural network that scores the toxicity of a comment. In the BERT paper, the authors achieved great results using only a single-layer neural network as the classifier.
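
As a rough sketch of what that looks like in code (shapes assume a BERT Base checkpoint, whose hidden size is 768):

    import tensorflow as tf
    from transformers import AutoTokenizer, TFAutoModel

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    bert = TFAutoModel.from_pretrained('bert-base-uncased')

    inputs = tokenizer('the movie was great', return_tensors='tf')
    sequence_output = bert(inputs['input_ids'])[0]   # (batch, seq_len, 768)
    cls_vector = sequence_output[:, 0, :]            # the [CLS] position: (batch, 768)

    # A single dense layer on top of [CLS] is enough for binary classification.
    probability = tf.keras.layers.Dense(1, activation='sigmoid')(cls_vector)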

Now that we understand the concepts behind BERT, we can dig into the implementation phase.

Data:

Our data comes from the Jigsaw Multilingual Toxic Comment Classification Kaggle competition. The training data is English only, while the validation and test data contain multiple languages.

Files we use:

  • jigsaw-toxic-comment-train.csv
  • validation.csv
  • test.csv

Now we get to the most interesting part… the BERT implementation:

  1. Import Libraries
  2. Run the BERT Model on TPU (for Kaggle users)
  3. Functions
    3.1 Function for encoding the comments
    3.2 Function for building the Keras model
  4. Preprocessing and configuration
    4.1 Configuration
    4.2 Import Datasets
    4.3 Tokenizer
    4.4 Encode Comments
    4.5 Prepare TensorFlow dataset for modeling
  5. Build the model
    5.1 Build the model
    5.2 Training The Model, Tuning Hyper-Parameters
    5.3 Testing The Model
  6. Predict and store the result

1. Import Libraries
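
A minimal set of imports for this walkthrough, assuming TensorFlow 2.x and the Hugging Face transformers library are installed:

    import numpy as np
    import pandas as pd
    import tensorflow as tf
    from transformers import AutoTokenizer, TFAutoModel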

2. Run the BERT Model on TPU (for Kaggle users)
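
On Kaggle, the usual pattern (in recent TensorFlow versions) is to detect the TPU and build a distribution strategy, falling back to the default strategy when no TPU is attached:

    try:
        # Locate the TPU attached to the Kaggle kernel and connect to it.
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.TPUStrategy(tpu)
    except ValueError:
        # No TPU found: fall back to the default CPU/GPU strategy.
        strategy = tf.distribute.get_strategy()

    print('Number of replicas:', strategy.num_replicas_in_sync)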

3. Functions

3.1 Function for encoding the comments

The encoding function's job is to convert words into numbers the model can work with; inside BERT, each token is then mapped to a vector that encapsulates its meaning, and similar words end up with closer vectors.
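
Here is a sketch of such a helper; regular_encode is a hypothetical name. It pads or truncates every comment to the same length so that a batch becomes one rectangular array of token ids:

    def regular_encode(texts, tokenizer, maxlen=192):
        # batch_encode_plus adds [CLS]/[SEP], pads, and truncates in one call.
        enc = tokenizer.batch_encode_plus(
            texts,
            return_attention_mask=False,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            max_length=maxlen,
        )
        return np.array(enc['input_ids'])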

3.2 Function for building the Keras model
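
A minimal sketch, assuming a binary (toxic / not toxic) target and a single dense layer over the [CLS] output, as described earlier:

    def build_model(transformer, max_len=192):
        input_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name='input_ids')
        sequence_output = transformer(input_ids)[0]   # (batch, max_len, hidden)
        cls_token = sequence_output[:, 0, :]          # vector at the [CLS] position
        out = tf.keras.layers.Dense(1, activation='sigmoid')(cls_token)

        model = tf.keras.Model(inputs=input_ids, outputs=out)
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
            loss='binary_crossentropy',
            metrics=['accuracy'],
        )
        return model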

4. Preprocessing and configuration

4.1 Configuration
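
The exact values below are assumptions that are typical for this competition, not requirements:

    EPOCHS = 3
    MAX_LEN = 192
    BATCH_SIZE = 16 * strategy.num_replicas_in_sync  # scale the batch with the TPU cores
    MODEL = 'bert-base-multilingual-cased'           # multilingual, since val/test are multilingual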

4.2 Import Datasets
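
Assuming the three competition files sit in the working directory (on Kaggle they live under a /kaggle/input/... path):

    train = pd.read_csv('jigsaw-toxic-comment-train.csv')
    valid = pd.read_csv('validation.csv')
    test = pd.read_csv('test.csv')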

4.3 Tokenizer
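
We load the tokenizer that matches the chosen checkpoint:

    tokenizer = AutoTokenizer.from_pretrained(MODEL)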

4.4 Encode Comments
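
Now we apply the encoding helper from step 3.1. The column names below follow this competition's CSVs, where the test file stores its text in a content column:

    x_train = regular_encode(train['comment_text'].tolist(), tokenizer, maxlen=MAX_LEN)
    x_valid = regular_encode(valid['comment_text'].tolist(), tokenizer, maxlen=MAX_LEN)
    x_test = regular_encode(test['content'].tolist(), tokenizer, maxlen=MAX_LEN)

    y_train = train['toxic'].values
    y_valid = valid['toxic'].values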

4.5 Prepare TensorFlow dataset for modeling
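
A sketch of the tf.data pipelines; shuffling and prefetching keep the TPU fed with batches:

    AUTO = tf.data.AUTOTUNE

    train_dataset = (
        tf.data.Dataset
        .from_tensor_slices((x_train, y_train))
        .repeat()
        .shuffle(2048)
        .batch(BATCH_SIZE)
        .prefetch(AUTO)
    )

    valid_dataset = (
        tf.data.Dataset
        .from_tensor_slices((x_valid, y_valid))
        .batch(BATCH_SIZE)
        .cache()
        .prefetch(AUTO)
    )

    test_dataset = (
        tf.data.Dataset
        .from_tensor_slices(x_test)
        .batch(BATCH_SIZE)
    )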

5. Build the model

5.1 Build the model
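
The model has to be created inside the strategy scope so that its weights live on the TPU:

    with strategy.scope():
        transformer_layer = TFAutoModel.from_pretrained(MODEL)
        model = build_model(transformer_layer, max_len=MAX_LEN)

    model.summary()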

5.2 Training The Model, Tuning Hyper-Parameters
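
Because the training dataset repeats forever, steps_per_epoch tells Keras where an epoch ends; the hyper-parameters come from the configuration step:

    n_steps = x_train.shape[0] // BATCH_SIZE

    train_history = model.fit(
        train_dataset,
        steps_per_epoch=n_steps,
        validation_data=valid_dataset,
        epochs=EPOCHS,
    )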

5.3 Testing The Model
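
Since the test set has no labels, the multilingual validation set is the natural place to measure how well the English-only training transfers:

    loss, accuracy = model.evaluate(valid_dataset)
    print(f'validation loss: {loss:.4f}, accuracy: {accuracy:.4f}')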

6. Predict and store the result
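
Finally, we predict a toxicity probability for every test comment and write the result in the submission format (assuming the test file has an id column):

    predictions = model.predict(test_dataset, verbose=1)

    submission = pd.DataFrame({'id': test['id'], 'toxic': predictions.ravel()})
    submission.to_csv('submission.csv', index=False)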

Conclusion

BERT is a powerful pre-trained model that has had a huge impact on the NLP world today. You can use BERT for many different tasks besides text classification, such as question answering and predicting masked words.

Acknowledgment

I must say thanks to my teammates (Sarah and Norah) and to my instructors in DSI7 (Irfan, Husain, Yazied, and Amjad) for helping us finish this journey :)


