TensorFlow 2 BERT: Movie Review Sentiment Analysis

BERT stands for Bidirectional Encoder Representations from Transformers. A pre-trained BERT model can be fine-tuned to create state-of-the-art models for a wide range of NLP tasks such as question answering, sentiment analysis and named entity recognition. BERT BASE has 110M parameters (L=12, H=768, A=12) and BERT LARGE has 340M parameters (L=24, H=1024, A=16), where L is the number of layers, H the hidden size and A the number of self-attention heads (Devlin et al., 2019).

The BERT model architecture is a multi-layer bidirectional Transformer encoder (see Figure 1). The authors of the BERT paper pre-train the model on 3.3 billion words using two NLP tasks: Task #1, Masked LM, and Task #2, Next Sentence Prediction (NSP).

Figure 1: BERT Architecture — BERT representations are jointly
conditioned on both left and right context in all layers (Devlin et al., 2019)

The BERT model has an interesting input representation (see Figure 2). Its input is the sum of the token embeddings, the segment embeddings and the position embeddings (Devlin et al., 2019).

Figure 2: BERT model input: token, segment and position embeddings
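
To make this sum concrete, here is a toy sketch that builds the three embedding tables with Keras layers and adds them up. The dimensions and names are illustrative assumptions only; nothing here is loaded from the pre-trained checkpoint.

```python
import tensorflow as tf

# Illustrative dimensions (BERT BASE uses a 30,522-token WordPiece vocabulary
# and a hidden size of 768); these weights are randomly initialized, not pre-trained.
vocab_size, hidden_size, max_len = 30522, 768, 500

token_emb = tf.keras.layers.Embedding(vocab_size, hidden_size)   # one vector per WordPiece token
segment_emb = tf.keras.layers.Embedding(2, hidden_size)          # sentence A vs. sentence B
position_emb = tf.keras.layers.Embedding(max_len, hidden_size)   # one vector per position

def bert_input_embeddings(token_ids, segment_ids):
    """Sum of token, segment and position embeddings, as in Figure 2."""
    positions = tf.range(tf.shape(token_ids)[-1])
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
```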

Dataset

The IMDB Dataset from Kaggle contains 50K movie reviews for natural language processing. The dataset, in CSV format, has two columns: review and sentiment. Each review is labelled as either positive or negative, so we have a binary classification problem in a supervised learning setting.
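
A minimal loading sketch is shown below. It assumes the Kaggle CSV is saved as "IMDB Dataset.csv" next to the notebook; the 80/20 split and random seed are my own choices, not taken from the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the 50K reviews; the CSV has two columns: review and sentiment.
df = pd.read_csv("IMDB Dataset.csv")
df["label"] = (df["sentiment"] == "positive").astype(int)  # positive -> 1, negative -> 0

# Hold out 20% of the reviews for testing, stratified by label.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["label"])
print(train_df.shape, test_df.shape)
```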

Data preprocessing

The important part of data preprocessing is how to construct the inputs BERT expects. The functions in the following code block serve two purposes: 1) transforming a review into the three BERT inputs (token ids, input mask and segment ids), and 2) formatting those inputs so that they can be consumed by the model during training and testing. We set the maximum sequence length to 500.
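
The sketch below captures the idea. The function names encode_review and encode_examples are illustrative, and tokenizer is assumed to be a WordPiece tokenizer (such as the FullTokenizer from the original BERT repository) exposing tokenize() and convert_tokens_to_ids().

```python
import numpy as np

MAX_SEQ_LENGTH = 500

def encode_review(review, tokenizer, max_seq_length=MAX_SEQ_LENGTH):
    """Turn one review into BERT's three inputs: token ids, input mask and segment ids."""
    tokens = tokenizer.tokenize(review)
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)               # 1 for real tokens, 0 for padding
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids = [0] * max_seq_length              # single-sentence input -> all zeros
    return input_ids, input_mask, segment_ids

def encode_examples(reviews, tokenizer, max_seq_length=MAX_SEQ_LENGTH):
    """Stack the encoded reviews into the three arrays the Keras model consumes."""
    ids, masks, segments = [], [], []
    for review in reviews:
        i, m, s = encode_review(review, tokenizer, max_seq_length)
        ids.append(i)
        masks.append(m)
        segments.append(s)
    return [np.array(ids), np.array(masks), np.array(segments)]
```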

Modelling

To build a state-of-the-art NLP model for the sentiment analysis problem, we select BERT BASE as the pre-trained model. On top of it, we add one fully connected layer with 768 ReLU units and a dropout rate of 0.1, followed by an output layer with two softmax units, which is the same approach as the google-research TensorFlow 1 BERT tutorial.
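
A sketch of such a model, written against the TF Hub Keras interface for this BERT module, is shown below. The function and layer names are mine, and the exact architecture of the original nlp_model may differ slightly.

```python
import tensorflow as tf
import tensorflow_hub as hub

def nlp_model(bert_url, max_seq_length=500):
    """BERT BASE from TF Hub + one 768-unit ReLU layer + dropout + 2-unit softmax output."""
    input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")

    # trainable=True so the BERT weights are updated (fine-tuned) during training.
    bert_layer = hub.KerasLayer(bert_url, trainable=True)
    pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

    x = tf.keras.layers.Dense(768, activation="relu")(pooled_output)
    x = tf.keras.layers.Dropout(0.1)(x)
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)

    return tf.keras.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=outputs)
```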

We create a model instance before we start to fine-tune the model in a training cycle. To do that, we pass the URL https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1 to the nlp_model function. Then, let us take a closer look at the model's summary.
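
Using the nlp_model sketch above, instantiating the model and printing its summary looks like this:

```python
bert_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1"
model = nlp_model(bert_url)
model.summary()  # prints the layer-by-layer summary shown in Table 1
```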

Table 1: nlp_model summary

Model training

During model training, we use the Adam optimizer with a learning rate of 2e-5 to minimize the categorical_crossentropy loss; these hyperparameters are the same as in the TensorFlow 1 BERT tutorial. Although fine-tuning takes time, the benefit of better model performance outweighs the computational cost. Training for one epoch takes approximately 47 minutes in Colab Pro with a single GPU. Awesome! After just one epoch of training, the nlp_model has already achieved 94% accuracy.
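
A compile-and-fit sketch with these hyperparameters is shown below. The batch size, validation split and the variables carried over from the earlier sketches (train_df, tokenizer, encode_examples) are assumptions, not values confirmed by the original notebook.

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# categorical_crossentropy expects one-hot targets, hence the to_categorical call.
train_inputs = encode_examples(train_df["review"], tokenizer)
train_labels = tf.keras.utils.to_categorical(train_df["label"], num_classes=2)

model.fit(train_inputs, train_labels, epochs=1, batch_size=16, validation_split=0.1)
```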

Results

Please note that, due to computational resource constraints, I have not conducted 10-fold cross-validation. Therefore, the 94% accuracy may differ slightly from the average accuracy over 10-fold cross-validation.

The notebook is accessible at this link.

Reference

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K., 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. [online] arXiv.org. Available at: <https://arxiv.org/pdf/1810.04805.pdf> [Accessed 19 May 2020].
