BERT for Text Classification:

Deekshithajp
Dec 9, 2021


BERT has been the talk of the town for the past year. It's a unique Natural Language Processing (NLP) technique that Google AI Language researchers open-sourced at the end of 2018, and it's a crucial innovation that has taken the Deep Learning world by storm thanks to its outstanding performance.

BERT stands for Bidirectional Encoder Representations from Transformers. It was designed to pre-train deep bidirectional representations from unlabeled text by conditioning all layers on both left and right context. As a result, the pre-trained BERT model can be fine-tuned for tasks such as question answering and language inference without extensive task-specific architecture modifications. Its goal is to produce a general-purpose language representation model.

BERT is conceptually simple and empirically powerful.

First, I will go over the theoretical side of BERT and why it is actually needed. The absence of sufficient training data is one of the most significant issues in NLP. Although a vast amount of text is available, it has to be split across many different domains to build task-specific datasets, and we typically end up with only a few thousand or a few hundred thousand human-labeled training examples. Unfortunately, deep learning-based NLP models need far more data than that to work well: they show significant improvements when trained on millions, if not billions, of annotated training examples.

To help close this data gap, researchers have devised a number of strategies for training general-purpose language representation models on the immense amounts of unannotated text available on the web (this is known as pre-training). When working on tasks like question answering and sentiment analysis, these general-purpose pre-trained models can then be fine-tuned on smaller task-specific datasets. Compared with training on the small task-specific datasets from scratch, this strategy produces significant accuracy improvements.

That is why BERT is such a major breakthrough. It allows models to reach high accuracy on downstream tasks with far less labeled data. Another great aspect of BERT is that it is free to download and use. We can use the pre-trained BERT models to extract high-quality language features from our text data, or we can fine-tune them on a specific task, such as sentiment analysis or question answering, with our own data to produce state-of-the-art predictions. Since BERT uses a bidirectional approach, it learns more about a word's context than a model trained in only one direction would. This additional information is what makes the masked LM technique described below possible.

What is the key idea of BERT?

Basically, the language modelling task is to "fill in the blank" based on context. Before BERT, a language model would look at the text sequence during training either left-to-right or as a combination of left-to-right and right-to-left. This one-directional approach works well for generating sentences: we can predict the next word, append it to the sequence, and keep predicting the following word until we have a complete sentence. BERT, in contrast, is a bidirectionally trained language model, which gives it a much better sense of language context and flow than single-direction language models.

Moreover, BERT is based on the Transformer architecture rather than on LSTMs. A Transformer works by repeating a small number of steps over and over again, and in each step it uses an attention mechanism to grasp the relationships between all the words in a sentence, regardless of their position. Given the sentence "I arrived at the bank after crossing the river," the Transformer can learn to attend to the word "river" and decide in a single step that "bank" refers to a riverbank.
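To make the attention idea concrete, here is a minimal self-attention sketch in numpy. This is a toy illustration only, not BERT's multi-head implementation: each word vector is compared with every other word vector to produce attention weights, and each output is a weighted mix of all the word vectors, independent of their distance in the sentence.

import numpy as np

def self_attention(x):
    # x: (sequence_length, hidden_size) word vectors
    d_k = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_k)                            # word-to-word relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ x                                         # each output mixes information from all positions

np.random.seed(0)
sentence = np.random.randn(9, 16)      # 9 toy "word" vectors of size 16
print(self_attention(sentence).shape)  # (9, 16)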

What is the approach behind it?

BERT relies on a Transformer (the attention mechanism that learns contextual relationships between the words in a text). A basic Transformer consists of an encoder that reads the text input and a decoder that produces a prediction for the task. BERT only requires the encoder part, because its objective is to build a language representation model. The encoder receives a sequence of tokens, which are converted into vectors and then processed by the neural network. To get BERT to work with your dataset, some metadata has to be added to the input: token embeddings (including the special [CLS] and [SEP] tokens) mark the beginning of the input and the boundaries between sentences, segment embeddings make it possible to distinguish between the two sentences, and positional embeddings encode the position of each word in the sequence.

Fig 1: The input representation for BERT: The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
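As a rough sketch of Fig 1 (the sizes and token ids below are made up for illustration; real BERT uses learned embedding tables with vocab size 30522, hidden size 768, and maximum length 512 for BERT-Base uncased), the input representation for each token is simply the element-wise sum of its token, segment, and position embeddings:

import numpy as np

# toy sizes for illustration only
vocab_size, hidden, max_len = 100, 16, 32

token_emb = np.random.randn(vocab_size, hidden) * 0.02    # one (learned) row per wordpiece token
segment_emb = np.random.randn(2, hidden) * 0.02           # row 0 = sentence A, row 1 = sentence B
position_emb = np.random.randn(max_len, hidden) * 0.02    # one row per position in the sequence

# "[CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]" as made-up token ids
token_ids = np.array([1, 7, 23, 5, 42, 2, 17, 8, 31, 2])
segment_ids = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])    # which sentence each token belongs to
positions = np.arange(len(token_ids))

# the input representation is the sum of the three embeddings
input_repr = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(input_repr.shape)   # (10, 16)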

We use two different strategies to train the model:

1. Masked LM (MLM):

The concept here is that we randomly mask 15% of the words in the input with a [MASK] token, run the full sequence through BERT's attention-based encoder, and predict only the masked words based on the context provided by the other, non-masked words in the sequence. This basic masking strategy has a flaw, however: the model only learns to predict tokens where the [MASK] token appears in the input, while we want it to predict the correct token regardless of which token is present. To address this issue, of the 15% of tokens chosen for masking:

→ 80% of the time the token is replaced with the [MASK] token.
→ 10% of the time the token is replaced with a random token.
→ 10% of the time the token is left unchanged.

During training, the BERT loss function only considers masked token predictions and ignores non-masked token predictions. As a result, the model converges far more slowly than models that are left-to-right or right-to-left.
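A toy sketch of this 80/10/10 selection rule is shown below (illustrative only; real BERT works on wordpiece ids rather than strings, and the vocabulary here is made up):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    # Returns the corrupted sequence plus the positions/targets the loss is computed on.
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() > mask_prob:
            continue                                  # only ~15% of tokens are selected
        targets[i] = tok                              # the loss is computed only for these positions
        r = random.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"                   # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)       # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, targets

vocab = ["movie", "great", "boring", "plot", "acting"]
print(mask_tokens(["[CLS]", "the", "movie", "was", "great", "[SEP]"], vocab))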

2. Next Sentence Prediction (NSP):

The BERT training procedure also uses next sentence prediction to learn the relationship between two sentences. A pre-trained model with this kind of knowledge is useful for tasks like question answering. During training, the model is given pairs of sentences as input and learns to predict whether the second sentence actually follows the first in the original text.

As we saw earlier, BERT uses a special [SEP] token to separate sentences. During training, the model is fed two sentences at a time, constructed as follows:

50% of the time, the second sentence actually follows the first in the original text.
50% of the time, it is a random sentence drawn from the corpus.

Fig 2: Example for NSP

To determine whether the second sentence is connected to the first, the entire input sequence is passed through the Transformer-based model, the output of the [CLS] token is transformed into a 2×1-shaped vector by a simple classification layer, and the IsNext label is assigned using softmax.
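The sentence pairs themselves are easy to construct. A minimal sketch of the 50/50 sampling, using a made-up toy corpus, might look like this:

import random

def make_nsp_example(doc, corpus):
    # doc: list of consecutive sentences; corpus: pool of sentences to sample negatives from
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"                    # 50%: the true next sentence
    else:
        sent_b, label = random.choice(corpus), "NotNext"        # 50%: a random sentence from the corpus
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

doc = ["the man went to the store", "he bought a gallon of milk", "then he walked home"]
corpus = doc + ["penguins are flightless birds", "the stock market fell sharply"]
print(make_nsp_example(doc, corpus))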

Both Masked LM and Next Sentence Prediction are used to train the model. The goal is to reduce the combined loss function of the two techniques — “better together.”

Architecture of BERT:

BERT comes in four main pre-trained versions, depending on the scale of the model architecture. They are:

BERT-Base, Uncased: 12 layers, 768 hidden units, 12 attention heads, 110M parameters.

BERT-Large, Uncased: 24 layers, 1024 hidden units, 16 attention heads, 340M parameters.

BERT-Base, Cased: 12 layers, 768 hidden units, 12 attention heads, 110M parameters.

BERT-Large, Cased: 24 layers, 1024 hidden units, 16 attention heads, 340M parameters.

I recommend reading the original paper for additional information on the hyperparameters as well as a breakdown of the architecture and results.

Now that we've seen the fundamentals of BERT, let's look at a real-world application. For this guide, I'll be using the IMDB movie review dataset, which you can access via the Stanford download link that appears in the code below.

To begin coding, I first install ktrain, since it is not installed by default in Google Colab, and then import a few other libraries: os.path, which is used when loading the IMDB dataset, followed by numpy, tensorflow, and ktrain.

Importing the libraries

!pip3 install ktrain
import os.path
import numpy as np
import tensorflow as tf
import ktrain
from ktrain import text

Part 1: Data Preprocessing

Loading the IMDB dataset

The dataset is downloaded directly from the stanford.edu website and loaded through Keras's get_file utility. We then build the directory path that leads to the dataset and print it.

dataset = tf.keras.utils.get_file(fname="aclImdb_v1.tar.gz",
                                  origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
                                  extract=True)
IMDB_DATADIR = os.path.join(os.path.dirname(dataset), 'aclImdb')
print(os.path.dirname(dataset))
print(IMDB_DATADIR)

Creating the training and test sets

Here we use the texts_from_folder() function, which returns the training set, the test set, and the preprocessing object (preproc) that will be needed later.

(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(datadir=IMDB_DATADIR,
                                                                       classes=['pos', 'neg'],
                                                                       maxlen=500,
                                                                       train_test_names=['train', 'test'],
                                                                       preprocess_mode='bert')

Part 2: Building the BERT model

In this step, the text_classifier() function from ktrain's text module builds the BERT model from the training data and the preprocessing object.

model = text.text_classifier(name='bert',
                             train_data=(x_train, y_train),
                             preproc=preproc)

Part 3: Training the BERT model

This code uses the get_learner() function from the ktrain library to wrap the model, the training data, and the validation data in a Learner object, and then trains the BERT model for one epoch using the one-cycle learning rate policy.

learner = ktrain.get_learner(model=model,
                             train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=6)
learner.fit_onecycle(lr=2e-5, epochs=1)

This is the output of the BERT model: after one epoch of training, we get 94.01% accuracy on the IMDB test set.
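Once training finishes, ktrain can wrap the trained model and the preproc object into a predictor for scoring new reviews. This uses ktrain's standard get_predictor/predict helpers; the review text below is made up for illustration, and the save path is just an example:

predictor = ktrain.get_predictor(learner.model, preproc)
print(predictor.predict("This movie was an absolute delight from start to finish."))  # expected: 'pos'
predictor.save('/tmp/bert_imdb_predictor')   # reload later with ktrain.load_predictor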

Conclusion:

BERT is a very sophisticated language representation model and a significant milestone in the field of NLP: it has substantially expanded our capacity for transfer learning in NLP. BERT is still quite new, having only been released in 2018, yet it has already proven more accurate than prior models, despite being slower to train. If you want to dig deeper into BERT, I recommend going through the original paper.
