Fine Tuning BERT: Multilabel Text Classification

Nilesh Barla
PerceptronAI
Jun 24, 2020

It is not news for the NLP community that the state-of-the-art pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) is able to achieve remarkable results in many natural language processing tasks. More broadly, the industry-wide adoption of transformer architectures (BERT, GPT-2, etc.) has shifted sequence-to-sequence work such as machine translation away from conventional encoder-decoder architectures trained from scratch and toward taking the pre-trained weights of these language representation models and fine-tuning them to achieve the best results possible.

Based on the work of Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang, published in their paper How to Fine-Tune BERT for Text Classification?, we have tried to replicate some (though not all) of their techniques and reached an accuracy of 89% on the training set and 87% on the test set.

BERT for text-classification

To recall the important features of BERT, let's revisit a few key points.

The BERT-base model contains an encoder with 12 transformer blocks, 12 self-attention heads, and a hidden size of 768. BERT takes a sequence length of 512 tokens or less. When tokenizing the input, BERT adds two important special tokens: [CLS] and [SEP]. The former carries a special classification embedding, and the latter is used for separating segments.

There are also other special tokens, such as [PAD], [UNK], and [MASK].
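As a quick illustration, here is a minimal sketch of how these special tokens show up when a sentence is encoded; it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which the post names explicitly.

# Minimal sketch: inspecting BERT's special tokens.
# Assumes the Hugging Face transformers library and 'bert-base-uncased'.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

encoding = tokenizer(
    "Fine-tuning BERT is fun",
    max_length=12,
    padding='max_length',     # pad the sequence up to max_length with [PAD]
    truncation=True,
    add_special_tokens=True,  # prepend [CLS] and append [SEP]
)

print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
# Roughly: ['[CLS]', 'fine', '-', 'tuning', 'bert', 'is', 'fun', '[SEP]', '[PAD]', ...]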

For multi-label text classification, BERT takes the final hidden state h of the first token [CLS] as the representation of the whole sequence and returns a probability distribution over the label c:

p(c | h) = softmax(W h)    (https://arxiv.org/pdf/1905.05583.pdf)

where W is the task-specific parameter matrix.

Our goal is to fine-tune all the possible parameters to maximize the log-probability of the correct label.
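In code, maximizing the log-probability of the correct label is the same as minimizing a cross-entropy loss over the classifier's logits. A minimal PyTorch sketch (the shapes and variable names are purely illustrative):

import torch
import torch.nn as nn

# Illustrative shapes: a batch of 4 examples with 3 candidate labels.
logits = torch.randn(4, 3, requires_grad=True)  # W*h for each example (unnormalized scores)
labels = torch.tensor([0, 2, 1, 0])             # the correct label c for each example

# CrossEntropyLoss applies softmax internally, so minimizing it maximizes
# the log-probability log p(c | h) of the correct label.
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()  # gradients flow into all fine-tuned parameters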

Implementations and Strategies

Our approach to fine-tuning the model starts with the dataset itself.

Initially, we removed all the unnecessary vocabulary (HTTP links, inconsistent casing, unwanted spaces, etc.) from our data using regular expressions. After preprocessing the data we used WordPiece embeddings as defined in the paper. Since BERT is a pre-trained model, we have to prepare our inputs with the same methods that were used to train the model itself.
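The post does not show the cleaning code, but a minimal sketch of this kind of regex-based preprocessing could look like the following; the exact patterns are our assumptions, not the ones used in the repo.

import re

def clean_text(text: str) -> str:
    # Illustrative cleaning step; the exact patterns are assumptions.
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # drop HTTP(S) links
    text = text.lower()                                  # normalize casing
    text = re.sub(r'\s+', ' ', text).strip()             # collapse unwanted spaces
    return text

print(clean_text("Check   THIS out: https://example.com  NOW"))
# -> "check this out: now"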

1. WordPiece Embedding

WordPiece embedding is a feature-vector representation of a word: each word (or subword) is mapped to a unique id, which the model later uses during training. BERT uses a lookup table with roughly 30K tokens in its vocabulary and 768 features in its embedding, described as rows and columns respectively. So similar words like coffee and tea will be closer to each other than words like guitar and chair.

(Image source: https://www.shortscience.org/paper?bibtexKey=journals/corr/1802.00400)

This is so that when we feed the word embeddings into the neural network, the results it produces depend on those embeddings: similar words yield similar, positively correlated representations, while dissimilar words yield negatively correlated ones. This helps in grouping sentences into a given class.
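As a rough illustration of that lookup table, one can compare its rows directly; this sketch assumes the bert-base-uncased checkpoint, and the exact similarity values will vary with the checkpoint.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# The input embedding layer is the ~30K x 768 lookup table described above.
embeddings = model.get_input_embeddings().weight  # shape: [vocab_size, 768]

def similarity(word_a: str, word_b: str) -> float:
    ids = tokenizer.convert_tokens_to_ids([word_a, word_b])
    return torch.nn.functional.cosine_similarity(
        embeddings[ids[0]], embeddings[ids[1]], dim=0
    ).item()

# Related words are expected to score higher than unrelated ones,
# though the exact numbers depend on the checkpoint.
print(similarity('coffee', 'tea'), similarity('guitar', 'chair'))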

(Image source: https://mostafadehghani.com/2019/05/05/universal-transformers/)

In NLP, we train on a general language modeling (LM) task and then fine-tune on text classification (or other tasks). This would, in principle, perform well because the model is able to use the knowledge of the semantics of language acquired from the generative pre-training.

FROM Pre-trained Word Embeddings TO Pre-trained Language Models — Focus on BERT by Adrien Sieg

Since BERT is a pre-trained model, it has its own vocabulary on which it was trained. It also uses a technique to break a new, unseen word into subwords in such a way that the subwords are present in the BERT vocabulary.

(Image source: https://www.researchgate.net/figure/Character-based-embedding-methodology_fig2_321503621)

So the idea with word embedding is to be as precise as we can and feed BERT input that is as clean as possible, while not forgetting that it has to follow BERT's own conventions.
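For example, a short sketch (again assuming the bert-base-uncased vocabulary) of how an out-of-vocabulary word is broken into known subword pieces:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# A word that is unlikely to be in the ~30K-token vocabulary is split into
# known subword pieces; '##' marks a continuation of the previous piece.
print(tokenizer.tokenize('multilabel classification'))
# Something like ['multi', '##lab', '##el', 'classification'];
# the exact split depends on the vocabulary.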

2. Building the model

While building the model we started with the basic, default setup: the BERT encoder with only a dropout layer and a linear classification head on top, without any extra layers.

import torch.nn as nn
from transformers import BertModel

class TextClassifier(nn.Module):

    def __init__(self, n_classes):
        super(TextClassifier, self).__init__()
        # PRE_TRAINED_MODEL_NAME is defined earlier, e.g. 'bert-base-uncased'
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.3)
        # Linear classification head on top of the 768-dimensional pooled output
        self.out = nn.Linear(768, n_classes)

    def forward(self, input_ids, attention_mask):
        # pooled_output is the [CLS] representation after BERT's pooler layer;
        # return_dict=False keeps the tuple output on newer transformers versions
        _, pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=False
        )
        output = self.drop(pooled_output)
        return self.out(output)
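To sanity-check the wiring, the classifier can be exercised on a single example. The checkpoint name, label count, and sentence below are our own illustrative choices, not taken from the repo.

import torch
from transformers import BertTokenizer

PRE_TRAINED_MODEL_NAME = 'bert-base-uncased'  # assumed checkpoint
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

model = TextClassifier(n_classes=3)  # 3 labels is an illustrative choice
encoding = tokenizer(
    'the service was quick and the food was great',
    max_length=32, padding='max_length', truncation=True, return_tensors='pt'
)
with torch.no_grad():
    logits = model(encoding['input_ids'], encoding['attention_mask'])
print(logits.shape)  # torch.Size([1, 3]): one unnormalized score per class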

The accuracy we got from this model was 85%.

The first phase of fine-tuning

In the first phase of fine-tuning, we started by adding an LSTM layer after the dropout layer followed by the linear layer.

self.lstm = nn.LSTM(input_size=embedding_dim,
                    hidden_size=hidden_size,
                    num_layers=num_recurrent_layers,
                    bidirectional=use_bidirectional,
                    batch_first=True)
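The post only shows the layer definition (here embedding_dim would be BERT's 768-dimensional hidden size), not how the LSTM enters the forward pass. One plausible wiring, which is our assumption rather than the repo's exact code, is to run the LSTM over BERT's token-level hidden states, since an LSTM needs a sequence to iterate over:

# A sketch of one possible forward pass with the LSTM added (an assumption,
# not the repo's exact code); self.out must be sized to the LSTM's output,
# i.e. hidden_size * 2 if the LSTM is bidirectional.
def forward(self, input_ids, attention_mask):
    sequence_output, _ = self.bert(
        input_ids=input_ids,
        attention_mask=attention_mask,
        return_dict=False,
    )                                   # shape: [batch, seq_len, 768]
    lstm_out, _ = self.lstm(self.drop(sequence_output))
    last_step = lstm_out[:, -1, :]      # final time step of the (bi)LSTM
    return self.out(last_step)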

The accuracy dropped by 2%.

The second phase of fine-tuning: Hyperparameters

In the second phase of fine-tuning, we started playing with the learning rate and the number of epochs. While sweeping learning rates between 1e-05 and 5e-05, we could see the difference. Eventually, adding an extra layer led us to increase the number of epochs from 3 to 6.

We followed the exact methods described in How to Fine-Tune BERT for Text Classification?

We used Adam with beta1 = 0.9 and beta2 = 0.999. We also kept the dropout probability at 0.1, the base learning rate at 2e-05, and the warm-up proportion at 0.1.
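A sketch of that setup, approximated with PyTorch's AdamW and the linear warm-up scheduler from the transformers library (model and train_data_loader stand for the classifier and DataLoader built earlier; the repo may implement this differently):

import torch
from transformers import get_linear_schedule_with_warmup

EPOCHS = 6
BASE_LR = 2e-5
WARMUP_PROPORTION = 0.1

# Adam-style optimizer with beta1 = 0.9 and beta2 = 0.999, as described above.
optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR, betas=(0.9, 0.999))

total_steps = len(train_data_loader) * EPOCHS  # train_data_loader built earlier
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(WARMUP_PROPORTION * total_steps),
    num_training_steps=total_steps,
)
# After each optimizer.step() in the training loop, call scheduler.step().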

With a similar setting we didn't get the same results at first, and the training was slow. Only when we increased the number of epochs did we find a peak accuracy of 89% on the training set and 87% on the validation set.

The code can be found in Nilesh Barla's GitHub repo: https://github.com/Nielspace/BERT. The project is in a development phase and will be updated continuously.

The third phase of fine-tuning: Catastrophic Forgetting

We are still working on the third phase of fine-tuning, which deals with catastrophic forgetting. We will try to cover it in the next part of this blog.

Conclusion: So far

  1. Word embedding is an important step for training a BERT model. Follow all the protocols to clean the text data by removing unnecessary vocabulary.
  2. Use tools such as regular expressions and spaCy.
  3. Tokenization is crucial.
  4. Start with a very simple model; if the preprocessing is fine you will eventually get good accuracy on both the training and test datasets.
  5. Don't add extra layers before playing with the learning rates and decay factors.
  6. Choose the right batch size and sequence length. Be aggressive at the beginning by using a small sequence length of 32 or 64, then start increasing it.

Further Readings:

  1. How to Fine-Tune BERT for Text Classification?
  2. Pre-trained Word Embeddings TO Pre-trained Language Models — Focus on BERT by Adrien Sieg
  3. Universal Transformers
