Fine-Tuning BERT model using PyTorch

Akshay Prakash
6 min read · Dec 23, 2019

This blog is a continuation of my previous blog explaining the BERT architecture and the enhancements it brought to NLP. We will implement BERT using huggingface’s NLP library Transformers and PyTorch in Google Colab.

Please go through my previous blog “BERT: Bidirectional Encoder Representations from Transformers” to understand the jargon related to BERT. I have tried to keep things as simple as possible. So let’s start with the coding.

We will fine-tune the pre-trained BERT model on the CoLA dataset. The dataset consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability by their original authors. The public version provided here contains 9594 sentences belonging to the training and development sets, and excludes 1063 sentences belonging to a held-out test set. It is a supervised task in which we classify each sentence as grammatically correct (1) or not (0).

Data Preparation:

Let’s open a new Colab notebook, install huggingface’s transformers library, and import the libraries that we will need for fine-tuning the BERT model:
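A minimal sketch of that setup cell is shown below; it assumes the transformers 2.x releases that were current when this post was written, so import locations may differ slightly in newer versions:

```python
# Install Hugging Face's transformers inside the Colab notebook
!pip install transformers

import torch
import numpy as np
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef
from transformers import (BertTokenizer, BertConfig, BertForSequenceClassification,
                          AdamW, get_linear_schedule_with_warmup)
```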

We have imported BertTokenizer to run end-to-end tokenization: punctuation splitting + WordPiece. BertForSequenceClassification is the BERT model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). BertConfig is the configuration class that stores the model configuration. AdamW implements the Adam optimization algorithm (with decoupled weight decay); Adam is a variant of stochastic gradient descent with momentum, where the momentum is a moving average of the gradient rather than the gradient itself. get_linear_schedule_with_warmup creates a schedule whose learning rate increases linearly during a warm-up period and then decreases linearly.

As I’m training my model on Colab, I usually mount my Google Drive to the Colab session to use it as storage, so that I can read my data into a pandas DataFrame. Alternatively, you can upload the data into Colab from your local disk. Follow whichever step suits you. I usually go with mounting, also because I can store the trained model directly on my Drive for download.
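For example, mounting Drive and pointing at the dataset could look like this (the DATA_DIR path is just a placeholder for wherever you unzip the CoLA files):

```python
from google.colab import drive

# Mount Google Drive so the notebook can read the dataset and save the trained model
drive.mount('/content/drive')

# Placeholder location of the unzipped CoLA raw files on your Drive
DATA_DIR = '/content/drive/My Drive/cola_public/raw/'
```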

We are using the CoLA dataset, which you can download from the CoLA website. It’s a set of sentences labeled as grammatically correct or incorrect. We will use the raw version because we need the BERT tokenizer to break the text down into tokens and chunks that the model will recognize.

Once the dataset is downloaded, we read the files using pandas and do initial cleaning and pre-processing of the data to get it ready for the BERT model. We will be adding extra tokens that are only understood by BERT, as mentioned in the paper. This is also the reason why we don’t use any other algorithm to generate word embeddings for BERT.
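A sketch of reading the raw in-domain training file with pandas is below; the raw CoLA TSVs have no header row, and the column names here are ones I assign for readability:

```python
# Read the raw training set; CoLA TSV files have no header row
df = pd.read_csv(DATA_DIR + 'in_domain_train.tsv', delimiter='\t', header=None,
                 names=['sentence_source', 'label', 'label_notes', 'sentence'])

print(df.shape)   # number of training sentences and columns
df.sample(5)      # eyeball a few examples

sentences = df.sentence.values
labels = df.label.values
```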

BERT requires specifically formatted inputs. For each tokenized input sentence, we need to create:

  1. Input ids: a sequence of integers mapping each input token to its index in the BERT tokenizer vocabulary
  2. Segment mask: (optional) a sequence of 0s and 1s used to identify whether the input is one sentence or two. For single-sentence inputs, this is simply a sequence of 0s; for two-sentence inputs, there is a 0 for each token of the first sentence, followed by a 1 for each token of the second sentence
  3. Attention mask: (optional) a sequence of 1s and 0s, with 1s for all input tokens (actual words) and 0s for all padding tokens. The BERT architecture is based on the attention mechanism, which is the actual reason for BERT’s bidirectional behavior.
  4. Labels: a single value of 1 or 0. In our task, 1 means “grammatical” and 0 means “ungrammatical”
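As a quick illustration of items 1, 3, and 4 (using the bert-base-uncased tokenizer that we load properly in the next step, and a toy MAX_LEN of 16):

```python
# Illustration only: encode a single sentence and build its attention mask
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

sentence = "The cat sat on the mat."
input_ids = tokenizer.encode(sentence, add_special_tokens=True)  # adds [CLS] and [SEP]
input_ids = input_ids + [0] * (16 - len(input_ids))              # pad with 0 ([PAD]) up to length 16
attention_mask = [1 if i != 0 else 0 for i in input_ids]         # 1 = real token, 0 = padding
label = 1                                                        # 1 = grammatically acceptable
# For a single-sentence input, the segment mask would simply be sixteen 0s.
```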

We are using the “bert-base-uncased” tokenizer model; it has 12 layers, 768 hidden units, 12 attention heads, and 110M parameters, and it is trained on lower-cased English text. Hence we set the flag do_lower_case to true in BertTokenizer. In the BERT paper, the authors specify tokens like “[CLS]” and “[SEP]” to be added to mark the classification position and the separation of sentences. In the final layer, the [CLS] node is the one whose output is used for classification. These tokens are added during the pre-training phase along with extra tokens such as “[MASK]”; we don’t use masking in the fine-tuning phase.

We use the encode function to convert a string into a sequence of ids (integers) using the tokenizer and its vocabulary. In the encode function we convert sentences to tokens based on the BERT vocabulary, add the special tokens [CLS] and [SEP], and pad the token sequences to MAX_LEN. Padding to MAX_LEN affects the training time and the test set accuracy; try training with MAX_LEN as 128 and as 64 and you will see the difference.

Now we are ready with our encoded tokens and attention masks. We will use scikit-learn’s train_test_split method to divide our input ids into train and validation sets, then convert all arrays to torch tensors (the required data type for our model) and generate data loaders, as shown in the sketch below.
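A minimal sketch of this preparation, assuming the transformers 2.x API and example values for MAX_LEN, the batch size, and the random seed:

```python
MAX_LEN = 128
BATCH_SIZE = 32

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Tokenize every sentence, adding [CLS]/[SEP], then pad each sequence to MAX_LEN with 0s
input_ids = [tokenizer.encode(sent, add_special_tokens=True, max_length=MAX_LEN)
             for sent in sentences]
input_ids = [ids + [0] * (MAX_LEN - len(ids)) for ids in input_ids]

# Attention mask: 1 for real tokens, 0 for padding
attention_masks = [[float(i > 0) for i in ids] for ids in input_ids]

# Split into train and validation sets (same seed so ids and masks stay aligned)
train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    input_ids, labels, random_state=2018, test_size=0.1)
train_masks, val_masks, _, _ = train_test_split(
    attention_masks, input_ids, random_state=2018, test_size=0.1)

# Convert everything to torch tensors, the required input type for our model
train_inputs = torch.tensor(train_inputs)
val_inputs = torch.tensor(val_inputs)
train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)
train_masks = torch.tensor(train_masks)
val_masks = torch.tensor(val_masks)

# Wrap the tensors in DataLoaders so we can iterate over batches during training
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data),
                              batch_size=BATCH_SIZE)

val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_dataloader = DataLoader(val_data, sampler=SequentialSampler(val_data),
                            batch_size=BATCH_SIZE)
```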

Training Phase:

Now we are ready with our data, so we are good to start training. We will be using huggingface’s BertForSequenceClassification model, which has a sequence classification/regression head on top (a linear layer on top of the pooled output) for classification tasks. The final layer of the pre-trained BERT model will be trained on our CoLA dataset.

There are BERT-Base and BERT-Large models from the Google paper. Uncased means that the text has been lower-cased before WordPiece tokenization, e.g., John Smith becomes john smith; the Uncased model also strips out any accent markers. Cased means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). We will be using the “bert-base-uncased” model with BertForSequenceClassification.
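Loading the pre-trained model and moving it to the GPU can be sketched as follows (assuming the transformers 2.x API):

```python
# Load pre-trained BERT with a freshly initialized classification head for 2 labels
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Use the Colab GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
```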

For the purposes of fine-tuning, the authors recommend the following hyperparameter ranges:

Batch size: 16, 32

Learning rate (Adam): 5e-5, 3e-5, 2e-5

Number of epochs: 2, 3, 4
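Picking values from these ranges, the optimizer and schedule could be set up as in the sketch below (the epsilon value and the zero warm-up steps are example choices of mine, not from the paper):

```python
EPOCHS = 4
LEARNING_RATE = 2e-5

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)

# Total number of optimization steps = batches per epoch * number of epochs
total_steps = len(train_dataloader) * EPOCHS

# Linear warm-up (here 0 steps) followed by linear decay of the learning rate
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
```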

Now that our model is ready, we will feed in the data to train it. The whole training phase can be summarized as follows:

  • We need to set the BERT model to train mode, as the mode after loading the pre-trained weights is evaluation (eval).
  • We iterate over the batch and unpack our data into inputs and labels. Load data onto the GPU for acceleration.
  • PyTorch by default accumulates the gradients calculated in the previous pass. Clear those previously accumulated gradients.
  • Forward pass: feeding input data through all the neurons in the network from first to last layer.
  • Backward pass (back-propagation): computing the gradient of the loss with respect to each weight, which is what drives the learning. The computation runs from the last layer backward to the first layer.
  • After the backward pass, the network performs a parameter update via the optimizer.step() function, based on the current gradient (stored in the .grad attribute of each parameter) and the update rule.
  • Then we call the scheduler.step() function to update the learning rate.
  • For validation, unpack our validation data into inputs and labels
  • Load the data onto the GPU for acceleration
  • Again we perform a forward pass (feed the input data through the network) to predict the outputs
  • Compute loss on our validation data and track variables for monitoring progress. We are using sklearn metrics to calculate Matthews correlation coefficient (MCC) as MCC is the metric used by the wider NLP community to evaluate performance on CoLA. With this metric, +1 is the best score, and -1 is the worst score

Please find the code snippet below:
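The sketch below follows the bullet points above and assumes the transformers 2.x API, in which the model returns a (loss, logits) tuple when labels are passed (the gradient clipping line is an extra touch of mine, not listed above):

```python
for epoch in range(EPOCHS):

    # ----- Training -----
    model.train()                      # put the model in train mode
    total_train_loss = 0

    for batch in train_dataloader:
        # Unpack the batch and load it onto the GPU
        b_input_ids, b_input_mask, b_labels = tuple(t.to(device) for t in batch)

        model.zero_grad()              # clear previously accumulated gradients

        # Forward pass; with labels supplied, the first output is the loss
        loss, logits = model(b_input_ids,
                             token_type_ids=None,
                             attention_mask=b_input_mask,
                             labels=b_labels)
        total_train_loss += loss.item()

        loss.backward()                # backward pass: compute gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip exploding gradients
        optimizer.step()               # parameter update from the current gradients
        scheduler.step()               # update the learning rate

    print('Epoch {} average training loss: {:.3f}'.format(
        epoch + 1, total_train_loss / len(train_dataloader)))

    # ----- Validation -----
    model.eval()
    predictions, true_labels = [], []

    for batch in val_dataloader:
        b_input_ids, b_input_mask, b_labels = tuple(t.to(device) for t in batch)

        with torch.no_grad():          # no gradients needed for evaluation
            (logits,) = model(b_input_ids,
                              token_type_ids=None,
                              attention_mask=b_input_mask)

        predictions.extend(np.argmax(logits.cpu().numpy(), axis=1))
        true_labels.extend(b_labels.cpu().numpy())

    print('Validation MCC: {:.3f}'.format(matthews_corrcoef(true_labels, predictions)))
```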

Testing Phase:

Now we’ll load the “in_domain_dev.tsv” dataset and prepare inputs just as we did with the training set. We were able to achieve a 0.549 MCC score in just a few training epochs, without doing any hyperparameter tuning (adjusting the learning rate, epochs, batch size, ADAM properties, etc.).
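A matching evaluation sketch, reusing the tokenizer, model, and helper variables defined earlier (same transformers 2.x assumptions):

```python
# Prepare the dev set exactly like the training data
df_test = pd.read_csv(DATA_DIR + 'in_domain_dev.tsv', delimiter='\t', header=None,
                      names=['sentence_source', 'label', 'label_notes', 'sentence'])

test_ids = [tokenizer.encode(s, add_special_tokens=True, max_length=MAX_LEN)
            for s in df_test.sentence.values]
test_ids = [ids + [0] * (MAX_LEN - len(ids)) for ids in test_ids]
test_masks = [[float(i > 0) for i in ids] for ids in test_ids]

test_data = TensorDataset(torch.tensor(test_ids),
                          torch.tensor(test_masks),
                          torch.tensor(df_test.label.values))
test_dataloader = DataLoader(test_data, sampler=SequentialSampler(test_data),
                             batch_size=BATCH_SIZE)

# Predict and score with the Matthews correlation coefficient
model.eval()
preds, gold = [], []
for batch in test_dataloader:
    b_input_ids, b_input_mask, b_labels = tuple(t.to(device) for t in batch)
    with torch.no_grad():
        (logits,) = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    preds.extend(np.argmax(logits.cpu().numpy(), axis=1))
    gold.extend(b_labels.cpu().numpy())

print('Dev set MCC: {:.3f}'.format(matthews_corrcoef(gold, preds)))
```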

The huggingface transformers documentation lists the expected accuracy for this benchmark here.

You can also look at the official leader board here.

Conclusion:

In this blog, we learned that using BERT we can create a high-quality NLP model with minimal training time and data. To improve the accuracy of our model, we can use the BERT-Large model, tweak the hyper-parameters, and so on. The code for this entire analysis can be found here.

References:

  1. Paper on BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin et al.
  2. HuggingFace’s Transformers: State-of-the-art Natural Language Processing
