BERT — Pre-training + Fine-tuning

Dhaval Taunk · Published in Analytics Vidhya · Dec 26, 2021
Source — https://ruder.io/content/images/2021/02/fine-tuning_methods.png

Hugging Face has made using transformer-based models convenient with their Transformers API. But often, fine-tuning alone does not work well. Pre-training on the unlabelled data and then fine-tuning helps the model achieve the desired results. The Hugging Face API provides the pre-training functionality as well. In this blog post, I will explain how to perform pre-training and then fine-tuning of a transformer-based model, using BERT as the reference model.

Data Formatting

To perform pre-training, the data must be in a specific format: a plain text file (.txt) with one sentence per line. This text file is first used to train a WordPiece tokenizer on the data and then to pre-train the model on it.
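As a minimal sketch of that format (the file name corpus.txt and the sample sentences are just illustrations, not from the original post), preparing the file can be as simple as:

```python
# A minimal sketch: the pre-training corpus is a plain .txt file
# with one sentence per line. "corpus.txt" is an assumed file name.
sentences = [
    "BERT is pre-trained on large unlabelled corpora.",
    "Fine-tuning adapts the pre-trained model to a downstream task.",
]

with open("corpus.txt", "w", encoding="utf-8") as f:
    for sentence in sentences:
        f.write(sentence.strip() + "\n")
```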

Pre-training model

Train tokenizer on the text

After converting the data to the required format, the next step is to train the tokenizer on the input data. This step builds the vocabulary of the data. The code below shows how to train a WordPiece tokenizer on the text. To read more about the WordPiece tokenizer, you can refer to section 4.1 of the paper below:-

https://arxiv.org/pdf/1609.08144v2.pdf
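The original gist is not reproduced here; the following is a minimal sketch using the tokenizers library's BertWordPieceTokenizer, assuming the corpus is in corpus.txt and the vocabulary is saved to ./tokenizer (both names are assumptions):

```python
# A minimal sketch of training a WordPiece tokenizer on the corpus.
# File and directory names ("corpus.txt", "./tokenizer") are assumptions.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)

# Build the vocabulary from the one-sentence-per-line text file.
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt, which the BERT tokenizers in transformers can load later.
tokenizer.save_model("./tokenizer")
```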

Train BERT for MLM task

The next step is to pre-train BERT on the masked language modelling (MLM) task, using the same dataset we used to train the tokenizer. For the MLM task, 15% of the tokens are randomly masked, and the model is trained to predict those tokens. This functionality is available in the Hugging Face API, as shown in the code below:-
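The pre-training gist is not embedded here; below is a sketch of how this can be done with the transformers Trainer, assuming the tokenizer from the previous step lives in ./tokenizer, the corpus is corpus.txt, and the hyperparameters are illustrative rather than the post's exact values:

```python
# A sketch of MLM pre-training with Hugging Face transformers.
# Paths and hyperparameters are assumptions, not the post's exact values.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Load the WordPiece vocabulary trained earlier.
tokenizer = BertTokenizerFast.from_pretrained("./tokenizer")

# A fresh BERT model initialised from a config (random weights).
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# One training example per line of the corpus file.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="corpus.txt",
    block_size=128,
)

# The collator randomly masks 15% of the tokens for the MLM objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

training_args = TrainingArguments(
    output_dir="./bert-pretrained",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    save_steps=1_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("./bert-pretrained")
tokenizer.save_pretrained("./bert-pretrained")
```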

With that, the pre-training part is done. Let's move to the fine-tuning part.

Fine-tuning Model

Data Preparation

For the fine-tuning section, the data must be in a different format from the one used in the pre-training part. BERT takes three inputs: input_ids, attention_mask and token_type_ids. I won't go into the details of what they are; you can refer to the BERT paper for that. Here, I will explain how to compute them using the Hugging Face API. I will be using the BERT model for classification; one can adapt the code to their own needs.
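The original code gist is not included here; the following is a minimal sketch of such a Dataset class, assuming the pre-trained tokenizer is in ./bert-pretrained and that names like ClassificationDataset, MAX_LEN and the toy texts/labels are illustrative assumptions:

```python
# A sketch of converting text + labels into BERT's three inputs.
# The class name, paths, MAX_LEN and toy data are assumptions.
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast

MAX_LEN = 128
tokenizer = BertTokenizerFast.from_pretrained("./bert-pretrained")


class ClassificationDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # encode_plus returns input_ids, attention_mask and token_type_ids.
        encoding = tokenizer.encode_plus(
            self.texts[idx],
            max_length=MAX_LEN,
            padding="max_length",
            truncation=True,
            return_token_type_ids=True,
        )
        return {
            "input_ids": torch.tensor(encoding["input_ids"], dtype=torch.long),
            "attention_mask": torch.tensor(encoding["attention_mask"], dtype=torch.long),
            "token_type_ids": torch.tensor(encoding["token_type_ids"], dtype=torch.long),
            "labels": torch.tensor(self.labels[idx], dtype=torch.long),
        }


# Toy data for illustration; replace with your own texts and labels.
train_texts, train_labels = ["great movie", "terrible plot"], [1, 0]
test_texts, test_labels = ["decent film"], [1]

# DataLoaders for training and testing (batch size is an assumption).
train_loader = DataLoader(ClassificationDataset(train_texts, train_labels),
                          batch_size=32, shuffle=True)
test_loader = DataLoader(ClassificationDataset(test_texts, test_labels),
                         batch_size=32, shuffle=False)
```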

In the above code, I have used the Dataset class from torch.utils.data and BERT's tokenizer to convert the data into the required format. In the next step, I create DataLoader objects for training and testing.

Model Defining

Let's start with the model-building part for the fine-tuning. I will add two linear layers on top of BERT for classification, with dropout = 0.1 and ReLU as the activation function. One can try different configurations as well. I have defined a PyTorch class to build the model, shown in the code below:-
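That gist is not embedded here; below is a minimal sketch of such a class, assuming the pre-trained weights are in ./bert-pretrained and that the class name, hidden size and number of classes are illustrative assumptions:

```python
# A sketch of a classification head on top of the pre-trained BERT.
# The class name, hidden size and number of classes are assumptions.
import torch.nn as nn
from transformers import BertModel


class BertClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("./bert-pretrained")
        self.dropout = nn.Dropout(0.1)
        # Two linear layers with ReLU, as described in the post.
        self.fc1 = nn.Linear(self.bert.config.hidden_size, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        # Use the pooled [CLS] representation for classification.
        pooled = self.dropout(outputs.pooler_output)
        return self.fc2(self.relu(self.fc1(pooled)))
```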

Train and validation function

The last step is to define the training and validation functions to perform the fine-tuning. This is the usual training loop everyone uses in PyTorch. The code below depicts it:-
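That gist is also missing here; the following is a sketch of typical train and validation loops, continuing from the Dataset and model sketches above (optimizer, learning rate, epoch count and device handling are assumptions):

```python
# A sketch of standard PyTorch training and validation loops.
# Learning rate, epochs and loss function are illustrative assumptions.
# Builds on BertClassifier, train_loader and test_loader defined above.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)


def train_epoch(model, loader):
    model.train()
    total_loss = 0.0
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"],
                       batch["token_type_ids"])
        loss = criterion(logits, batch["labels"])
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)


@torch.no_grad()
def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(batch["input_ids"], batch["attention_mask"],
                       batch["token_type_ids"])
        preds = logits.argmax(dim=1)
        correct += (preds == batch["labels"]).sum().item()
        total += batch["labels"].size(0)
    return correct / total


for epoch in range(3):
    train_loss = train_epoch(model, train_loader)
    val_acc = evaluate(model, test_loader)
    print(f"epoch {epoch + 1}: train loss {train_loss:.4f}, val accuracy {val_acc:.4f}")
```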

Voila, now you are done with all the steps required to achieve the goal. You can try different configurations, as mentioned above, or a different task than classification. If you want the complete code, you can visit the below link:-

That's all from my side this time. If you want to read more about ML/DL, visit the below link, and if you liked the article, do give it a clap.
