BERT fine-tuning in TensorFlow 2.0 with the Keras API

Bruno Pistone
Dec 13, 2019


Motivation

As a machine learning engineer, I work on many different ML problems. One area in particular has kept my attention, thanks to the experience I gained during my university and professional career.

I’m talking about Natural Language Processing. 2019 was a turning point for the NLP field, especially thanks to the rise of feature-based and fine-tuning approaches built on pre-trained language models [1].

In order to stay up to date on these techniques, I started studying the BERT model [2] in more depth.

Since in my daily work I often design and develop ML architectures with the TensorFlow library, mostly through its Keras API, I started exploring a more efficient way to use a pre-trained BERT model and adapt it to everyday problems.

Searching across blogs and other internet sources, I found very few examples of how to load a pre-trained BERT model as a Keras layer and fine-tune it on different types of data.

In this article I want to show how to use the pre-trained multilingual BERT model [3] and apply transfer learning to adapt it to a different problem.

In the company I’m currently working for, we deal with different types of Machine Learning problems, ranging from common chatbot and voice bot problems to more sophisticated classification and entity recognition problems. Therefore, I decided to focus my attention on how to use the multilingual pre-trained BERT model in TensorFlow 2.0, through the Keras API, to apply transfer learning and use the extracted features in a new Deep Neural Network developed to fit a new task.

I started by taking a sample of data generated from the most frequent questions collected during the different projects in which I was involved.

I looked into the GitHub repository and related articles to find a way to use the pre-trained BERT model as a hidden layer in TensorFlow 2.0, through the Keras API and the bert-for-tf2 module [4].

After reading the papers and understanding the model better, I finally wrote an RNN with BERT embedded as a Keras layer.

Purpose of the article

In this article I would like to share a simple user and implementation guide for the BERT model as a hidden layer of a Deep Neural Network, in order to fine-tune it on a specific problem.

In this case I will show how to perform fine-tuning on a classification problem with 207 Italian sentences belonging to 30 different FAQ classes.

Prerequisites

I structured this article assuming that readers already have:

  • Good knowledge of Python
  • Knowledge of TensorFlow 2.0 and Keras
  • Knowledge of text tokenization and Natural Language Processing concepts
  • Good knowledge of Deep Neural Networks
  • Good knowledge of Recurrent Neural Networks
  • Knowledge of the BERT model

Data

The starting CSV file contains 207 sentences belonging to 30 different classes. For each class I have at least 5 different examples.

Figure: examples of training sentences

I decided to set aside a couple of examples from each class to build a validation set, to be used later during the training phase.

Figure: sentence distribution across classes

Preprocessing

As described in the PyPi documentation [4], the BERT layer requires as input an array of sequences, each with a fixed maximum length.

First of all, I created an instance of the BERT FullTokenizer, which requires as input the vocabulary file used to pre-train the BERT model.

From the GitHub repository [3] I downloaded the multilingual pre-trained model and put it into a directory.

The downloaded model contains several files; one of these is vocab.txt, which is required by the FullTokenizer class.
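
A minimal sketch of this step, assuming the model was unpacked into a local directory (the multi_cased_L-12_H-768_A-12 path below is hypothetical) and using the FullTokenizer shipped with bert-for-tf2 (its import path varies slightly between versions):

```python
import os

# The import path of FullTokenizer differs slightly across bert-for-tf2 versions
from bert.tokenization.bert_tokenization import FullTokenizer

# Hypothetical directory where the multilingual pre-trained model was unpacked
model_dir = "multi_cased_L-12_H-768_A-12"

# The multilingual cased model must not be lower-cased
tokenizer = FullTokenizer(
    vocab_file=os.path.join(model_dir, "vocab.txt"),
    do_lower_case=False,
)
```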

Using the tokenizer, I prepared my data following these steps (a short sketch follows the list):

  • Split data into training set, training labels, testing set and testing labels
  • Shuffled each sentence set
  • Tokenized each sentence set using the tokenizer described above
  • Appended the “[CLS]” and “[SEP]” tokens at the beginning and at the end of each sequence. As the BERT model requires, the “[CLS]” token stands for classification and must be placed at the beginning of each input example, while the “[SEP]” token separates sentences for the next sentence prediction task.
  • One-hot encoded each label of the label set
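
A minimal sketch of the preprocessing above; the encode_sentence and encode_dataset helpers are hypothetical names, and the labels are assumed to be integer class indices:

```python
import numpy as np

max_seq_length = 48  # chosen from the average sentence length (see the Model section)

def encode_sentence(sentence, tokenizer, max_seq_length):
    # Tokenize, add the special tokens and convert to token ids
    tokens = tokenizer.tokenize(sentence)[: max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    # Pad with the [PAD] id (0) up to the fixed length
    return token_ids + [0] * (max_seq_length - len(token_ids))

def encode_dataset(sentences, labels, tokenizer, max_seq_length, num_classes):
    x = np.array([encode_sentence(s, tokenizer, max_seq_length) for s in sentences])
    # One-hot encode the integer class labels
    y = np.eye(num_classes)[np.array(labels)]
    return x, y

# Hypothetical usage on the already shuffled and split data
# train_x, train_y = encode_dataset(train_sentences, train_labels, tokenizer, 48, 30)
```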

BERT Layer

Now it’s time to create the BERT layer for my Deep Neural Network. As described in the PyPi documentation [4], I used the BertModelLayer wrapper to create the Keras layer.

The downloaded pre-trained model also contains the file bert_config.json, which holds all the parameters required to create the layer.

Since I wanted to use the pre-trained model as is, without re-training it, I decided to freeze all the original layers wrapped in the BertModelLayer class. I made this choice because I did not have much data.
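
A minimal sketch of this step with bert-for-tf2, reusing the hypothetical model_dir from above; setting trainable = False is one possible way to keep the original BERT weights fixed:

```python
import bert

# bert_config.json is read from the model directory to build the layer parameters
bert_params = bert.params_from_pretrained_ckpt(model_dir)
bert_layer = bert.BertModelLayer.from_params(bert_params, name="bert")

# Freeze the whole layer so the original BERT weights are not updated during training
bert_layer.trainable = False
```

Note that the pre-trained weights themselves can only be loaded once the full model is built, as shown in the Model section below.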

Model

This is the final part where I will show the model definition in order to perform the fine-tuning process on my training data.

I will not describe the hyperparameter tuning phase, since it is problem specific and therefore does not add further value to this guide.

The definition of the first two layers is clearly described in the PyPi documentation of the bert-for-tf2 module [4].

The parameter max_seq_length was defined during the data analysis phase. Since the average sequence length in the training set is 48.2, I decided to set max_seq_length to 48, consistently with the tokenization step described above.

I chose not to use the maximum sequence length found in the training set, because I did not want to add noise during training. I used a value close to the average instead, so that not too much information would be lost from the input sequences.

I decided to place two Dense layers, with 256 neurons each, after the embedded BERT layer.

For regularization, I added two Dropout layers with a rate of 0.5.

As optimizer, I used Adam with a learning rate equal to 0.00001 (1e-5).
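
Putting it all together, here is a minimal sketch of the model described in this section. The pooling of the BERT output through its first ([CLS]) token, the ReLU activations and the categorical cross-entropy loss are assumptions, since they are not stated explicitly above:

```python
import os
import bert
from tensorflow import keras

max_seq_length = 48
num_classes = 30

# Token ids with fixed length, as produced in the preprocessing step
input_ids = keras.layers.Input(shape=(max_seq_length,), dtype="int32", name="input_ids")

# BERT returns one hidden vector per token: (batch, max_seq_length, hidden_size)
sequence_output = bert_layer(input_ids)

# Keep only the vector of the first ([CLS]) token as the sentence representation
cls_output = keras.layers.Lambda(lambda seq: seq[:, 0, :])(sequence_output)

x = keras.layers.Dense(256, activation="relu")(cls_output)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(256, activation="relu")(x)
x = keras.layers.Dropout(0.5)(x)
predictions = keras.layers.Dense(num_classes, activation="softmax")(x)

model = keras.Model(inputs=input_ids, outputs=predictions)
model.build(input_shape=(None, max_seq_length))

# The pre-trained weights can only be loaded once the model is built
bert.load_stock_weights(bert_layer, os.path.join(model_dir, "bert_model.ckpt"))

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```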

Summary

I saved a checkpoint at each epoch and trained the model for roughly 200 epochs, obtaining these results (a minimal training sketch follows the list):

  • loss: 0.232
  • accuracy: 0.957
  • val_loss: 0.496
  • val_accuracy: 0.897
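
A minimal sketch of the training loop with per-epoch checkpointing; the variable names (train_x, train_y, val_x, val_y), the checkpoint path and the batch size are assumptions not stated in the article:

```python
from tensorflow import keras

# Save the weights at every epoch, as mentioned above
checkpoint_cb = keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/bert_faq_{epoch:03d}.ckpt",  # hypothetical path
    save_weights_only=True,
)

history = model.fit(
    train_x, train_y,
    validation_data=(val_x, val_y),
    epochs=200,
    batch_size=16,  # assumption: the batch size is not stated in the article
    callbacks=[checkpoint_cb],
)
```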

Conclusions

In this article I have shown how to load a pre-trained model from the official GitHub repository [5], embed it into a TensorFlow Keras layer, and use it in a Deep Neural Network, fine-tuning it on my own dataset to solve a specific task.

Starting from these results, in the future I will examine how this type of Transformer-based word embedding performs in NLU and Named Entity Recognition tasks.

Thanks for your attention.

[1] https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

[2] https://arxiv.org/abs/1810.04805

[3] https://github.com/google-research/bert/blob/master/multilingual.md

[4] https://pypi.org/project/bert-for-tf2/

[5] https://github.com/google-research/bert
