Fine-tuning BERT with TensorFlow estimators in only a few lines of code

Arturo Sánchez Palacio
Published in Serendeepia
Oct 25, 2019
[Figure: cloud of words related to Deep Learning]

In this article we briefly introduce BERT, both theoretically and practically. BERT is one of the most recent deep learning models for Natural Language Processing (NLP) and has proven to be quite a revolution in the field.

BERT (which stands for Bidirectional Encoder Representations from Transformers) is the first model to use the bidirectional training of Transformers to solve language-related problems such as question answering or natural language inference, with significant improvements over the best results up to that moment. For instance, the GLUE score (a benchmark designed to evaluate and analyze model performance across a wide range of natural language understanding tasks) rises to 80.5% (from 72.8%), and the F1 score on the SQuAD test set (a question-answering benchmark) reaches 93.2% (from 90%).

BERT can be used for a wide variety of language tasks by adding only a small layer on top of the core model. Examples include classification (for instance sentiment analysis, as we will see right below) and Named Entity Recognition (NER).

BERT has been pre-trained on BookCorpus and Wikipedia and requires fine-tuning specific to the task we are trying to accomplish.

But how exactly does BERT work?

BERT makes use of the Transformer, an attention-based architecture that learns contextual relations between words (or sub-words) in a text. Attention mechanisms had already been used with quite good results, but the key novelty of this model is the removal of directionality. Directional NLP models read the text input sequentially (either from left to right or from right to left), whereas BERT reads the whole sentence at once. This allows the model to learn the context of a word based on all of the words around it, not only the previous or the following ones.

The core issue when training language models is to define what we are aiming to predict. A natural approach would be to predict the next word based on the previous ones: “Yesterday I had pasta for ___”. However, this implies a directional approach, and as we said before, the aim of this model is to avoid directionality. BERT combines two training objectives to overcome this problem: Masked Language Model and Next Sentence Prediction.

Masked Language Model

The model replaces 15% of the words in each sentence with a special token (the mask) and then attempts to predict these masked words based on the context given by the non-masked words, e.g. “Yesterday I had [MASK] for dinner”. The BERT loss function is then computed taking into account only the predictions at the masked positions and ignoring the rest. Because of this, the model takes longer to converge than a directional model, but in return it is much more aware of context.

Next Sentence Prediction

The other training objective is related to the context between sentences rather than within sentences. The model receives pairs of sentences and learns to decide whether the second sentence is the one that actually follows the first in the text. The proportion of true and false pairs is 50-50 (i.e., half of the pairs presented to the model are formed by consecutive sentences and the other half are not).

When training BERT, we aim to minimize the combined loss from both objectives, i.e., we train the model to get as many masked words right as possible while still capturing the order of the sentences.

These two training strategies are responsible for allowing the neural network to learn the context of the sentences. Finally, depending on the specific task we are aiming to solve, we add one or two extra task-specific layers (for classification, NER or question answering).

Brief Practical Example

Let’s walk through an example of binary text classification using BERT. In this example, we perform sentiment analysis on a dataset of reviews left by users on IMDb. Films are classified as bad (0) or good (1). The dataset is accessible at the following link.

We will be using the following modules:
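A minimal set of imports for this approach could look as follows (a sketch assuming TensorFlow 1.x together with the tensorflow_hub and bert-tensorflow packages):

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub

# bert-tensorflow provides the tokenizer, the feature-conversion helpers
# and the AdamW optimizer used during fine-tuning
from bert import run_classifier
from bert import optimization
from bert import tokenization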

We are reusing the tokenizer and model stored at TensorFlow Hub (a library for reusable machine learning modules). We implement the following function in order to get the tokenizer from the hub:
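A possible implementation is sketched below, based on the standard TF Hub BERT module; the constant BERT_MODEL_HUB (pointing to the uncased base model) is an assumption of this sketch:

BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

def create_tokenizer_from_hub_module():
    """Fetch the vocabulary file and the casing information from the Hub module."""
    with tf.Graph().as_default():
        bert_module = hub.Module(BERT_MODEL_HUB)
        tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
        with tf.Session() as sess:
            vocab_file, do_lower_case = sess.run(
                [tokenization_info["vocab_file"],
                 tokenization_info["do_lower_case"]])
    return tokenization.FullTokenizer(
        vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()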

We use another function in order to adapt the data input to the features expected by the BERT model:
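One possible sketch of this step uses the InputExample and convert_examples_to_features helpers from bert-tensorflow; the function name, MAX_SEQ_LENGTH and LABEL_LIST below are assumptions:

MAX_SEQ_LENGTH = 128   # reviews are truncated or padded to this many WordPiece tokens
LABEL_LIST = [0, 1]    # bad (0) / good (1)

def convert_to_features(df, data_column, label_column, tokenizer):
    """Wrap each row in an InputExample and convert it to the features BERT expects."""
    input_examples = df.apply(
        lambda row: run_classifier.InputExample(
            guid=None,                  # not used in this example
            text_a=row[data_column],    # the review text
            text_b=None,                # no second sentence for classification
            label=row[label_column]),
        axis=1)
    return run_classifier.convert_examples_to_features(
        input_examples, LABEL_LIST, MAX_SEQ_LENGTH, tokenizer)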

We define the model we are going to use:
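A sketch of such a model definition is shown below: the Hub module is loaded as a trainable layer and a single dense classification layer is added on top of the pooled output (the dropout rate and the variable names are choices of this sketch):

def create_model(is_predicting, input_ids, input_mask, segment_ids, labels, num_labels):
    """BERT from TF Hub plus one dense layer for binary classification."""
    bert_module = hub.Module(BERT_MODEL_HUB, trainable=True)
    bert_inputs = dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids)
    bert_outputs = bert_module(inputs=bert_inputs, signature="tokens", as_dict=True)

    # "pooled_output" is the [CLS] representation, used for sentence-level tasks
    output_layer = bert_outputs["pooled_output"]
    hidden_size = output_layer.shape[-1].value

    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())

    with tf.variable_scope("loss"):
        output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
        logits = tf.nn.bias_add(
            tf.matmul(output_layer, output_weights, transpose_b=True), output_bias)
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))

        if is_predicting:
            return (predicted_labels, log_probs)

        # Cross-entropy loss computed from the log-probabilities
        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)
        return (loss, predicted_labels, log_probs)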

We use the following functions to actually build the model and the estimator. We define the metrics and the behavior of the estimator depending on the mode (train, evaluate, predict):
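The sketch below builds on create_model: model_fn_builder returns the model_fn required by the Estimator API, and estimator_builder wraps it together with a RunConfig (the choice of metrics and the function names are assumptions of this sketch):

def model_fn_builder(num_labels, learning_rate, num_train_steps, num_warmup_steps):
    """Return a model_fn describing what the Estimator does in each mode."""
    def model_fn(features, labels, mode, params):
        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        label_ids = features["label_ids"]
        is_predicting = (mode == tf.estimator.ModeKeys.PREDICT)

        if not is_predicting:
            (loss, predicted_labels, log_probs) = create_model(
                is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)
            # AdamW optimizer with warmup, as used in the original BERT code
            train_op = optimization.create_optimizer(
                loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu=False)
            eval_metrics = {
                "accuracy": tf.metrics.accuracy(label_ids, predicted_labels),
                "precision": tf.metrics.precision(label_ids, predicted_labels),
                "recall": tf.metrics.recall(label_ids, predicted_labels),
                "auc": tf.metrics.auc(label_ids, predicted_labels),
            }
            if mode == tf.estimator.ModeKeys.TRAIN:
                return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
            return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metrics)

        # PREDICT mode: return only the predicted labels and the log-probabilities
        (predicted_labels, log_probs) = create_model(
            is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)
        predictions = {"labels": predicted_labels, "probabilities": log_probs}
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    return model_fn

def estimator_builder(model_fn, output_dir, batch_size):
    """Create the Estimator with its run configuration."""
    run_config = tf.estimator.RunConfig(
        model_dir=output_dir,
        save_summary_steps=100,
        save_checkpoints_steps=500)
    return tf.estimator.Estimator(
        model_fn=model_fn,
        config=run_config,
        params={"batch_size": batch_size})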

Finally, we wrap everything up in a function that takes the data and the parameters, adapts the data to the input shape expected by the model, fine-tunes BERT on it and then evaluates it, returning the evaluation metrics we previously defined:
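A sketch of such a wrapper (the name run_on_dfs and its default values are assumptions; the capitalised argument names match the parameter dictionary used further below):

def run_on_dfs(train, test, DATA_COLUMN, LABEL_COLUMN,
               BATCH_SIZE=32, LEARNING_RATE=2e-5,
               NUM_TRAIN_EPOCHS=10, WARMUP_PROPORTION=0.1):
    """Convert both data frames, fine-tune BERT on train and evaluate on test."""
    train_features = convert_to_features(train, DATA_COLUMN, LABEL_COLUMN, tokenizer)
    test_features = convert_to_features(test, DATA_COLUMN, LABEL_COLUMN, tokenizer)

    num_train_steps = int(len(train_features) / BATCH_SIZE * NUM_TRAIN_EPOCHS)
    num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

    model_fn = model_fn_builder(num_labels=len(LABEL_LIST),
                                learning_rate=LEARNING_RATE,
                                num_train_steps=num_train_steps,
                                num_warmup_steps=num_warmup_steps)
    # OUTPUT_DIR is the results directory defined below
    estimator = estimator_builder(model_fn, OUTPUT_DIR, BATCH_SIZE)

    train_input_fn = run_classifier.input_fn_builder(
        features=train_features, seq_length=MAX_SEQ_LENGTH,
        is_training=True, drop_remainder=False)
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)

    test_input_fn = run_classifier.input_fn_builder(
        features=test_features, seq_length=MAX_SEQ_LENGTH,
        is_training=False, drop_remainder=False)
    return estimator.evaluate(input_fn=test_input_fn, steps=None)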

So now we are ready to fine-tune the model to perform the binary classification. We set the directory where the resulting model will be stored:

OUTPUT_DIR = '/results/BERT_Imdb'

The data file we are using is in pickle format. We read the data from it and load it into two pandas data frames (one for training and one for testing):
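A minimal sketch of this step (the file name imdb_reviews.pkl and the structure of its contents are assumptions):

import pickle

with open("imdb_reviews.pkl", "rb") as f:
    data = pickle.load(f)

# We assume the pickle stores both splits as pandas data frames
train = data["train"]
test = data["test"]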

We can take a glance at how this data looks:

train.head()
test.head()

Next, we just need to set the parameters for the model:

  • DATA_COLUMN: Name of the column where the input text is stored.
  • LABEL_COLUMN: Name of the column where the tags are stored.
  • LEARNING_RATE: Value of the learning rate.
  • NUM_TRAIN_EPOCHS: Number of epochs the training process lasts.

We use a dictionary to set them:

myparam = {
    "DATA_COLUMN": "text",
    "LABEL_COLUMN": "sentiment",
    "LEARNING_RATE": 2e-5,
    "NUM_TRAIN_EPOCHS": 10
}

Finally, we are ready to fine-tune the model, training it and testing it on our data:
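With the functions sketched above, this is a single call, unpacking the parameter dictionary:

result = run_on_dfs(train, test, **myparam)

Here result is the dictionary of evaluation metrics returned by estimator.evaluate.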

Before printing the results, we define a small function to get a nicer output:
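A small sketch of such a helper, turning the metrics dictionary into a one-column pandas table:

def pretty_print(result):
    """Display the metrics returned by estimator.evaluate as a small table."""
    df = pd.DataFrame([result]).T
    df.columns = ["values"]
    return df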

Now simply by typing:

pretty_print(result)

We get a table with the results of all the previously defined metrics computed on the test set:


Arturo Sánchez Palacio is a mathematician and data scientist at Serendeepia, especially interested in Artificial Intelligence and the appearances of Mathematics in Art and Nature.