NLP with R part 5: State of the Art in NLP: Transformers & BERT

Published in Cmotions · Nov 13, 2020 · 23 min read

This story is written by Jurriaan Nagelkerke and Wouter van Gils. It is part of our NLP with R series ‘Natural Language Processing for predictive purposes with R’ where we use Topic Modeling, Word Embeddings, Transformers and BERT.

In a sequence of articles we compare different NLP techniques to show you how we get valuable information from unstructured text. About a year ago we gathered reviews on Dutch restaurants. We were wondering whether ‘the wisdom of the crowd’ — reviews from restaurant visitors — could be used to predict which restaurants are most likely to receive a new Michelin-star. Read this post to see how that worked out. We used topic modeling as our primary tool to extract information from the review texts and combined that with predictive modeling techniques to end up with our predictions.

We got a lot of attention with our predictions and also questions about how we did the text analysis part. To answer these questions, we explain our approach in more detail in a series of articles on NLP. We didn’t stop exploring NLP techniques after our publication, and we also like to share insights from adding more novel NLP techniques. More specifically, we use two types of word embeddings (a classic Word2Vec model and a GloVe embedding model), we use transfer learning with pretrained word embeddings and we use transformers like BERT. We compare the added value of these advanced NLP techniques to our baseline topic model on the same dataset. By showing what we did and how we did it, we hope to guide others that are keen to use textual data for their own data science endeavours.

In our earlier articles we extracted topic models and word embeddings from our review texts. We showed how well these topic models predicted Michelin reviews. And finally, we showed how these predictions could be improved substantially by using the word embeddings to predict Michelin reviews.

In this article we introduce Transformers and show you how state of the art NLP techniques like BERT (the Dutch versions BERTje and RobBERT and the multilingual distilBERT) can be used to transform our review texts into context-dependent representations and how to use those in a downstream prediction task.

Transformers — what’s so special?

In our previous blogs, we introduced a number of word embedding techniques. We showed how word embeddings translate words into numeric vectors that represent the meaning of words. Some time ago, word embeddings were the state of the art in NLP, but then came Transformers and some say this has revolutionized how data scientists work with textual data. What are transformers and why are they so special?

The importance of the sequence & context in texts

To really understand a sentence or piece of text, it’s essential to look at the words in the text in relation to the surrounding words. Text is sequential data and for many use cases this ordering is important. It’s not always crucial though! Remember our Topic Model to derive topics in the restaurant reviews? Here we fully neglected the order and treated all reviews as ‘bags of words’. This resulted in topics that are easily understood and the distribution over topics proved to be quite powerful in predicting Michelin vs non-Michelin reviews. Also, our word embeddings didn’t take the exact order of words into account when predicting Michelin reviews even better than we could with topic scores. However, in more complex tasks — for instance when you want to translate a review, reply to the review or generate other meaningful, human-like texts — the sequence does matter a great deal. And possibly, our ‘simple’ prediction task also improves when we use a model with deeper ‘understanding’ of the review than we did so far with our topic model and word embeddings! We’ll see…

Before BERT: Transformers

In the field of neural networks, dramatic improvements have been made in the past decade in dealing with sequential data: time series, audio and video, and textual data, to name a few. Within textual data, most progress has been made in developing highly accurate sequence-to-sequence models, used in domains such as speech recognition, machine translation, text summarization and question answering. Recurrent and convolutional neural network architectures and later variants (LSTM, GRU) pushed performance in several NLP tasks, but at growing computational cost and training time. Recurrent models process a sequence step by step, which makes them hard to parallelize and distribute over multiple cores. Transformers have set a new standard in two ways: they improve on these recurrent configurations by dealing better with long-range (more distant) dependencies between words, and they speed up training dramatically by allowing much more parallelization. For more details on Transformers: the seminal paper introducing them is ‘Attention Is All You Need’, mainly developed at Google Research. Although these Transformer models already showed groundbreaking improvements on many NLP tasks compared to their predecessors, the variant BERT raised the bar even further. For all the details on BERT, read the original paper or this nice blog about BERT.

After BERT: GPT-2, XLNet, GPT-3 …

Developments in NLP haven’t stopped after BERT: GPT-2, XLNet and more recently GPT-3 were introduced with stunning performance on NLP tasks. In this blog we focus on BERT for a number of reasons. First of all, much of the reported performance gain is in the domain of text generation (completing texts, question answering, conversational contexts), where the focus is on how ‘human’ the generated texts are. We want to emphasize a context where you can use NLP techniques in other downstream tasks like predictive modeling. Secondly, these improvements come at a price: parameters to be trained! GPT-2 has 1.5 billion parameters and GPT-3 tops this with 175 billion parameters to train! Our BERT model has ‘only’ 109 million parameters, which already comes with quite a carbon footprint. Finally, if you do want to use these later models, this is only a small change to the approach we take in our blog. You just take another model from the Hugging Face model repo we will introduce later on and you’re ready to go. You might need some extra GPU/TPU to run these, though…

Who’s BERT?

So who’s BERT? The name BERT is more than a fun reference to ELMo, the deep contextualized word representations that are one of the main predecessors of BERT. In fact it’s an acronym that stands for Bidirectional Encoder Representations from Transformers. As its name suggests, it’s still a Transformer model, but one that is trained looking in both directions (left-to-right and right-to-left) when processing a text sequence, whereas prior techniques had a unidirectional approach (left-to-right or right-to-left). BERT results in high-value pretrained text representations that can be finetuned for the NLP task at hand. Reading the papers on BERT makes clear how BERT works, but it’s not that obvious why it performs so well. This paper does a nice job in summarizing what is known on BERT so far and provides a better, deeper understanding of BERT without drowning the reader in too many technical details (still quite some, though).

How BERT is trained and used

The training of BERT is done in two ways: First, random words in the sentences from the training data are masked and the model needs to predict these words from its context. Secondly, half of all subsequent sentences in the training data are swapped and the model has to figure out which are in the right order and which are swapped. In later variants of BERT like RoBERTa and DistilBERT these two pretraining tasks are somewhat altered and performance is further improved. For all its variants, training results in the pretrained model that can be downloaded and used to finetune for another NLP task. And that’s what we’ll do in this article.
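The masked-word task can be illustrated with a minimal sketch in plain Python. It is a simplification under stated assumptions: the function name `mask_tokens` and the example sentence are hypothetical, the text is split on whitespace instead of being run through a real tokenizer, and every selected token is simply replaced by [MASK] (the original BERT recipe selects about 15% of the tokens and leaves a small share of those unchanged or swaps them for a random token).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random sample of tokens; return the masked sequence and the targets.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model would only be scored on the positions it had to guess.
    """
    rng = random.Random(seed)  # fixed seed to keep the example reproducible
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

# Hypothetical example sentence, split on whitespace for simplicity
tokens = "het eten was zeer goed en de bediening was vriendelijk".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

During pretraining the model sees only `masked` and has to reconstruct the entries of `labels` from the surrounding context on both sides.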

Until Transformer models appeared, training and using NLP models was only possible for the happy few that had access to huge computational resources, relevant data and budgets. Transformer models are pretrained on massive datasets and are made available for downloading and finetuning, and finetuning requires only a fraction of the resources needed for pretraining. Therefore, many are able to (re)use the greatness of models trained on huge datasets at enormous cost in their own NLP tasks, with minimal investment in the resources needed to finetune these models.

The main advantages of BERT are that it is a general-purpose model that can handle input texts of varying length (up to the model’s maximum of 512 tokens), is already pretrained and is available to everyone. Previous NLP models usually were trained for the specific task at hand. BERT, however, can be used for various NLP tasks; it only needs to be finetuned for the specific task. This saves the NLP practitioner loads of time and money on the GPUs and data collection that would be needed to train the full Transformer model.

BERT for other languages

Although machine translation is an important use case for many NLP models, most emphasis in developing state of the art NLP techniques is put on English texts. For our analysis, we are interested in Dutch restaurant reviews and need a language model that performs well on Dutch words. Fortunately, BERT comes with a multilingual variant, trained on numerous languages. And there are some custom BERT variants, such as BERTje and RobBERT, that are further optimized for Dutch. We’ll have a look at these alternatives when using BERT to predict Michelin stars with our restaurant reviews.

Before we start: Let’s have a bit of python with our R!

So far, we’ve used R and R only in our NLP endeavors to translate review texts into value. We did so for a reason. Despite python’s popularity, and although we also use python a lot in our day to day Data Science tasks, we still believe R has much to offer for data scientists, also in the field of text analytics and deep learning, and we know there’s a great community of R users out there, looking for interesting use cases in R. We also see that more options become available to combine the greatness of both R and python; reticulate is one of them. This package enables you to use python code in R almost seamlessly.

We use the reticulate package here, since currently one of the easiest ways to use Transformer models including BERT is to use the python library transformers. This great piece of work developed by Hugging Face 🤗 provides many pretrained models, datasets, APIs, tutorials and much more. We will use it here to get access to pretrained BERT models for the Dutch language and to finetune those to build our prediction models.

We are going to be using a large pretrained model and need all the GPU power we can get in this notebook. In Azure Databricks we’ve set up two ‘NVIDIA Tesla K80’ GPUs that have a memory size of 11GB and a bandwidth of 223GB/s. Below we make sure that GPU input is managed well and we utilize both GPUs.

We’re ready to start using BERT and prepare it for finetuning on our prediction task: predicting Michelin reviews, like we did with our topic modeling results and our word embedding results. As you see below, when you load a python module in R with reticulate, it works like any other object in R: you can access the contents of the module, mainly its methods and functions, using the $ sign, as you would to get something from any other R object. Here, we start by loading the module using the reticulate import function and we use its methods to download the model from the transformers package. We start with BERTje, a model built for the Dutch language, developed by Wietse de Vries. BERTje was trained on high-quality Dutch text that includes books, news corpora, news web pages, Wikipedia and SoNaR-500, a 500-million-word reference corpus of contemporary written Dutch.

After loading BERT — or BERTje which has the same configuration — we can explore how BERT works. The key elements we need are the tokenizer and the pretrained model. The tokenizer that comes with the model will be used to look up the Dutch words in our restaurant reviews and map them to the token ids needed by BERT. Aside from these token ids, the tokenizer also adds some generic tokens to the text sequences:

[UNK] is used for tokens in the text sequence that are not in the vocabulary of 30,000 tokens used for training
[CLS] is a special token used to determine the start of each sequence
[SEP] is a separator token used to separate parts within a sequence (sentences, question/answers)
[PAD] is the token used to fill sequences that are shorter than the specified sequence length used in the model
[MASK] is the token used in pretraining, when a sample of tokens is masked to train the Masked LM model

This visualization from the original paper shows how BERT tokenizes the textual data for pretraining:

The [MASK] and [SEP] tokens are essential for pretraining BERT. [SEP] tells BERT which part of the provided text is the first sentence and which is the second. Remember that training of BERT is done in two ways: by predicting the words that were randomly masked (the [MASK] tokens that are added before pretraining) and by figuring out what the right order of the sentences within the text is. Since we will not redo the pretraining of the model, we do not need to split our texts into sentences divided by the [SEP] token and we don’t need to add any [MASK] tokens either. We can get the tokenizer we need (each model has its own way to tokenize the text to fit the model structure) from the Transformers package. Thereafter, we can load our own textual data and apply the tokenizer to it.

$`[UNK]`
[1] 0

$`[CLS]`
[1] 1

$`[SEP]`
[1] 2

$`[PAD]`
[1] 3

$`[MASK]`
[1] 4

$bekommerd
[1] 9000

$bekoorlijk
[1] 9001

$bekostig
[1] 9002

$bekrachtigd
[1] 9003

$bekritiseerd
[1] 9004

$bekroning
[1] 9005

Load preprocessed data

Let’s load the restaurant review data we’ve prepared in an earlier blog and while we’re at it, let’s also load the labels and the same ids we want to use for training and testing models in all our NLP blogs.

reviews.csv: a csv file with review texts — the fuel for our NLP analyses. (included key: restoreviewid, hence the unique identifier for a review)
labels.csv: a csv file with 1 / 0 values, indicating whether the review is a review for a Michelin restaurant or not (included key: restoreviewid)
trainids.csv: a csv file with 1 / 0 values, indicating whether the review should be used for training or testing — we already split the reviews in train/test to enable reuse of the same samples for fair comparisons between techniques (included key: restoreviewid)

Tokenize reviews

Now that we have our textual data, we can have a look at the tokenizer at work. Let’s tokenize an example from our review text to see what happens. In previous NLP blogs we used the reviewTextClean column of our dataset. This text was completely cleaned of punctuation, stopwords, abbreviations etcetera. Here we follow a different approach, as word order matters. So we keep as many words as possible and only remove punctuation we do not need.

# original text of sample review:
[1] "Zeer goed eten op een ruime locatie. Vrij parkeren voor de deur was ideaal. Zijn bekend met de Indiase keuken sinds 1985. Kwaliteit is zeer goed en hoeveelheid klopt. Jonge dame heeft ons goed ontvangen en op haar advies het buffet gezien. Geen spijt van gehad want het was super. Lekkere Indiase wijn erbij. Inrichting is strak en fris. Prima aanrader."

# tokenized sample review:
 [1]  1 7769 25138 12780 11380 16804 11130 18463 15130 13 7466 17277
[13] 21877 10537 10642 22250 13565 13 7798 8971 15557 10537 3577 117
[25] 14212 18935 363 13 4171 26895 117   121   132 13903 22679 12780
[37] 11281 13340 14378 13 3787 10511 13117 16563 12780 16671 11281 
[49] 12989 8201 13261 9952 25258 132 12640 13 2838 19278 20722 12045
[61] 22231 13261 22250 19774 13 4295 117 3577  117 22468 11320  13
[73]  3570 28145 13903 19646 11281 11741 13 5739 7862 27990 25138 13
[85]     2

We can see that the tokenizer returns a list with the text translated into token indices, starting with the id of the CLS token (1) and finishing with the id of the SEP token (2). Notice that we only see one SEP token here and that the full stops at the end of sentences (.) are represented by token id 13, which is treated the same as any other normal token. During pretraining, texts needed to be split into sentences using this SEP token to predict sentence order. For finetuning, we don't have to train this task, so there is no need to specify segments within the text sequences. If you would like to fully retrain BERT including the next sentence prediction task, you would need to provide the tokenizer with an input where sentences are separate list items. In that case, you can see the SEP token halfway through the output:

[1]  1 7769 25138 12780 11380 16804 11130 18463 15130 13 7466 17277
[13] 21877 10537 10642 22250 13565 13 2  7798  8971 15557 10537 3577
[25] 117 14212 18935 363 13 2439 22264 22777 21441 22828 19883  9314
[37] 13 2
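How such a sentence pair is laid out can be sketched in a few lines of plain Python. This is an illustration, not the tokenizer's actual code: the helper `encode_pair` is hypothetical, the ids for [CLS] (1) and [SEP] (2) are the ones from the vocabulary listing earlier in this article, and the two short id lists are made up for the example. The token type ids follow the standard BERT convention of 0 for the first segment and 1 for the second.

```python
CLS_ID, SEP_ID = 1, 2  # ids taken from the BERTje vocabulary listing above

def encode_pair(ids_a, ids_b):
    """Lay out two tokenized sentences as [CLS] A [SEP] B [SEP].

    token_type_ids mark which segment each position belongs to:
    0 for [CLS], sentence A and the first [SEP]; 1 for sentence B and the last [SEP].
    """
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

# Made-up token ids for two short sentences, each ending in the full stop (id 13)
sentence_a = [7769, 25138, 13]
sentence_b = [7798, 8971, 13]
input_ids, token_type_ids = encode_pair(sentence_a, sentence_b)
print(input_ids)
print(token_type_ids)
```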

What’s good to know is that improvements on BERT (specifically RoBERTa) have shown that the next sentence prediction (NSP) task is actually not needed. Therefore, later BERT variants focus on the Masked Language Model (MLM) task in pretraining.

What’s also good to know is that BERT doesn’t only look up words in the token dictionary but also splits complex (in fact unknown) words into subwords: BERT uses a WordPiece tokenization strategy. If a word is out-of-vocabulary (OOV), then BERT will break it down into word pieces it does know. For instance the word ‘aspergesoep’ (EN: Asparagus soup) might not be in the 30K vocabulary, but its parts ‘asperge’ and ‘##soep’ are. And as a result, sequences are likely to become somewhat longer after tokenizing than the original sequence, also because BERT adds some extra functional tokens like the beginning of a sentence [CLS], a separator between sentences [SEP] or a padding token [PAD]. Good to keep this in mind when we have to specify the max length of input texts for BERT!

# tokenize a complex word: aspergesoep  
[1]     1  8600 28568     2

# lookup the token ids in the dictionary:
[1] "[CLS]"   "asperge" "##soep"  "[SEP]"  

# decode word back to original:
[1] "[CLS] aspergesoep [SEP]"
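The splitting of an unknown word into word pieces can be mimicked with a small greedy longest-match-first routine. The sketch below is plain Python with a tiny hypothetical vocabulary; the function name `wordpiece` is our own and the real tokenizer adds more steps (normalization, punctuation splitting, a maximum word length), so treat it as an illustration of the idea only.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces.

    Pieces that do not start the word get the '##' prefix. If some part of
    the word cannot be matched at all, the whole word becomes [UNK].
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Tiny hypothetical vocabulary, just enough for the examples in this article
vocab = {"asperge", "##soep", "soep", "ruil", "##en", "vest"}
print(wordpiece("aspergesoep", vocab))  # ['asperge', '##soep']
print(wordpiece("ruilen", vocab))       # ['ruil', '##en']
print(wordpiece("xyz", vocab))          # ['[UNK]']
```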

Below is an example of how a sentence is encoded into ids and how it can be decoded back to its original form.

[1] "[CLS]" "ik"    "wil"   "dit"   "vest"  "graag" "ruil"  "##en"  ","    
[10] "het"   "is"    "te"    "klein" "[SEP]"
[1] "[CLS] ik wil dit vest graag ruilen, het is te klein [SEP]"

Before we can finetune BERTje with our restaurant review texts in the task to predict which of those are reviews for Michelin restaurants and which are reviews for non-Michelin restaurants, we need to tokenize our textual data. We split our files into train and test datasets with the same mapping as we did earlier (identical IDs).

Next we apply the BERTje tokenizer to the reviewText field of all reviews. We need to specify the maximum review length here, since the tokenizer will cut off reviews that are too long and add padding to shorter ones to make all reviews the same length. In our earlier NLP blogs, we used a maximum review length of 150 tokens. Remember that BERT adds some special tokens to the sequence and splits complex words into multiple tokens. Also, we did not exclude stopwords from the corpus, so the length of reviews increased. We therefore increase the max_length for BERT to 250 tokens.
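What cutting off and padding to a fixed length amounts to can be shown with a minimal plain-Python sketch. The helper `encode` is hypothetical (it is not the tokenizer's API), the special token ids are the ones from the vocabulary listing earlier in this article, and we use a maximum length of 8 instead of 250 to keep the output readable.

```python
CLS_ID, SEP_ID, PAD_ID = 1, 2, 3  # ids from the BERTje vocabulary listing above

def encode(token_ids, max_length):
    """Truncate or pad a tokenized text (without special tokens) to max_length.

    Two positions are reserved for [CLS] and [SEP]. The attention mask is 1
    for real tokens and 0 for padding, so the model can ignore the padding.
    """
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": [0] * max_length,  # a single segment per review
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }

short = encode([7769, 25138, 13], max_length=8)       # gets padded
long = encode(list(range(100, 120)), max_length=8)    # gets truncated
print(short["input_ids"], short["attention_mask"])
print(long["input_ids"], long["attention_mask"])
```

The three keys mirror the inputs a BERT model expects: token ids, token type ids and the attention mask.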

Use BERT model without finetuning

In a minute, we will finetune BERTje for our classification task. But first, let’s show why finetuning is a good idea. You might wonder: if BERT is so great at translating full texts of different lengths into embeddings that can be used for different downstream tasks, why not just get those embeddings and use them for predictions? Not a bad idea, since it saves you the expensive finetuning. Yes, finetuning is much faster and cheaper than training a BERT model from scratch, but it still takes some GPUs and computation hours. So, before we finetune BERTje for our predictions, we just run our texts through the pretrained model and extract the embeddings for the full text, that is, the embeddings of the CLS token:

Now, we have transformed all our different-length review texts into same-length (768) numeric vectors we can use in any model we want. Next we set up a simple Keras sequential model taking these CLS embeddings as our only input. The input shape is the batch size x 768 columns from the last BERT layer. We add some hidden layers for training.

Model: "sequential_1"
____________________________________________________________________
Layer (type)                        Output Shape                    Param #     
====================================================================
dense_12 (Dense)                    (None, 100)                     76900       
____________________________________________________________________
dense_13 (Dense)                    (None, 40)                      4040        
____________________________________________________________________
dropout_78 (Dropout)                (None, 40)                      0           
____________________________________________________________________
dense_14 (Dense)                    (None, 1)                       41          
====================================================================
Total params: 80,981
Trainable params: 80,981
Non-trainable params: 0
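The parameter counts in this summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A minimal sketch in plain Python, where the helper name `dense_params` is our own and the layer sizes are read from the summary above (768 CLS features going into 100, 40 and 1 units):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs x n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units

layer_sizes = [768, 100, 40, 1]  # input size followed by the units per dense layer
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))  # [76900, 4040, 41] 80981
```

The dropout layer has no parameters of its own, which is why it shows 0 in the summary.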

loss: 0.4170 - acc: 0.9057 - auc: 0.9712 - val_loss: 0.3796 - val_acc: 0.8656 - val_auc: 0.8818

predicted
actual     0     1
     0 36950  5539
     1   345   956

Accuracy: 87% of Michelin/non-Michelin review predictions are correct
Precision: 15% of predicted Michelin reviews are real Michelin reviews
Recall: 73% of all actual Michelin reviews are predicted as such
F1 score: 0.25 is the harmonic mean of Precision and Recall
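These statistics follow directly from the four cells of the confusion matrix. As a check, here is a minimal plain-Python sketch; the function name `classification_metrics` is our own, the counts are copied from the matrix above (rows are actual classes, columns are predicted classes), and F1 is computed as the harmonic mean of precision and recall.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive the usual summary statistics from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # share of predicted Michelin reviews that are real ones
    recall = tp / (tp + fn)      # share of real Michelin reviews that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts from the confusion matrix of the BERTje CLS model without finetuning
metrics = classification_metrics(tn=36950, fp=5539, fn=345, tp=956)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With only about 3% Michelin reviews in the test set, accuracy alone says little; precision and recall show how the model trades false alarms against missed Michelin reviews.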

The model reaches an AUC of .88, certainly not better than the word embedding models we used in our previous article. To put things in perspective, the plot below shows the performance of the best word embedding model, the random forest model using topic modeling and the BERTje model without additional training. In our earlier NLP posts we introduced modelplotr, a package that can display insightful plots for multiple models at once. These plots are all based on the predicted probability distribution instead of the ‘hard’ prediction based on a cutoff value. Let’s explore how well we can predict Michelin reviews with the models built with BERTje compared to the best Word Embedding model and to the Random Forest model using Topic Modeling. The code below will generate the input needed for modelplotr and plot results. The graph is very clear: the BERTje CLS model is better than the topic modeling model, but it still has a lot of ground to gain before it can push the word embedding model off the stage.

The general CLS model used here simply uses the output of the 12th layer, pooled into a single embedding for the full text, to perform the classification. But you might also argue that each layer captures different characteristics of the text. So instead of using the output from the final layer, other choices can be made, like summing the last four hidden layers. We will not explore these other options in this article, but do know that they exist. A nice example can be found here.

Specify BERT model for finetuning

We move on in our endeavor by finetuning a BERT model for our classification task. We already prepared our training data in the previous steps, so it’s time to configure our pretrained BERTje model for finetuning. Let us take a closer look at a simplified picture of how the BERT model works. This image was taken from the introduction blog on BERT by Chris McCormick. The BERT model has 12 Transformer layers. The vector representations of all tokens are transformed in each layer, and the output token embeddings come out of the 12th layer. Another great graphical breakdown of BERT can be found here.

From the Hugging Face 🤗 library you can download various pretrained model setups. You can download a general model and add layers for your downstream task yourself, or a masking model if you want to perform word or sentence prediction. Since we have a specific, supervised task that uses the full text as input (predict Michelin reviews using the review texts), we need a classification model. Hugging Face provides a ‘ForSequenceClassification’ model setup ready for such a prediction task; all we need to do is download the model and specify the training arguments. We follow this approach from the Hugging Face documentation and adapt it to R and our own data.

When you download a model from Hugging Face 🤗 you can always look at the configuration the model was built and trained with. The BERTje model is a classic BERT model with 12 hidden layers, 768 units per layer and 12 attention heads. The vocabulary size is 30,000 and it has an impressive number of parameters we can finetune:

BertConfig {
  "_name_or_path": "wietsedv/bert-base-dutch-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 3,
  "type_vocab_size": 2,
  "vocab_size": 30000
}

[1] 109082882
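This number can be reconstructed from the configuration shown above. The plain-Python sketch below counts weights and biases per building block. It assumes the standard BERT layout (embeddings, 12 identical encoder layers, a pooler) with a classification head for two output classes on top; the function name `bert_classifier_params` is our own.

```python
def bert_classifier_params(vocab_size, hidden, layers, intermediate,
                           max_positions, type_vocab, num_labels):
    """Count the trainable parameters of a BERT sequence classifier."""
    layer_norm = 2 * hidden  # scale and shift per hidden unit
    # word, position and token type embeddings, followed by a layer norm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # query, key, value and output projections (weights + biases), plus layer norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # two dense layers (hidden -> intermediate -> hidden), plus layer norm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier

# Values taken from the BertConfig above; num_labels=2 is our assumption
# (Michelin review versus non-Michelin review)
total = bert_classifier_params(vocab_size=30000, hidden=768, layers=12,
                               intermediate=3072, max_positions=512,
                               type_vocab=2, num_labels=2)
print(total)  # 109082882
```

About 85 million of these parameters sit in the 12 encoder layers and about 23 million in the word embeddings; the classification head itself adds only 1,538.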

A few steps back we tokenized our texts. When using Hugging Face’s transformers module to finetune the Tensorflow BERTje model, the input has to be a tuple of a dictionary (containing the token ids, the token type ids and the attention mask) and the labels. Below we prepare the Tensorflow input.

<TensorSliceDataset shapes: ({input_ids: (250,), token_type_ids: (250,), attention_mask: (250,)}, ()), types: ({input_ids: tf.int32, token_type_ids: tf.int32, attention_mask: tf.int32}, tf.int32)>

To finetune the BERTje model for our prediction task, we can create a trainer and train our BERTje model. Happily, pretraining is already done on an enormous corpus of Dutch text. The finetuning will alter parameters slightly to better match our task: distinguishing Michelin reviews from non-Michelin reviews. Since the model still has an impressive 109 million parameters to tweak, it still takes an hour or two on our 2-GPU cluster for just a single epoch.

Let’s have a look at how good this finetuned model is in predicting Michelin reviews. Like we did in the previous blogs, we review some metrics (AUC and confusion matrix related stats) and a few plots. First we need to get predictions for our test data, which was unseen during training and is the same test data we used in our previous blogs. We use the trainer we just finetuned for our task to predict on the test data.

PredictionOutput(predictions=array([[ 2.4708478, -1.6745203],
       [ 3.7766562, -2.4462974],
       [ 4.209761 , -3.2831826],
       ...,
       [ 4.7139106, -3.418987 ],
       [ 3.093305 , -1.7391992],
       [ 3.80582  , -2.9154272]], dtype=float32), label_ids=array([0, 0, 0, ..., 0, 0, 0], dtype=int32), metrics={'eval_loss': 0.05102982077487679})

The predictions object is a list with the predicted logits per output class (in our case: Michelin review/non-Michelin review) per test review and the actual label per review (1=Michelin review, 0=non-Michelin review). The transformer module does not provide us with class probabilities but returns raw model outputs. For instance, the output for one review looks like this: [3.121, -1.102]. Therefore, we first apply a softmax transformation to each prediction to get class probabilities between 0% and 100% that sum to 100% for each review, turning that prediction into [0.986, 0.014].
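The softmax transformation itself is only a few lines. Here is a minimal plain-Python sketch, applied to the example logits mentioned in the text; subtracting the maximum before exponentiating is a common trick for numerical stability and does not change the result.

```python
import math

def softmax(logits):
    """Turn raw model outputs into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits for one review: [non-Michelin, Michelin]
probs = softmax([3.121, -1.102])
print([round(p, 3) for p in probs])  # [0.986, 0.014]
```

The second value is the predicted probability that the review is a Michelin review, which is the score used for the plots and the confusion matrix.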

Accuracy: 98% of Michelin/non-Michelin review predictions are correct
Precision: 78% of predicted Michelin reviews are real Michelin reviews
Recall: 56% of all actual Michelin reviews are predicted as such
F1 score: 0.65 is the harmonic mean of Precision and Recall

Let’s put things in perspective and compare prediction results from the models we have up until now. The statistics from the confusion matrix are the best we’ve seen so far! Let’s also compare the models visually using modelplotr. The code below will generate the input needed for modelplotr, combine scores from previous models and plot results.

That is quite an impressive cumulative gains chart! The finetuned BERTje model outperforms our best word embedding model. In the top 5% of all cases (ranked by predicted probability), the finetuned BERTje model retrieves nearly 81% of all reviews related to a Michelin star restaurant. Our best performing word embedding model detected 68% of those reviews in the top 5% of all cases. Maybe we can do even better, so let’s look at a competing transformer model to see if we can further improve performance.
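The idea behind a cumulative gains number is simple: rank all cases by predicted probability, take the top x% and count which share of all actual positives ends up in that selection. The plain-Python sketch below illustrates this on a made-up toy example; the function name `cumulative_gain` and the data are hypothetical, and modelplotr itself works with ntiles and offers much more than this.

```python
def cumulative_gain(scores, labels, fraction):
    """Share of all positives found in the top `fraction` of cases ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    found = sum(label for _, label in ranked[:n_top])
    return found / sum(labels)

# Toy example: 10 reviews with a predicted Michelin probability and the actual label
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1]
print(cumulative_gain(scores, labels, 0.2))  # 0.5: top 2 cases hold 2 of the 4 positives
print(cumulative_gain(scores, labels, 0.5))  # 0.75: top 5 cases hold 3 of the 4 positives
```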

Use a competing transformer model: RobBERT

Let’s look at a competing transformer model built for the Dutch language. Researchers from the Paul G. Allen School of Computer Science & Engineering, University of Washington, and Facebook AI found that the original BERT model was undertrained. To overcome this, they trained a new model called RoBERTa. This new model was trained much longer, used bigger data batches, has no next sentence prediction objective, is trained on longer sequences and has a dynamic masking pattern. The Dutch variant of RoBERTa is called RobBERT and was developed by researchers at KU Leuven and TU Berlin. The model was trained on a Dutch corpus of 39GB with 6.6 billion words over 126 million lines of text. As a reference, the BERTje model was trained on only a 12GB corpus.

Below we download the model from the transformers package and set up the RobBERT tokenizer to get our text ready for input.

We use the same training arguments as for the BERTje model, fit the model, extract performance metrics and plot results to compare.

predicted
actual     0     1
     0 42772   215
     1   633   668

Accuracy: 98% of Michelin/non-Michelin review predictions are correct
Precision: 76% of predicted Michelin reviews are real Michelin reviews
Recall: 51% of all actual Michelin reviews are predicted as such
F1 score: 0.61 is the harmonic mean of Precision and Recall

Looking at the values in the confusion matrix, the results for the RobBERT model look very much like the values we saw for the finetuned BERTje model. In the graph below, made with modelplotr, we can even see that the BERTje model outperforms the RobBERT model by a few percentage points. At the fifth ntile (the top 5% of cases), the RobBERT model has detected 76% of Michelin star restaurant reviews against 81% for the BERTje model.

There’s not much difference in the predictions between BERTje and RobBERT in terms of the confusion matrix statistics and the modelplotr plots. Both finetuned models do an excellent job in predicting Michelin reviews.

Another option: multilingual distilBERT

We noted before that there are also some multilingual variants of BERT. These multilingual versions are not specifically trained on one language like BERTje and RobBERT are but are trained on a corpus of documents for 104 different languages. One of those is distilBERT, a distilled version of BERT that is smaller, faster, cheaper and lighter than BERT without losing significant performance. We don’t show the code here since this is very much the same as for BERTje and RobBERT but do show the performance below. We used the transformer model distilbert-base-multilingual-cased.

You can see it in the graph, and the results below speak for themselves: the Dutch BERTje model is the overall best performing NLP model for our downstream task, which is to extract knowledge from restaurant review texts to predict which of those reviews are written for Michelin restaurants and which for non-Michelin restaurants.

Wrapping it up

In this article we used transformer models to predict which restaurant is more likely to receive a next Michelin star. We showed that, when finetuned, these transformer models do an excellent job to retrieve important information from the review texts. There’s not much difference between a number of variants but all of them outperform our predictions using custom topic models and word embeddings. Getting these models up and running does come at a price. Using a Transformer model, as we did in this article, required quite some (expensive) GPU time, whereas fitting the word embedding models in Keras was perfectly doable on a midsized laptop. So think twice before you start: how accurate do my predictions need to be? And is the trade-off between hardware investment and a better accuracy positive in a production environment?

Another note is that you might question how well the original BERT pretraining tasks are suited for the classification task we are performing. From research it is known that a model trained for a specific task often performs better than a general model trained on a large amount of generic data. One might also question whether the data we have gathered (restaurant reviews) is generic enough ;) How often are the menu, the taste of a dish or the waiter discussed in Wikipedia text? An option would be not just to finetune an existing transformer model but to build your own transformer model using the BERT architecture, the same way we did for our word embeddings. That would require a substantial number of reviews for training.

Here we’ve shown you how to use transformer models for text classification using a bit of Python (importing models from Hugging Face) and building a model in R using Keras. Hopefully you are as enthusiastic as we are about NLP, and we hope that these blogs help you to extract valuable information from text for the task at hand.

This article is part of our NLP with R series. An overview of all articles within the series can be found here.

Previous in this series: Using Word Embedding Models for Prediction Purposes

Do you want to do this yourself? Please feel free to download the Databricks Notebook or the R-script from our GitLab page.