Language-Agnostic Text Classification With LaBSE

Building a language-agnostic classifier using sentence embeddings

Antti Havanko
The Startup
5 min read · Oct 4, 2020


The folks at Google published a multilingual BERT embedding model called LaBSE (Language-agnostic BERT Sentence Embedding). It produces language-agnostic sentence embeddings for more than 100 languages in a single model. The model is trained to generate similar embeddings for bilingual sentence pairs that are translations of each other.

I wanted to build a language-agnostic text classifier that can detect texts about soccer in any language. The plan was to train it on English texts and then use it to classify texts in other languages. Let’s see how to do it.

Training data

My training data consisted of ~550k short texts, half of which were related to soccer. Here are some examples:

  • We played well at times this time and made it very difficult for them and it’s not easy to do that because they have the fans and they are a very good football side.”. A 26-year-old Liverpool supporter was released from hospital in Naples on Wednesday after being attacked by rival fans before the game.
  • The conjunction of the passing of McNeill and Chalmers has also been attended by an eerie sense of other-dimensional intervention. On Saturday, after a poignant celebration of McNeill’s career, the game against Kilmarnock was settled by a goal scored after 67 minutes of play by a centre-half wearing McNeill’s No 5.
  • A fee of €180 million (166.4m) has been fixed for the 18-year-old striker who has set his heart on joining PSG , the club he supported as he grew up in Paris. Mbappe is refusing to sign a new deal with Monaco who had hoped to keep him for one more year.

Loading LaBSE

LaBSE is available on TensorFlow Hub, which makes the integration with Keras easy. You can use a TensorFlow Hub layer in your model and simply point it to the pre-trained SavedModel (https://tfhub.dev/google/LaBSE/1).

The inputs are the same as in standard BERT: the tokenized text, the input mask specifying how the text was padded, and the segment IDs, which are always zeros.
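As a rough sketch, loading the layer and declaring the three inputs could look like the following; it is modelled on the usage example from the model page, with the 256-token length the article uses later:

```python
import tensorflow as tf
import tensorflow_hub as hub

max_seq_length = 256  # maximum input length, as used later in the article

# The three standard BERT inputs: token IDs, padding mask and segment IDs.
input_word_ids = tf.keras.layers.Input(
    shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(
    shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input(
    shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")

# Pre-trained LaBSE from TensorFlow Hub, kept frozen (see the next section).
labse_layer = hub.KerasLayer("https://tfhub.dev/google/LaBSE/1", trainable=False)
pooled_output, sequence_output = labse_layer([input_word_ids, input_mask, segment_ids])
```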

Rest of the model

As I trained my model on only a relatively small amount of English data, I didn’t want to touch the learned weights of LaBSE. You can do this by setting the trainable flag to False for the LaBSE layer, which freezes its weights during training. I just added a couple of fully connected layers, which use the raw embeddings as input, and fine-tuned these layers for this specific task.

The final model is shown in the table below. The model looks simple because the KerasLayer hides the complexity of LaBSE.

There are almost half a billion parameters, thanks to LaBSE, but only about 0.1% of them are adjusted during training. The input dimension is 256, which is the maximum length of the input text; shorter texts are padded to this length.
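Continuing the sketch above, the classification head could look something like this. The article doesn’t state the exact layer sizes, so treat them as placeholders:

```python
# L2-normalize the pooled sentence embedding, as recommended on the LaBSE model page.
embedding = tf.keras.layers.Lambda(
    lambda x: tf.nn.l2_normalize(x, axis=1))(pooled_output)

# A couple of trainable fully connected layers on top of the frozen embeddings.
hidden = tf.keras.layers.Dense(256, activation="relu")(embedding)
hidden = tf.keras.layers.Dense(64, activation="relu")(hidden)
output = tf.keras.layers.Dense(1, activation="sigmoid", name="is_soccer")(hidden)

model = tf.keras.Model(
    inputs=[input_word_ids, input_mask, segment_ids], outputs=output)
model.summary()  # almost half a billion parameters, only the small head is trainable
```

With the LaBSE layer frozen, only the head’s weights are updated, which is where the ~0.1% figure above comes from.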

Data preprocessing

You first need to build a BERT tokenizer that converts the input text into tokens the model understands. The create_bert_input function converts the text into token IDs using the tokenizer and then either truncates or pads it, depending on its length.

You can find the reference code for this on the TensorFlow Hub model page: https://tfhub.dev/google/LaBSE/1
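Here is a sketch along those lines, assuming the bert-for-tf2 package for the WordPiece tokenizer; the function name create_bert_input comes from the article, while the body follows the Hub usage example:

```python
import numpy as np
from bert import bert_tokenization

# Build the BERT tokenizer from the vocabulary shipped with the LaBSE SavedModel.
vocab_file = labse_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = labse_layer.resolved_object.do_lower_case.numpy()
tokenizer = bert_tokenization.FullTokenizer(vocab_file, do_lower_case)

def create_bert_input(text, tokenizer, max_seq_length=256):
    # Tokenize and add the special BERT markers.
    tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)[:max_seq_length]

    # Pad short sequences with zeros; the mask marks the real tokens.
    sequence_length = len(input_ids)
    padding = [0] * (max_seq_length - sequence_length)
    input_ids = input_ids + padding
    input_mask = [1] * sequence_length + padding
    segment_ids = [0] * max_seq_length  # always zeros for single-sentence input
    return input_ids, input_mask, segment_ids
```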

I have the labeled data in a CSV with two columns, “target” and “sentence”. The former is simply 1 if the sentence is about soccer and 0 if it’s not. Using the create_bert_input function defined earlier, the sentences are converted into the three numpy arrays the model expects.
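A minimal version of that conversion could look like this, reusing the tokenizer and create_bert_input from above and assuming pandas; the file name and the encode_sentences helper are placeholders of mine:

```python
import pandas as pd

def encode_sentences(sentences, max_seq_length=256):
    """Turn raw sentences into the three numpy arrays the model expects."""
    ids, masks, segments = [], [], []
    for sentence in sentences:
        input_ids, input_mask, segment_ids = create_bert_input(
            sentence, tokenizer, max_seq_length)
        ids.append(input_ids)
        masks.append(input_mask)
        segments.append(segment_ids)
    return [np.array(ids), np.array(masks), np.array(segments)]

df = pd.read_csv("soccer_texts.csv")  # columns: "target" (0/1) and "sentence"
X = encode_sentences(df["sentence"])
y = df["target"].values
```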

I didn’t experiment with different loss functions or optimizers but just (blindly) selected Adam and binary cross-entropy. To limit training costs and save time, I also added an EarlyStopping callback to stop training when the validation accuracy plateaus.
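In code, that setup could look roughly like this, monitoring validation accuracy and using the patience of one mentioned below:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Stop training once validation accuracy stops improving.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=1, restore_best_weights=True)
```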

I trained the model on Google Cloud’s AI Platform, which offers hosted Jupyter notebooks with GPU backends. Training for one epoch took roughly an hour on a single Nvidia V100, and after only two epochs the validation accuracy plateaued at 98.6%. With the patience set to one in the EarlyStopping callback, the model trained for four epochs before stopping.
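The training call itself might then look like this; the validation split, batch size, and epoch cap are my assumptions, as the article doesn’t state them:

```python
history = model.fit(
    X, y,
    validation_split=0.1,   # hold out part of the data for validation
    epochs=10,              # EarlyStopping ends training well before this
    batch_size=32,
    callbacks=[early_stopping],
)
```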

Red: Training loss, Blue: Validation loss

The graph above shows the losses: the red line is the training loss and the blue line the validation loss. The accuracy is surprisingly high, so I’m pretty sure the model has already overfitted. Or my training data is simply easy, containing only clear examples of both classes (soccer and non-soccer) and no hard ones, e.g. no texts about American football, which could easily be confused with soccer (a.k.a. European football).

Evaluate with other languages

The original goal was to build a language-agnostic classifier, so I evaluated the model with Finnish and German texts as well. I had 20–30k positive and negative soccer texts in each language, and the model classified them correctly with >80% accuracy.
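Assuming the Finnish and German sets are stored in the same CSV layout as the training data (the file names are placeholders), the evaluation boils down to reusing the encode_sentences helper from above:

```python
for lang in ["fi", "de"]:
    eval_df = pd.read_csv(f"soccer_texts_{lang}.csv")  # same "target"/"sentence" layout
    loss, accuracy = model.evaluate(
        encode_sentences(eval_df["sentence"]), eval_df["target"].values)
    print(f"{lang}: {accuracy:.3f}")
```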

Measuring the precision with random texts

I also classified a set of random Finnish texts with the model to better understand its precision. Out of ~2000 texts, it classified 14 as soccer, and 11 of those were correct according to my manual review, which is not too bad. Admittedly, this was a small sample, but still.
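A sketch of that check, using a 0.5 decision threshold; the threshold and the file name are my assumptions, as the article only reports the resulting counts:

```python
random_df = pd.read_csv("random_texts_fi.csv")  # unlabeled random Finnish texts
scores = model.predict(encode_sentences(random_df["sentence"]))[:, 0]

# Print the texts the model considers soccer-related, for manual review.
for score, text in zip(scores, random_df["sentence"]):
    if score > 0.5:
        print(round(float(score), 4), text[:120])
```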

Here are some of the texts (in Finnish) that were classified as soccer. Note that one Formula 1 text (the second item) was incorrectly classified as soccer for some reason.

  • 0.981406 Viistosti kohti maalia […]. Lue lisää.
  • 0.9838743 Ohjeistus oli selvä, että meidän pitää pysyä täällä, joten olin hieman yllättynyt, että jotkut matkustivat koteihinsa, ranskalainen kommentoi . Ferrarille varoitus oli toinen, sillä Sebastian Vettel oli nähty viime viikolla keskustelemassa Red Buliin Christian Hornerin ja Helmut Markon kanssa ilman maskia .
  • 0.8813341 ManU-kamppailu mukaan luettuna sen taivalta tämän kauden liigassa on jäljellä vain 13 ottelua. ManUlle ottelu merkitsee lähinnä kunniasta pelaamista ja Mestarien liigaan oikeuttavien sijojen jahtaamista.
  • 0.9545789 Keown (kuvassa) korvaa kapteenina selkävaivaisen Tony Adamsin, jonka vamma tuli maanantain aamuharjoituksissa. Aiemmin olivat loukkaantumisen takia jo sivuun jääneet puolustaja Graeme Le Saux, varakapteeni David Beckham ja Steven Gerrad.
  • 0.6568378 Saa nähdä värjääkö hän koko päänsä kultaiseksi, jos Ranska vie mestaruuden.

The last one is quite a nice example because it’s hard even for a human to know the context. For non-Finnish speakers, the text says: “Let’s see if he dyes his hair gold if France wins the championship.” This text is from an article about Paul Pogba’s haircut during the 2014 FIFA World Cup, so the model got it right. But most likely the model has learned a strong positive correlation between France and soccer from the training data, so it will probably misclassify other texts about France.
