An article for NLP and language lovers
A few months ago, Google AI open-sourced BERT, a large pre-trained language model that can be easily fine-tuned to solve common NLP tasks like classification or named entity recognition. Though checking out the initial English model was interesting, I got excited when the multilingual model came out — covering 104 different languages including German, Russian, Arabic and Japanese. Two questions came to mind: Does this solve all my German NLP problems now? And how exactly can a single model cope with so many different languages?
The input text gets split into meaningful word pieces before it is fed to the BERT model:
He's your dentist? --> He ' s your den ##tist ?

The special characters ## mark the continuation of a word. For the English examples I had seen so far, the splits always looked perfect. Accomplishing the same for 104 languages with different alphabets sounded crazy to me. Driven by curiosity, I installed Hugging Face’s PyTorch implementation of BERT to try it out.
To be honest, this was not what I expected. Neither -o nor -ätzen is a common German suffix; the words seemed to be split in somewhat arbitrary ways. Checking the tokenizer code revealed why:
- Basic Tokenization:
The sentence is split on spaces and punctuation, and each Chinese character is treated as a separate word.
- Word Piece Tokenization:
Each word is split into word pieces that are part of BERT’s vocabulary. If the word as a whole is not in the vocabulary, the tokenizer searches for the longest prefix that is; the same procedure is then applied to the remainder of the word.
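The greedy longest-prefix matching described above can be sketched in a few lines. This is a simplified version of the word piece step (the real tokenizer handles a maximum word length and Unicode normalization); the vocabulary below is a toy stand-in, not BERT’s actual one:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-prefix-match word piece tokenization.

    Continuation pieces are stored in the vocabulary with a '##' prefix.
    Returns ['[UNK]'] if some part of the word cannot be matched at all.
    """
    pieces, start = [], 0
    while start < len(word):
        end, current = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark continuation of a word
            if piece in vocab:
                current = piece       # longest prefix found
                break
            end -= 1                  # shorten the candidate and retry
        if current is None:
            return ["[UNK]"]
        pieces.append(current)
        start = end                   # continue with the rest of the word
    return pieces

# Toy vocabulary illustrating the article's examples:
vocab = {"den", "##tist", "hall", "##o"}
print(wordpiece_tokenize("dentist", vocab))  # ['den', '##tist']
print(wordpiece_tokenize("hallo", vocab))    # ['hall', '##o']
```

The second call mirrors the Hallo example: because the whole word is missing from the vocabulary, the tokenizer happily falls back to the (likely English) prefix.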
So apparently, the German word Hallo is not part of the vocabulary, while the prefix Hall is — even though it is not a German word, but (most likely) an English one! There is no language detection: the word piece tokenizer can happen to mix up languages.
The basic tokenization is straightforward for those languages that separate words with spaces — it worked perfectly in my German example above. But what about languages without spaces, like Japanese?
After the multilingual model was released, the basic tokenizer was changed to split not only on spaces and punctuation, but additionally on every Chinese character. Japanese uses Chinese characters (like 日本語), but also two syllabic alphabets (hiragana, like はをいません, and katakana, like スペース); sequences of syllable characters simply don’t get split at all here. The basic tokenizer cannot appropriately split words if the language doesn’t use spaces.
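The special treatment of Chinese characters works by padding whitespace around every character in the CJK ideograph ranges before the whitespace split. A simplified sketch (the real basic tokenizer checks a few more Unicode blocks than listed here):

```python
def _is_cjk_ideograph(ch):
    # A subset of the CJK Unified Ideograph blocks; the actual
    # tokenizer covers several additional extension blocks.
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF
            or 0xF900 <= cp <= 0xFAFF or 0x20000 <= cp <= 0x2A6DF)

def split_cjk(text):
    """Pad spaces around CJK ideographs, then split on whitespace."""
    padded = []
    for ch in text:
        padded.append(f" {ch} " if _is_cjk_ideograph(ch) else ch)
    return "".join(padded).split()

print(split_cjk("日本語です"))  # ['日', '本', '語', 'です']
print(split_cjk("スペース"))    # ['スペース']
```

The output shows exactly the issue from the article: the three Chinese characters are separated, but the hiragana です and the katakana スペース pass through as unsplit chunks.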
Trying out the code helped me to be aware of the pitfalls of BERT’s tokenization. Yet one piece was still missing: How was the vocabulary collected? I was surprised to find out that Hallo is not in the vocabulary, while names like Hannah or Piotr and a total of 2321 numbers made it in. Fun fact: the English-only model has a vocabulary size of 28,996 tokens, while the multilingual model has “only” 119,547 tokens for all languages together!
The answer can be found in this paper:
Given a training corpus and a number of desired tokens D, the optimization problem is to select D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model.
Aha! The selection of vocabulary is data-driven, which leads to the question of what data the multilingual BERT model was trained on. According to the official repository, the entire Wikipedia dump for the supported languages was used. The different languages are of course not equally represented in Wikipedia, which is why high-resource languages like English were under-sampled and low-resource languages were over-sampled. Nevertheless, languages with large Wikipedias dominate the training data and are, therefore, more likely to make it into the vocabulary.
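The repository describes this re-balancing as exponentially smoothed weighting: each language’s share of the data is raised to a power below one and renormalized, which shrinks the gap between high- and low-resource languages. A small sketch of that idea (the corpus sizes are made up; the smoothing exponent of 0.7 follows the value reported in the multilingual BERT README):

```python
def smoothed_sampling_probs(sizes, s=0.7):
    """Exponentially smoothed sampling probabilities.

    Each language's raw share is raised to the power `s` (< 1)
    and the results are renormalized to sum to one.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** s for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Illustrative (made-up) corpus sizes, e.g. millions of words:
sizes = {"en": 2500, "de": 1000, "is": 10}
probs = smoothed_sampling_probs(sizes)
# English's share drops below its raw proportion (~0.71),
# while Icelandic's rises above its raw proportion (~0.003).
```

With s = 1 the raw proportions are recovered; the smaller s gets, the closer the languages move to being sampled uniformly.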
The English Wikipedia is actually more than twice as large as the German one. Does that mean there are too few German words in the vocabulary to solve my NLP tasks? Time for an experiment!
This code checks, for a given input text, how many of its words are in the BERT vocabulary and how many are not. Here at omni:us, we care about analyzing insurance documents — so I decided to use an English and a German text from BERT’s training set (the Wikipedia articles about Liability insurance / Haftpflichtversicherung), as well as an actual German insurance document as input.
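Such a check boils down to counting which whole words appear as tokens in the vocabulary. A minimal sketch, using a toy vocabulary in place of BERT’s real vocab file (with the actual model you would load the released vocab.txt instead):

```python
import re
from collections import Counter

def vocab_coverage(text, vocab):
    """Count how many words of `text` are whole tokens in `vocab`."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    in_vocab = sum(c for w, c in counts.items() if w in vocab)
    oov = {w: c for w, c in counts.items() if w not in vocab}
    return in_vocab, sum(oov.values()), oov

# With the real model, the vocabulary would be loaded from a file, e.g.:
#   vocab = {line.strip() for line in open("vocab.txt", encoding="utf-8")}
# Here a toy vocabulary stands in for it:
vocab = {"die", "kosten", "sind", "hoch"}
hits, misses, oov = vocab_coverage("Die Kosten sind sehr hoch.", vocab)
print(hits, misses, oov)  # 4 1 {'sehr': 1}
```

Note that lowercasing matches the uncased models; for the cased multilingual model the comparison would keep the original casing.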
Liability insurance article (English):
- 5668 in vocab (e.g. insurance, risk, imposed)
- 788 not in vocab (e.g. liability, complaint, insufficient)
Haftpflichtversicherung article (German):
- 1236 in vocab (e.g. Kosten, Bedingungen, verpflichtet)
- 579 not in vocab (e.g. Kriterien, Kfz, vertraglichen)
Insurance document (German):
- 2941 in vocab (e.g. Beiträge, Beginn, jeweils)
- 1151 not in vocab (e.g. Versicherung, Gesamtbetrag, beinhalten)
After diving into the topic a bit, here is my conclusion: the multilingual word piece tokenization didn’t convince me at first, as I could quickly come up with examples that didn’t work. Then it occurred to me that even an inaccurately tokenized text might be an easier input for the model than pure characters or a lot of out-of-vocabulary words. The authors of the word piece idea confirmed my thoughts:
This method provides a good balance between the flexibility of “character”-delimited models and the efficiency of “word”-delimited models.
Well, of course, expecting a pre-trained model to know very specific vocabulary for any domain and any language is unrealistic. The amazing part about multilingual BERT is that we get a model for free that already understands the usage of common words like if, except, any, not — something that is very difficult to achieve on small application datasets. Whether BERT can also master difficult domain language after fine-tuning remains to be seen. I think it’s worth giving it a try!