NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece

Pierre Guillou
13 min read · Apr 5, 2021


In some cases, it may be crucial to enrich the vocabulary of an already trained natural language model with terms from a specialized domain (medicine, law, etc.) in order to perform new tasks (classification, NER, summarization, translation, etc.). While the Hugging Face library makes it easy to add new tokens to the vocabulary of an existing tokenizer like BERT WordPiece, those tokens must be whole words, not subwords. This article explains why, and how to obtain these new tokens from a specialized corpus.

The notebook of this tutorial is available online (github | colab) and the main code is published in the annex of this post.

Context

Some of my current research in NLP (Natural Language Processing), both as an Associate Researcher at the AI Lab (UnB, Brazil) and as an AI consultant in the private sector, concerns the adaptation of already trained natural language models to specialized domains (medicine, law, etc.) in order to perform new tasks (classification, NER, summarization, translation, etc.).

Domain-Specific vocabulary

The most used NLP methods consist either in fine-tuning an already trained natural language model on a new corpus without modifying its tokenizer vocabulary, only its embeddings (BioBERT), or in training a natural language model from scratch on a new corpus (SciBERT). These methods are not necessarily the most efficient: the first produces a high number of subword tokens per tokenized text, while the second implies high training costs in terms of training data size and computation time.

An intermediate method consists in fine-tuning an already trained natural language model on a new corpus while adding to its existing tokenizer vocabulary the vocabulary specific to the domain of this new corpus (exBERT).

In order to be able to use this last method, I was therefore interested in how to find this domain-specific vocabulary and add it to an existing one, in the case of a BERT model and its WordPiece tokenizer, which are frequently used in NLP.

Hmmm… let’s take a “human” example

[ 1. natural language model ] Take for example the case of a Frenchman (let’s call him… Pierre ;-) who arrives in a new country (let’s say… the United States) and decides to learn English. After a few (months/years) of effort, he will have learned everyday English.

[ 2. fine-tuning of the model on a specialized domain ] Now that Pierre knows everyday English, he decides to work in the medical domain. His job will consist of qualifying the content of medical papers written in English (dates, people, places, drugs, medical acts, etc.). To succeed in this new job, he will first have to deepen his knowledge of English by studying the medical domain. Thus, after a few (days/weeks/months) of study, Pierre will have enriched his English vocabulary with words specific to medicine and improved his understanding of medical papers.

[ 3. fine-tuning of the specialized model to a new NLP task ] Then, with new efforts and time, this new knowledge will allow him to be able to carry out his new task :-)

Voilà. My research therefore consists in condensing into a few days (or even hours when possible) what can take years for a human. Let’s take a look at how.

Note: this tutorial and the associated notebook were written in English, and all the examples, the model and the tokenizer are in English as well. However, the whole method presented here can be applied to any language, such as Portuguese for example :-) See the code to use this method in Portuguese at the end of the post.

Tutorial preview

What this tutorial contains

This tutorial covers the following steps which correspond to points 1 and (partially) 2 presented above:

  • downloading an already trained natural language model,
  • increasing its vocabulary by adding words from the domain of specialization (for example, specific and recurring terms in medical papers).

What this tutorial does not contain

The fine-tuning of the natural language model with an enriched vocabulary on the specialized corpus (for example, medical papers), which is the second part of point 2 presented above, and the fine-tuning of this new model on a new task such as Named Entity Recognition (NER) in documents of this specialized domain (point 3), are not part of this tutorial.

This was not necessary because Hugging Face has published online scripts and notebooks (such as How to fine-tune a model on language modeling, mentioned below) which allow these 2 steps to be carried out with few modifications to their code.

Be careful about the following point

We must draw your attention to the following point: if the number of specialized words added to the vocabulary of the downloaded natural language model is small as a percentage of the existing ones (1% for example), using the fine-tuning method proposed by Hugging Face in its notebook (i.e., How to fine-tune a model on language modeling) certainly remains relevant.

On the other hand, if the increase in the specialized vocabulary is large, this fine-tuning technique presents the danger of Catastrophic Forgetting, in addition to requiring a specialization corpus of several GB. It will then be necessary to opt for a more elaborate technique for fine-tuning a natural language model towards a specialized domain, such as exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources (Nov. 2020).

1. Download a natural language model

This is the easiest part. Thanks to Hugging Face, there is an online model hub with natural language models in a large number of languages.

For the record, a natural language model is an NLP model that has learned a language through training: using a corpus containing GB of texts (Wikipedia, Web pages, etc.), the model has been trained to guess (via a probability) a word in a sentence from its context.

Thus, with a few lines of code, it is very easy to download a model that has already learned a language such as English. Let’s download a BERT base cased model in English.

# Install the latest Hugging Face libraries (datasets & transformers)
!pip install datasets git+https://github.com/huggingface/transformers/

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)

Easy, no?
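
As a quick sanity check (our own addition, not part of the original notebook), you can verify that the downloaded model behaves as a masked language model with the fill-mask pipeline; the example sentence is purely illustrative:

# Quick sanity check (our own addition): the downloaded model should be able
# to fill in a masked token in an everyday English sentence.
from transformers import pipeline

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for prediction in fill_mask('The patient was admitted to the [MASK] yesterday.'):
    print(prediction['token_str'], round(prediction['score'], 3))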

2. Increasing the model vocabulary by adding words from the domain of specialization into the tokenizer vocabulary

Things are getting trickier here.

How to add new tokens to an existing vocabulary

First of all, let’s start at the end: if we have a list of words (for example, specific words from a specialized domain such as medicine), it is (very) easy to add it to the vocabulary of the downloaded natural language model thanks to the Transformers library from Hugging Face.

In order to do that, just use the tokenizer.add_tokens() method to add this list of words to the tokenizer’s existing vocabulary. If a word from this list already belongs to the existing vocabulary, it will not be added, thus guaranteeing that only words not already present are added. Here is the corresponding code:

# Let's increase the vocabulary of the BERT model and tokenizer
new_tokens = [token1, token2, token3, ..., token n]  # list of new (whole-word) tokens
num_added_toks = tokenizer.add_tokens(new_tokens)
print('We have added', num_added_toks, 'tokens')

# Notice: resize_token_embeddings expects to receive the full size of the new vocabulary,
# i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))

Observation from Hugging Face: when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer. In order to do that, please use the resize_token_embeddings() method.
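
If you want to convince yourself that the resize worked (a small check of our own, not taken from the Hugging Face documentation), you can compare the size of the model input embedding matrix with the tokenizer length:

# Small check (our own addition): the number of rows of the input embedding
# matrix should now equal the tokenizer vocabulary length.
embedding_matrix = model.get_input_embeddings().weight
print('tokenizer length:', len(tokenizer))
print('embedding matrix shape:', tuple(embedding_matrix.shape))
assert embedding_matrix.shape[0] == len(tokenizer)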

That’s it? Nothing more to do than fine-tune our model with its increased vocabulary? Hummm… let’s check out the nature of the new tokens and the method used to find them.

Vocabulary of a BERT tokenizer: subwords and words

Each tokenizer is different both in its method of obtaining a list of tokens (its vocabulary) and in the nature of these tokens. BERT for example uses a WordPiece tokenizer, i.e. a subword-and-word tokenizer (paper: Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (Google, Sept. 2016)).

A subword is a sequence of characters that, combined with one or several others, forms a word. When a subword does not match the start of a word, it begins with ##.
For example, the downloaded BERT model vocabulary does not contain the words COVID and hospitalization. The tokenizer therefore tokenizes them with subwords: 3 tokens for the word COVID and 2 for hospitalization.

print(tokenizer.tokenize('COVID'))
print(tokenizer.tokenize('hospitalization'))
['CO', '##VI', '##D']
['hospital', '##ization']

However, it is expected that a natural language model specialized in the medical domain has these 2 words in its vocabulary, and therefore does not have to use subwords to tokenize them. Thanks to the tokenizer.add_tokens() method presented above, it is easy to insert these 2 words into the existing vocabulary and check that it works:

# Let's increase the vocabulary of the BERT model and tokenizer
new_tokens = ['COVID', 'hospitalization']
num_added_toks = tokenizer.add_tokens(new_tokens)
# Notice: resize_token_embeddings expects to receive the full size of the new vocabulary,
# i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
print(tokenizer.tokenize('COVID'))
print(tokenizer.tokenize('hospitalization'))
['COVID']
['hospitalization']

Great! But if we want to specialize a natural language model for a specific domain like the medical one, there will certainly be many more than 2 words to add to the vocabulary of the downloaded model. How do we get them?

Obtain the list of tokens of the specialized domain

A commonly used technique is to train a tokenizer of the same nature (i.e., a BERT WordPiece tokenizer here) on the specialized corpus, which makes it possible to obtain a vocabulary specific to this corpus; this vocabulary can then be added to the existing one (at least, the tokens not already present in the initial vocabulary) using the tokenizer.add_tokens() method presented above.

Observation: moreover, Hugging Face gives us the script to train a BERT tokenizer on a corpus: All together: a BERT tokenizer from scratch.
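
For reference, here is a minimal sketch of that step with the Hugging Face tokenizers library (the file name corpus.txt and the vocab_size value are placeholders of our own, not taken from the original script):

# Minimal sketch of training a BERT WordPiece tokenizer on a specialized corpus.
# 'corpus.txt' and vocab_size are placeholder values to adapt to your own corpus.
from tokenizers import BertWordPieceTokenizer

new_wp_tokenizer = BertWordPieceTokenizer(lowercase=False)
new_wp_tokenizer.train(files=['corpus.txt'], vocab_size=10000, min_frequency=2)
# the vocabulary learned on the specialized corpus contains subwords AND words
corpus_vocab = list(new_wp_tokenizer.get_vocab().keys())
print('number of tokens learned on the corpus:', len(corpus_vocab))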

As easy as that? No. There is a problem due to the nature of the tokens of a subword tokenizer.

The need to add only words, not subwords

If we apply the previous method, the list of new tokens (tokens from the corpus of the specialized domain that are not already present in the existing vocabulary) contains subwords in addition to words. Adding this list to the vocabulary of an already trained WordPiece tokenizer will then cause errors. More precisely, the new subwords cause these errors because, since they do not come from the training of the initial tokenizer, they do not fit the logic of its method (tokens ordered by frequency in the tokenizer vocabulary).

To demonstrate this, let’s continue with our example of the words COVID and hospitalization. Instead of adding only these 2 words as done above, let’s train a new BERT WordPiece tokenizer on 2 Wikipedia pages dedicated to COVID (COVID-19 and COVID-19 pandemic) using the Hugging Face script (All together: a BERT tokenizer from scratch), and add the new tokens (subwords and words) to the vocabulary of the initial tokenizer as explained above (only tokens different from those of the initial vocabulary are added by the tokenizer.add_tokens() method): it’s a disaster!

# Let's increase the vocabulary of the BERT model and tokenizer
# new_tokens = (see our notebook) list of new tokens found by a new BERT WordPiece tokenizer
# trained on 2 Wikipedia pages dedicated to COVID (COVID-19 and COVID-19 pandemic)
num_added_toks = tokenizer.add_tokens(new_tokens)
# Notice: resize_token_embeddings expects to receive the full size of the new vocabulary,
# i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
print(tokenizer.tokenize('COVID'))
print(tokenizer.tokenize('hospitalization'))
['CO', '##VI', '##D']
['ho', 'sp', 'it', '##a', 'li', 'z', '##ation']

The word COVID is tokenized into 3 tokens and, except for the first one, they all start with ##: that’s correct. However, the tokenization of the word hospitalization did not work: it is badly tokenized into several subwords (after the first) with… and without ##!

Sadly, our enriched tokenizer is not working well :-( The problem comes from the new tokens that are subwords.

The solution: use a word tokenizer like spaCy to find new tokens, not a subword tokenizer

spaCy is a well-known word tokenizer. Let’s use it to find the most frequent words of our corpus, instead of a WordPiece tokenizer which also generates subwords.

Observation: here, the expression “most frequent words” means “the tokens present in most of the documents”.

You will find all the code in our notebook. In summary, here are the main steps of this process (a condensed sketch in Python follows the list):

  • Initialize the spaCy tokenizer with the general vocabulary of the language of the model (which is the same as that of the corpus: here, English).
  • Get the list of tokens of your documents using the spaCy tokenizer (we do not keep stop words, punctuation, etc.): these tokens are only whole words!
  • Using the scikit-learn library, get the IDF (Inverse Document Frequency) of each of these tokens.
  • Thanks to the IDFs, sort the tokens in a list from the most frequent in the documents to the least frequent.
  • Decide which proportion of these new tokens will represent your specialized vocabulary and add them to the existing tokenizer vocabulary.
  • Resize the model embeddings matrix so that it matches the new tokenizer size (as many new embedding vectors as there are new tokens are appended to the token embedding vectors of the existing vocabulary).
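
To make these steps concrete, here is a condensed sketch for the English case of this article (the complete Portuguese version is in the annex at the end of the post). The documents list, the tiny two-sentence corpus and the 1% threshold are placeholders of our own to adapt to your corpus; the document-frequency computation simply inverts scikit-learn’s smoothed IDF formula.

# Condensed sketch of the steps above for English. The `documents` list is a
# placeholder for your own specialized corpus, and the 1% threshold is illustrative.
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner', 'attribute_ruler', 'lemmatizer'])

def spacy_word_tokenizer(document):
    # keep only plain words: no stop words, no punctuation, no whitespace tokens
    return [token.text for token in nlp(document)
            if not token.is_stop and not token.is_punct and token.text.strip() != '']

documents = ['COVID-19 vaccination reduces hospitalization rates.',
             'Hospitalization of COVID-19 patients decreased after vaccination.']  # placeholder corpus

tfidf = TfidfVectorizer(lowercase=False, tokenizer=spacy_word_tokenizer)
tfidf.fit(documents)

# recover the document frequency (in % of documents) from the smoothed IDF:
# idf = ln((1 + n) / (1 + df)) + 1  =>  df = (1 + n) / exp(idf - 1) - 1
n_docs = len(documents)
df_pct = 100 * ((1 + n_docs) / np.exp(tfidf.idf_ - 1) - 1) / n_docs
order = np.argsort(-df_pct)  # most frequent tokens (in documents) first
vocab = np.array(tfidf.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
new_tokens = list(vocab[order][df_pct[order] >= 1])  # tokens present in at least 1% of documents

# add the new words to the existing vocabulary and resize the embeddings matrix
num_added_toks = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
print('We have added', num_added_toks, 'tokens')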

Voilà. :-)

Let’s check the impact of our enriched tokenizer

Let’s use a text about COVID taken from a newspaper site (not from Wikipedia).

# source: https://edition.cnn.com/2021/04/05/health/us-coronavirus-monday/index.html
text = 'Experts say Covid-19 vaccinations in the US are going extremely well -- but not enough people are protected yet and the country may be at the start of another surge. The US reported a record over the weekend with more than 4 million Covid-19 vaccine doses administered in 24 hours, according to the Centers for Disease Control and Prevention. And the country now averages more than 3 million doses daily, according to CDC data. But only about 18.5% of Americans are fully vaccinated, CDC data shows, and Covid-19 cases in the country have recently seen concerning increases. "I do think we still have a few more rough weeks ahead," Dr. Celine Gounder, an infectious diseases specialist and epidemiologist, told CNN on Sunday. "What we know from the past year of the pandemic is that we tend to trend about three to four weeks behind Europe in terms of our pandemic patterns."'

Now, let’s tokenize this text both with the original BERT tokenizer and its enriched version.

tokens = tokenizer.tokenize(text)
print('number of tokens by the original BERT tokenizer:', len(tokens))
tokens = tokenizer_exBERT.tokenize(text)
print('number of tokens by the enriched tokenizer:', len(tokens))
# number of tokens by the original BERT tokenizer: 203
# number of tokens by the enriched tokenizer: 193

As expected, we find that the enriched tokenizer needs fewer tokens (here, about 5% fewer) to tokenize the text about COVID than the original BERT tokenizer.

To be continued

Now that we have enriched our tokenizer vocabulary with words specific to our corpus, we need to fine-tune the natural language model it is associated with (here, the bert-base-cased model). Indeed, the addition of new words increased the size of the embeddings matrix of the model by the same number: for each new word added, a new embedding vector with random values was appended by the model.resize_token_embeddings(len(tokenizer)) method.

So we need to train (or fine-tune) our model on our corpus so that the model can learn the embeddings of these new words.

Hugging Face provides a script and a notebook to fine-tune a natural language model on a new corpus (How to fine-tune a model on language modeling: script | github | colab), so we have ready-to-use code. However, this code may not be adapted to your situation: if the number of new words (and therefore of new embedding vectors) is high, training with this code may lead to Catastrophic Forgetting by significantly modifying the embedding vectors of the tokens of the initial vocabulary.
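
To give an idea of what that fine-tuning looks like, here is a minimal sketch of the masked-language-modeling objective with the Trainer API. It is not the Hugging Face notebook itself: the in-memory corpus placeholder, the sequence length and the training arguments are illustrative assumptions of our own.

# Minimal sketch of masked-language-model fine-tuning with the Trainer API
# (assumes a small in-memory corpus; the Hugging Face notebook mentioned above
# does this on a real corpus with more care).
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

corpus = ['...']  # placeholder: your specialized documents
dataset = Dataset.from_dict({'text': corpus})
dataset = dataset.map(
    lambda batch: tokenizer(batch['text'], truncation=True, max_length=128),
    batched=True, remove_columns=['text'])

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
training_args = TrainingArguments(output_dir='bert-specialized', num_train_epochs=1,
                                  per_device_train_batch_size=8)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=dataset, data_collator=data_collator)
trainer.train()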

My advice: do a Google search with a query like “fine-tune a pre-trained model for a specific domain”. You will find many interesting articles and documents on this topic. Have fun!

Annex | Main code of this tutorial

This tutorial and the associated notebook were written in English and all the examples, the model and the tokenizer are also in English. However, the entire method presented can be applied to any language. To adapt the following code to another language, choose the spaCy tokenizer and the BERT model for that language.

  • For example, for English: “en_core_web_sm” for the spaCy tokenizer and “bert-base-cased” for the natural language model.
  • If you want to apply this tutorial to Portuguese, choose “pt_core_news_sm” for the spaCy tokenizer and “neuralmind/bert-base-portuguese-cased” (for example) for the natural language model.
# import libraries
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModelForMaskedLM, AutoTokenizer

# initialize our tokenizer with the Portuguese spaCy one
nlp = spacy.load("pt_core_news_sm",
                 exclude=['morphologizer', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

# define the spaCy tokenizer
def spacy_tokenizer(document, nlp=nlp):
    # tokenize the document with spaCy
    doc = nlp(document)
    # remove stop words, punctuation symbols and whitespace-only tokens
    tokens = [
        token.text for token in doc if (
            token.is_stop == False and \
            token.is_punct == False and \
            token.text.strip() != '' and \
            token.text.find("\n") == -1)]
    return tokens

# apply the spaCy tokenizer through scikit-learn
# https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
tfidf_vectorizer = TfidfVectorizer(lowercase=False, tokenizer=spacy_tokenizer,
                                   norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

# fit on the corpus (documents is the list of texts of your specialized corpus)
length = len(documents)
result = tfidf_vectorizer.fit_transform(documents)

# get the IDF of each token
idf = tfidf_vectorizer.idf_

# sort tokens from the most frequent in documents to the least frequent
idf_sorted_indexes = sorted(range(len(idf)), key=lambda k: idf[k])
idf_sorted = idf[idf_sorted_indexes]
# (in scikit-learn >= 1.0, use get_feature_names_out() instead)
tokens_by_df = np.array(tfidf_vectorizer.get_feature_names())[idf_sorted_indexes]

# recover the document frequency (in % of documents) from the smoothed IDF:
# idf = ln((1 + n) / (1 + df)) + 1  =>  df = (1 + n) / exp(idf - 1) - 1
tokens_pct_list = 100 * ((1 + length) / np.exp(idf_sorted - 1) - 1) / length

# choose the proportion of new tokens to add to the vocabulary
pct = 1  # keep all tokens present in at least 1% of the documents
index_max = len(np.array(tokens_pct_list)[np.array(tokens_pct_list) >= pct])
new_tokens = list(tokens_by_df[:index_max])

# import a Portuguese model and tokenizer
model_name = "neuralmind/bert-base-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# add the new tokens to the existing vocabulary (only those not already present)
print("[ BEFORE ] tokenizer vocab size:", len(tokenizer))
added_tokens = tokenizer.add_tokens(new_tokens)
print("[ AFTER ] tokenizer vocab size:", len(tokenizer))
print()
print('added_tokens:', added_tokens)
print()

# resize the embeddings matrix of the model
model.resize_token_embeddings(len(tokenizer))
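
As a last check (our own addition, with an illustrative Portuguese sentence), you can verify that words of the specialized corpus that were added to the vocabulary are now tokenized as single tokens:

# Final check (our own addition): if words such as 'hospitalização' were among the
# new tokens added from your corpus, they should now appear as single tokens.
sample = 'A hospitalização de pacientes com COVID-19 aumentou.'  # illustrative sentence
print(tokenizer.tokenize(sample))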

__________

About the author: Pierre Guillou is an AI consultant in Brazil and France, an Associate Researcher in Deep Learning and NLP at the AI Lab (UnB) and professor of Artificial Intelligence (UnB). Get in touch with him through his LinkedIn profile.
