How to Reduce Your Transformers Model Size by Removing Unwanted Tokens and Word Embeddings

codingOtter
4 min read · May 14, 2022


An example of removing unwanted tokens and corresponding word embeddings from pretrained transformers models to reduce model size.

Model Size Before and After Modification

Why Would Anyone Want to Do This (my case)

Multilingual NLP models are usually larger than their monolingual counterparts because of their bigger word embeddings, which also makes them slower to load. Sometimes, however, you don’t need every language in a multilingual model, and if your use case has model size restrictions, removing the unnecessary languages is an easy way to shrink it.

In my case, I wanted to fine-tune a LayoutXLM model (the multilingual version of LayoutLMv2) and deploy it to AWS Lambda. My inference data will only be in 4 languages (English, Chinese, Japanese, and Korean), while the pretrained model was trained on 53 languages. The original model weights take up 1.38 GB, which makes model loading very slow in the AWS Lambda environment. By removing the unwanted languages from the model and tokenizer, I reduced the weight size, and with it the loading time, by 29% (1.38 GB → 995 MB).

I will demonstrate how to remove unwanted languages from a pretrained LayoutXLM model in this blog post. If you want to run the code I share, you will need to install transformers, pytorch, torchvision, and detectron2 as dependencies.

Steps you need to take

If I only want to keep tokens in English, Chinese, Japanese, and Korean, I need to take the steps below.

  • Identify tokens that are NOT in the languages of interest
  • Delete the identified tokens from the pre-trained tokenizer
  • Delete the corresponding word embeddings from the pre-trained model

Identify Unwanted Tokens

We can identify tokens in unwanted languages by looking at the Unicode values of their characters.

We can then write a function that tells us whether a specific token should be removed: it returns True if we want to keep the token and False if we want it removed.
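Here is a minimal sketch of such a check. The exact Unicode ranges, the handling of the SentencePiece prefix "▁", and the name keep_token are my assumptions; adjust them to the languages you need.

```python
# Unicode ranges (start, end) for characters we want to keep:
# basic Latin (English), CJK ideographs, Japanese kana, and Korean Hangul.
WANTED_RANGES = [
    (0x0000, 0x007F),  # Basic Latin (English letters, digits, punctuation)
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (Chinese / Japanese kanji)
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3000, 0x303F),  # CJK symbols and punctuation
]

def keep_token(token: str) -> bool:
    """Return True if every character of the token falls in a wanted range."""
    # Strip the SentencePiece word-boundary marker "▁" before checking characters.
    text = token.replace("\u2581", "")
    if not text:  # the bare "▁" token itself
        return True
    for ch in text:
        code = ord(ch)
        if not any(start <= code <= end for start, end in WANTED_RANGES):
            return False
    return True
```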

Remove Tokens from Tokenizer

Before removing tokens, let’s see what the LayoutXLM tokenizer looks like. We can inspect the tokenizer configuration files directly by saving them first.
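For example, assuming the microsoft/layoutxlm-base checkpoint and the fast tokenizer class:

```python
from transformers import LayoutXLMTokenizerFast

# Load the pretrained tokenizer and dump its files (tokenizer.json,
# sentencepiece model, special tokens map, ...) so we can inspect them.
tokenizer = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")
tokenizer.save_pretrained("original_tokenizer")

print(len(tokenizer))  # vocabulary size of the original multilingual tokenizer
```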

Next, we can iterate through the vocabulary and check whether each token is wanted or not. You can see that only 43% of the tokens are in the languages of interest.
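A sketch of that loop, reusing the keep_token function from above (wanted_index_list and unwanted_tokens are illustrative names, not the original code):

```python
# get_vocab() maps token string -> token id.
vocab = tokenizer.get_vocab()

wanted_index_list = []  # ids of tokens to keep (used later to slice the embeddings)
unwanted_tokens = []    # token strings to drop from the tokenizer

for token, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
    if keep_token(token):
        wanted_index_list.append(idx)
    else:
        unwanted_tokens.append(token)

print(f"Keeping {len(wanted_index_list) / len(vocab):.0%} of the tokens")
```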

Now we have all our unwanted tokens in a list. We can pop them out of the tokenizer vocabulary. After modifying the vocabulary, we want to save the modified tokenizer.json and load a tokenizer with the new vocabulary.
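One way to do this, under the assumption that the fast tokenizer stores its vocabulary in tokenizer.json under ["model"]["vocab"] as a list of [token, score] pairs (check your own saved file, since the layout differs between tokenizer types):

```python
import json
import shutil

# Start from a copy of the original tokenizer folder.
shutil.copytree("original_tokenizer", "modified_tokenizer", dirs_exist_ok=True)

with open("modified_tokenizer/tokenizer.json", encoding="utf-8") as f:
    tokenizer_json = json.load(f)

# Drop the unwanted tokens from the vocabulary list.
unwanted = set(unwanted_tokens)
tokenizer_json["model"]["vocab"] = [
    entry for entry in tokenizer_json["model"]["vocab"] if entry[0] not in unwanted
]

with open("modified_tokenizer/tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(tokenizer_json, f, ensure_ascii=False)

# Reload the tokenizer (the fast tokenizer reads tokenizer.json when it is present).
new_tokenizer = LayoutXLMTokenizerFast.from_pretrained("modified_tokenizer")
print(len(new_tokenizer))
```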

Now all the files needed for loading the modified tokenizer are in the modified_tokenizer folder.

The length of the tokenizer has become 109003. If you inspect the vocabulary in tokenizer.json, you can confirm that all tokens in the unwanted languages have been removed.

Remove Corresponding Word Embeddings from Model

After removing the tokens from the tokenizer, we need to adjust the model’s word embeddings (this is what actually reduces the model size). The embeddings are in the same order as the tokenizer vocabulary, so we can use our wanted_index_list to select only the embedding rows we want to preserve.

First, let’s check the original model size and its embedding shape. The original model size is 1476489767 bytes, which is about 1.38 GB.
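For example (LayoutLMv2Model is one of several model classes that share these word embeddings; the checkpoint name is the same as above):

```python
import os

from transformers import LayoutLMv2Model

model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base")
print(model.embeddings.word_embeddings.weight.shape)  # [vocab_size, 768]

# Save the model and add up the file sizes to get the on-disk footprint.
model.save_pretrained("original_model")
total_bytes = sum(
    os.path.getsize(os.path.join("original_model", f))
    for f in os.listdir("original_model")
)
print(total_bytes)
```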

Since we know the indices of the rows we want, we can use torch.index_select to select them and then replace the model embeddings with the selected rows. Note that selected_embedding will be a plain torch.Tensor, so we need to wrap it in a torch.nn.Parameter before assigning it as the new embedding weight.
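A sketch of that replacement, reusing wanted_index_list from earlier (updating num_embeddings and config.vocab_size is my addition to keep the saved model consistent):

```python
import torch

# Pick out only the embedding rows whose token ids we kept,
# in the same order as the new vocabulary.
index = torch.tensor(wanted_index_list, dtype=torch.long)
old_embeddings = model.embeddings.word_embeddings.weight
selected_embedding = torch.index_select(old_embeddings, dim=0, index=index)

# index_select returns a plain tensor, so wrap it in nn.Parameter before assigning.
model.embeddings.word_embeddings.weight = torch.nn.Parameter(selected_embedding)
model.embeddings.word_embeddings.num_embeddings = selected_embedding.shape[0]
model.config.vocab_size = selected_embedding.shape[0]

model.save_pretrained("modified_model")
print(model.embeddings.word_embeddings.weight.shape)
```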

We can confirm that the model’s word embedding matrix is now [109003, 768]. We can also see that the model size has been reduced to 1043340839 bytes (a 29% decrease).

Result

We removed 57% of the tokens and their corresponding word embeddings from the pretrained LayoutXLM model, which reduced the model size by 29%. Because the removed embeddings would never be used during inference anyway, model performance is not affected.

Conclusion

Pretrained multilingual transformers models are sometimes trained on more languages than we need, and we can optimize the model size and loading time by removing the unnecessary ones. There are a lot of resources about how to add tokens to transformers models, and HuggingFace provides easy-to-use methods for that. However, there are almost no posts about how to remove tokens and word embeddings, so I wanted to share my experience.

Of course, you could instead pretrain a model and tokenizer on only the languages you are interested in. However, I didn’t have the data or the time to do that, so I chose to modify a pretrained model.

Reference

How to remove tokens from a pretrained RoBERTa tokenizer: https://github.com/huggingface/transformers/issues/15032

HF LayoutXLM docs: https://huggingface.co/docs/transformers/model_doc/layoutxlm

LayoutXLM paper: https://arxiv.org/abs/1912.13318
