Automatic Model Loading using Hugging Face

From a Large Selection of BERT-based Models

Debanga Raj Neog, Ph.D
Depurr
4 min read · May 18, 2020

--

This article shows how we can use Hugging Face’s Auto classes to reduce the hassle of specifying model details as we experiment with different BERT-based models for natural language processing tasks. From my own experience, writing custom tokenizer and model-loading scripts is really time-consuming, for several reasons:

First, tokenizers vary across NLP models because the models are trained on different corpora with different tokenization schemes, e.g., BERT and RoBERTa use entirely different tokenizer structures (WordPiece vs. byte-level BPE).

Second, even within a single model family there are variants based on cased or uncased text and on how deep the model is, e.g., BERT comes in cased and uncased versions as well as base and large sizes, just to name a few.

Third, most NLP training is essentially transfer learning, so we rely heavily on pretrained weights. Hunting down the correct pretrained weights for a particular variant (e.g., cased) of a particular model type (e.g., RoBERTa) is tedious.

Photo by Romain Vignes on Unsplash

Hugging Face comes to our rescue with a very convenient set of Auto classes, which I will demystify here.

With the Auto classes:

1. We don’t need to hand-code the preprocessing to satisfy the tokenizer of each different BERT-based model.

2. Models can be swapped just by changing a global model_name variable, and the Hugging Face API takes care of the rest.

That’s simply amazing!

Please note that the Hugging Face Auto modules are under active development. If something new has been added to the library that I missed, please mention it in the comments and I will update the article.

How to Code in Python using Auto

Import Packages
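Below is a minimal sketch of the imports used in the rest of this article; torch and transformers are the essentials, while pandas is assumed here only for reading tabular text data.

```python
# Core imports: transformers provides the Auto classes, torch runs the model.
# pandas is assumed here only for loading a CSV of texts.
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer
```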

Get Data
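The article does not fix a particular dataset, so here is a tiny placeholder that reads a CSV with a text column; the file name and column name are purely hypothetical.

```python
# Hypothetical data loading: "train.csv" and the "text" column are placeholders.
train_df = pd.read_csv("train.csv")
texts = train_df["text"].astype(str).tolist()
print(f"Loaded {len(texts)} texts")
```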

Global Variables
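The key global is model_name: everything downstream is driven by this single string. The particular default checkpoint and sequence length below are my assumptions, not prescriptions.

```python
# One global string selects the architecture; change it to switch models.
model_name = "bert-base-uncased"   # e.g., "roberta-base", "distilbert-base-uncased"
max_len = 128                      # maximum sequence length used for padding/truncation
```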

Available Models

Disclaimer: most popular models are compatible, but not all. Just change model_name to check whether yours is.
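One way to run that check is sketched below with a handful of common checkpoint names (not an exhaustive or guaranteed list): if AutoTokenizer can resolve the name, the checkpoint should work with the rest of this pipeline.

```python
# Try loading tokenizers for a few candidate checkpoints to see which are compatible.
candidate_models = [
    "bert-base-uncased",
    "bert-base-cased",
    "roberta-base",
    "distilbert-base-uncased",
    "albert-base-v2",
]

for name in candidate_models:
    try:
        AutoTokenizer.from_pretrained(name)
        print(f"{name}: OK")
    except Exception as exc:
        print(f"{name}: failed ({exc})")
```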

Define Tokenizer

AutoTokenizer

  1. I had to dig a lot to find out how to get the “offset” mapping with the Auto setup. It was recently added as part of PreTrainedTokenizerFast. Please look at this thread for more details.
  2. Offset mapping is not implemented for ALBERT models, so you have to set return_offsets_mapping=False for them.
  3. RoBERTa does not use token_type_ids. Recall from the paper that RoBERTa dropped the next sentence prediction objective, so there is only one segment type and all token type ids can simply be set to the same constant (the Hugging Face tokenizer returns zeros). A tokenizer sketch handling these caveats follows below.
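Here is a minimal tokenizer sketch reflecting those three caveats. It assumes the model_name and max_len globals defined earlier; use_fast=True is what gives a PreTrainedTokenizerFast and therefore offset mappings.

```python
# Load the tokenizer via AutoTokenizer; use_fast=True returns a
# PreTrainedTokenizerFast, which supports return_offsets_mapping.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Caveat 2: offsets are not available for ALBERT, so disable them there.
want_offsets = "albert" not in model_name

encoded = tokenizer(
    "Hugging Face makes switching models easy.",
    max_length=max_len,
    padding="max_length",
    truncation=True,
    return_offsets_mapping=want_offsets,
)

# Caveat 3: RoBERTa has a single segment type, so its token_type_ids
# (when requested) are simply all the same constant.
print(encoded.keys())
```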

Sanity Check Tokenizer
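A quick sanity check I like to run: encode a sentence, look at the subword tokens, and decode back. Nothing here is model-specific, which is exactly the point of the Auto classes.

```python
# Round-trip a sample sentence through the tokenizer.
sample = "Transfer learning with BERT-based models is convenient."
ids = tokenizer.encode(sample, add_special_tokens=True)

print(tokenizer.convert_ids_to_tokens(ids))  # subword pieces plus special tokens
print(tokenizer.decode(ids))                 # should reconstruct the sentence
```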

Define Model

AutoModel
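A sketch of loading the backbone with AutoModel and running a single forward pass; for a task-specific head you could load AutoModelForSequenceClassification (or another AutoModelFor* class) with the same model_name instead.

```python
# The same call works for BERT, RoBERTa, DistilBERT, etc.
model = AutoModel.from_pretrained(model_name)
model.eval()

inputs = tokenizer("Hugging Face makes switching models easy.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs[0]   # shape: (batch, seq_len, hidden_size)
print(last_hidden_state.shape)
```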

Some excellent examples are available here: https://huggingface.co/transformers/usage.html

Now you can do your own transfer learning!

Please leave your comment if you have any questions.
