The Super-Impatient’s Guide to Getting Started with Bort

Cees Roele
Riga Data Science Club
Feb 4, 2021 · 4 min read

“We tested Bort on 23 NLU tasks, and on 20 of them, it improved on BERT’s performance — by 31% in one case — even though it’s 16% the size and about 20 times as fast.” — Adrian de Wynter

This article shows how to overcome the few issues you’ll encounter when you try to get Bort running in the popular and convenient Simple Transformers environment, which is built on top of Hugging Face Transformers.

Bort

First you create something that works well, then you optimise it.

Bort optimises BERT. Note that I follow the capitalisation used in the original articles on Bort, rather than writing “BORT”. To read more about Bort, I suggest starting with “A version of the BERT language model that’s 20 times as fast” by Adrian de Wynter.

My personal interest is this: I want the best language model that can run on my modest 8GB GPU. BERT-large and RoBERTa-large need more memory than that, so they won’t fit. Bort is far smaller while promising similar performance, so I was keen to give it a try.

The original open-source code for Bort has been available on GitHub since October 2020. The instructions for getting started are clear but, as the author admits, “Bort requires a rather unusual environment to run.”

Bort in the Simple Transformers environment

To use Bort as a model with my existing code, I wanted to make it work within Thilina Rajapakse’s Simple Transformers environment. The hard work to make this possible has already been done by Stefan Schweter: his pull request for Bort was committed to the master branch of Hugging Face Transformers on January 27th. It should be part of Transformers as of version 4.3.0, available as model type “bert” with model name “amazon/bort”.

At the time of writing, that version of Transformers has not yet been released, so let’s install it from the master branch.
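One straightforward way to do that is a pip install straight from the GitHub repository:

```
pip install git+https://github.com/huggingface/transformers.git
```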

… and install Simple Transformers:
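The regular PyPI package will do here:

```
pip install simpletransformers
```

With both installed, a quick sanity check that the Bort weights resolve from plain Transformers could look like this (a minimal sketch; AutoModel works here because the checkpoint declares the BERT architecture):

```python
from transformers import AutoModel

# Downloads the Bort weights (a compressed BERT architecture) from the Hugging Face model hub
model = AutoModel.from_pretrained("amazon/bort")
print(model.config.model_type)  # expected: "bert"
```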

Easy. But there is one catch. Let’s consider classification. Normally, we would initialise a ClassificationModel in Simple Transformers using Bort like this:
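Something like the following, a minimal sketch using Simple Transformers’ usual (model_type, model_name) pair with default arguments:

```python
from simpletransformers.classification import ClassificationModel

# Bort is registered in Transformers as a BERT-type model under the name "amazon/bort"
model = ClassificationModel("bert", "amazon/bort")
```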

That gets us the model, but it doesn’t work, because it doesn’t invoke the right tokenizer.

Combining model type and tokenizer

The Hugging Face implementation defines Bort as just another instance of BERT and suggests using the existing RoBERTa tokenizer. This is elegant: no new classes are created when they are not functionally different from existing ones.

However, this brings us into slight trouble with Simple Transformers. There, in the different classes defining tasks, like ClassificationModel, the configuration, model class, and tokenizer are defined together and identified through the model type. As the model type for Bort is “bert”, we will get a BertTokenizer for our Bort model, rather than the needed RobertaTokenizer.

Fortunately, Simple Transformers is prepared to resolve this situation! When initialising a ClassificationModel we can pass the following parameters:

tokenizer_type: The type of tokenizer (auto, bert, xlnet, xlm, roberta, distilbert, etc.) to use. If a string is passed, Simple Transformers will try to initialize a tokenizer class from the available MODEL_CLASSES. Alternatively, a Tokenizer class (subclassed from PreTrainedTokenizer) can be passed.

tokenizer_name: The name/path to the tokenizer. If the tokenizer_type is not specified, the model_type will be used to determine the type of the tokenizer.

What does that mean for Bort? We can now specify the RoBERTa tokenizer alongside the BERT model type:
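In code, that combination looks roughly like this (a sketch assuming the defaults for all other arguments; tokenizer_name points at the same “amazon/bort” checkpoint, which ships a RoBERTa-style vocabulary):

```python
from simpletransformers.classification import ClassificationModel

# Keep the BERT model class, but explicitly ask for the RoBERTa tokenizer
model = ClassificationModel(
    "bert",
    "amazon/bort",
    tokenizer_type="roberta",
    tokenizer_name="amazon/bort",
)
```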

This works!

But only for the ClassificationModel. For other models, e.g. MultiLabelClassificationModel, the tokenizer_type and tokenizer_name keywords are not available.

To get you started if you want to implement this yourself for other model classes - after all, you are super impatient! - below is the relevant code from ClassificationModel. You can port this to other model classes and then install Simple Transformers from your modified source.
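The gist of it, paraphrased rather than quoted verbatim (the exact code lives in the ClassificationModel constructor and may differ slightly between Simple Transformers versions):

```python
# Paraphrased from the tokenizer handling in ClassificationModel.__init__;
# MODEL_CLASSES maps a model-type string to (config class, model class, tokenizer class).
if tokenizer_type is not None:
    if isinstance(tokenizer_type, str):
        # A string like "roberta": look the tokenizer class up in MODEL_CLASSES
        _, _, tokenizer_class = MODEL_CLASSES[tokenizer_type]
    else:
        # A PreTrainedTokenizer subclass was passed in directly
        tokenizer_class = tokenizer_type
else:
    # No tokenizer_type given: fall back to the tokenizer tied to the model type
    _, _, tokenizer_class = MODEL_CLASSES[model_type]

if tokenizer_name is None:
    # No tokenizer_name given: reuse the model name/path
    tokenizer_name = model_name

self.tokenizer = tokenizer_class.from_pretrained(tokenizer_name, **kwargs)
```

Porting the same pattern to, say, MultiLabelClassificationModel means adding the two keyword arguments to its constructor and routing them through the same branching before the tokenizer is loaded.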

With the above I could use Bort with my existing code based on Simple Transformers. Good!

However…

Bort requires a specific fine-tuning algorithm, called Agora, that is sadly not open-sourced yet. It would be very useful for the community if someone tried to implement the algorithm to make Bort fine-tuning work.

What does that mean in practice? It means that fine-tuning Bort will currently give you slightly worse results than BERT. The performance improvements that are possible with Bort will come only when the fine-tuning algorithm is available. I’m impatiently waiting for that!
