Introducing FastBert — A simple Deep Learning library for BERT Models

Kaushal Trivedi
May 17, 2019 · 7 min read

BERT What?

The little Sesame Street muppet has taken the world of Natural Language Processing by storm, and the storm is picking up speed. A number of NLP problems have now been solved by neural network architectures built on top of BERT's contextual representations. To name a few, BERT-based models have pushed the state of the art for SQuAD 2.0 question answering, GLUE multi-task learning, the Google Natural Questions task and biomedical domain-specific tasks (BioBERT).

Google Research open-sourced the TensorFlow implementation of BERT along with the pretrained weights. This opened the door for the amazing developers at Hugging Face, who built the PyTorch port of BERT. With this library, geniuses, i.e. developers and data scientists, can use BERT models for text classification, question answering, language model fine-tuning and more. Yours truly has contributed to the text classification capability by adding the feature for multi-label text classification.

Enter FastBert

FastBert is a deep learning library that allows developers and data scientists to train and deploy BERT-based models for natural language processing tasks, beginning with text classification. The work on FastBert is inspired by fast.ai and strives to make cutting-edge deep learning technologies accessible to the vast community of machine learning practitioners.

With FastBert, you will be able to:

  1. Train (more precisely fine-tune) BERT text classification models on your custom dataset
  2. Tune model hyper-parameters such as epochs, learning rate, batch size, optimiser schedule and more
  3. Save and deploy the trained model for inference (including on AWS SageMaker)

Starting today, FastBert will support both multi-class and multi-label text classification and in due course, it will support other NLU tasks such as Named Entity Recognition, Question Answering and Custom Corpus fine-tuning. I rely on the community to help make this happen :-)


pip install fast-bert

From Source: pip install git+


Import the required packages. Please note that I have not included the usual suspects such as os, pandas, etc.

Define general parameters and path locations for data, labels and pretrained models (just good engineering practice).
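As a rough illustration of this step, the sketch below collects the paths and parameters in one place. The directory names, the `args` keys and all the values are assumptions made for this example, not names required by the library.

```python
from pathlib import Path

# Hypothetical directory layout; adjust to wherever your files live.
DATA_PATH = Path("data")        # train.csv, val.csv, test.csv
LABEL_PATH = Path("labels")     # the label file
MODEL_PATH = Path("models")     # where fine-tuned models get saved
# Example folder name of a downloaded BERT base uncased checkpoint.
BERT_PRETRAINED_PATH = Path("pretrained") / "uncased_L-12_H-768_A-12"

# Illustrative values only; tune them for your own dataset and hardware.
args = {
    "max_seq_length": 512,
    "do_lower_case": True,
    "train_batch_size": 32,
    "learning_rate": 6e-5,
    "num_train_epochs": 4,
    "fp16": True,
    "multi_gpu": True,
    "multi_label": True,
}
```

Keeping these in one dictionary makes it easy to log the exact settings used for each experiment.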


Create a tokenizer object. This is the BPE-based WordPiece tokenizer and is available from the magnificent Hugging Face BERT PyTorch library.

The do_lower_case parameter depends on the version of the pretrained BERT model you use. If you use an uncased model, set this value to True; otherwise set it to False. For this example we have used the BERT base uncased model, and hence do_lower_case is set to True.
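One way to avoid a mismatch between the checkpoint and the flag is to derive the flag from the model name. The helper below is a hypothetical convenience, not part of fast-bert, and it assumes the checkpoint name contains "uncased" or "cased", as the published BERT model names do.

```python
def infer_do_lower_case(model_name: str) -> bool:
    """Return True for uncased checkpoints and False for cased ones."""
    # "cased" is a substring of "uncased", so test for "uncased" explicitly.
    return "uncased" in model_name.lower()


do_lower_case = infer_do_lower_case("bert-base-uncased")
```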

GPU & Device

Training a BERT model requires at least one GPU, and preferably several. In this step we set up the GPU parameters for training.

Note that in future releases this step will be abstracted away from the user, and the library will automatically determine the correct device profile.
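A minimal sketch of such a device check is shown below. It uses standard PyTorch calls (`torch.cuda.is_available` and `torch.cuda.device_count`); the `detect_device` helper and the CPU fallback when PyTorch is not installed are assumptions added so that the snippet runs anywhere.

```python
try:
    import torch
except ImportError:  # keeps the sketch runnable on machines without PyTorch
    torch = None


def detect_device():
    """Return (device_name, n_gpu) for the current machine."""
    if torch is not None and torch.cuda.is_available():
        return "cuda", torch.cuda.device_count()
    return "cpu", 0


device_name, n_gpu = detect_device()
# Only ask for multi-GPU training when more than one GPU is visible.
multi_gpu = n_gpu > 1
```

Values like these can then feed the device and multi_gpu options used later when creating the learner.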


This is an excellent idea borrowed from the fast.ai library. The databunch object takes training, validation and test CSV files and converts the data into the internal representation for BERT. It also instantiates the correct data loaders based on the device profile, batch_size and max_sequence_length.

The databunch object is given the location of the data files and the label.csv file. For each of the data files, i.e. train.csv, val.csv and/or test.csv, the databunch creates a dataloader object by converting the CSV data into BERT-specific input objects. I would encourage you to explore the structure of the databunch object in a Jupyter notebook.
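To make the inputs concrete, here is a sketch that writes a tiny multi-label dataset to disk using only the standard library. The file names, the `text` column, the one-column-per-label layout and the label names are all assumptions for illustration; check the library's sample notebook for the layout it actually expects.

```python
import csv
import tempfile
from pathlib import Path

data_dir = Path(tempfile.mkdtemp())
labels = ["toxic", "obscene", "insult"]  # hypothetical label set

# Label file: one label name per line.
(data_dir / "labels.csv").write_text("\n".join(labels) + "\n")

# Training file: a text column followed by one 0/1 column per label.
rows = [
    {"text": "you are wonderful", "toxic": 0, "obscene": 0, "insult": 0},
    {"text": "you are a fool", "toxic": 1, "obscene": 0, "insult": 1},
]
with open(data_dir / "train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text"] + labels)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the layout.
with open(data_dir / "train.csv", newline="") as f:
    loaded = list(csv.DictReader(f))
```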


Another concept in line with the fast.ai library, BertLearner is the ‘learner’ object that holds everything together. It encapsulates the key logic for the lifecycle of the model, such as training, validation and inference.

The learner object takes the databunch created earlier as input, along with some other parameters such as the location of one of the pretrained BERT models and the FP16 training, multi_gpu and multi_label options.

The learner class contains the logic for the training loop, validation loop, optimiser strategies and key metrics calculation. This helps developers focus on their custom use cases without worrying about these repetitive activities.

At the same time, the learner object is flexible enough to be customised, either through its parameters or by creating a subclass of BertLearner and redefining the relevant methods.

The learner object does the following upon initialisation:

  1. Creates a PyTorch BERT model and initialises it with the provided pre-trained weights. Based on the multi_label parameter, the model class will be BertForSequenceClassification or BertForMultiLabelSequenceClassification.
  2. Assigns the model to the right device, i.e. CUDA-based GPU or CPU. If Nvidia Apex is available, the distributed processing functions of Apex will be utilised.

fast-bert provides a bunch of metrics. For multi-class classification, you will generally use accuracy, whereas for multi-label classification you should consider using accuracy_thresh and/or roc_auc.
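To show what a thresholded accuracy measures in the multi-label case, here is a plain-Python sketch of the idea. It is not the library's implementation: it assumes the predictions are already probabilities and counts every (sample, label) cell where the thresholded prediction matches the target.

```python
def accuracy_thresh(probs, targets, thresh=0.5):
    """Fraction of (sample, label) cells predicted correctly at a threshold."""
    correct = 0
    total = 0
    for prob_row, target_row in zip(probs, targets):
        for prob, target in zip(prob_row, target_row):
            # A label counts as predicted when its probability exceeds thresh.
            correct += int((prob > thresh) == bool(target))
            total += 1
    return correct / total


# Two samples, three labels each (made-up numbers).
probs = [[0.9, 0.2, 0.7], [0.1, 0.6, 0.4]]
targets = [[1, 0, 0], [0, 1, 1]]
score = accuracy_thresh(probs, targets)
```

With these numbers four of the six cells match at the default threshold; lowering the threshold to 0.3 rescues the last label of the second sample.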

Train the model

Start the model training by calling the fit method on the learner object. The method takes the number of epochs, the learning rate and the optimiser schedule_type as input. The following schedule types are supported (again courtesy of the Hugging Face BERT library):

  • none: always returns a learning rate multiplier of 1.
  • warmup_constant: Linearly increases the learning rate multiplier from 0 to 1 over the warmup fraction of training steps, then keeps it equal to 1 after warmup.
  • warmup_linear: Linearly increases the learning rate multiplier from 0 to 1 over the warmup fraction of training steps, then linearly decreases it from 1 to 0 over the remaining 1 - warmup fraction of steps.
  • warmup_cosine: Linearly increases the learning rate multiplier from 0 to 1 over the warmup fraction of training steps, then decreases it from 1 to 0 over the remaining 1 - warmup fraction of steps following a cosine curve. If cycles (default=0.5) is different from the default, the learning rate follows the cosine function after warmup.
  • warmup_cosine_hard_restarts: Linearly increases the learning rate multiplier from 0 to 1 over the warmup fraction of training steps. If cycles (default=1.) is different from the default, the learning rate follows a cosine decay repeated cycles times (with hard restarts).
  • warmup_cosine_warmup_restarts: All training progress is divided into cycles (default=1.) parts of equal length. Every part follows a schedule in which the multiplier first increases linearly from 0 to 1 over the warmup fraction of the training steps and then decreases from 1 to 0 following a cosine curve. Note that the total number of warmup steps over all cycles together is equal to warmup * cycles.
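The first few schedules above can be sketched as plain functions of training progress. This is an illustration written from the descriptions, not the library's code: `x` is assumed to be the fraction of training completed (between 0 and 1), `warmup=0.1` is an arbitrary default chosen for this example, and the returned value is treated as a multiplier on the base learning rate.

```python
import math


def warmup_constant(x, warmup=0.1):
    """Ramp linearly from 0 to 1 during warmup, then stay at 1."""
    if x < warmup:
        return x / warmup
    return 1.0


def warmup_linear(x, warmup=0.1):
    """Ramp linearly up during warmup, then decay linearly to 0 at x = 1."""
    if x < warmup:
        return x / warmup
    return max((1.0 - x) / (1.0 - warmup), 0.0)


def warmup_cosine(x, warmup=0.1, cycles=0.5):
    """Ramp linearly up during warmup, then follow a cosine curve."""
    if x < warmup:
        return x / warmup
    progress = (x - warmup) / (1.0 - warmup)  # progress after warmup, 0..1
    return 0.5 * (1.0 + math.cos(math.pi * cycles * 2.0 * progress))


base_lr = 6e-5  # illustrative base learning rate
lr_halfway = base_lr * warmup_linear(0.55)
```

With the default `cycles=0.5` the cosine schedule completes exactly half a cosine period after warmup, which is why it ends at 0.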

On calling the fit method, the library starts writing progress information to the logger object. It prints the training and validation losses, along with the metrics that you have requested.

To repeat the experiment with different parameters, just create a new learner object and call its fit method. If you have tons of GPU compute, you can run multiple experiments in parallel by instantiating multiple databunch and learner objects at the same time.
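One simple way to organise such repeated runs is to enumerate the parameter combinations up front. The sketch below only builds the list of configurations; the values and dictionary keys are illustrative assumptions, and each configuration would then be used to create a fresh learner and call fit on it.

```python
from itertools import product

# Illustrative search space; pick values that suit your dataset.
learning_rates = [2e-5, 6e-5]
schedule_types = ["warmup_linear", "warmup_cosine"]

experiments = [
    {"epochs": 4, "lr": lr, "schedule_type": schedule}
    for lr, schedule in product(learning_rates, schedule_types)
]

for config in experiments:
    # In a real run: create a new learner here and train it with this config.
    print(config)
```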

Once you are happy with your experiments, call the save_and_reload method on the learner object to persist the model to the file system.

Model Inference

You have two options for running inference with the model.

The first is to call the predict_batch method on the learner object that contains the trained model.

This is convenient if you already have a trained learner object in memory. If you have a persisted trained model and just want to run inference on it, use the second approach, i.e. the predictor object.
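Whichever option you use, the raw predictions usually need a little post-processing. The sketch below assumes, purely for illustration, that each prediction is a list of (label, probability) pairs; verify the actual return format against the library before relying on it. The helper names and the sample values are made up.

```python
def top_label(prediction):
    """Return the (label, probability) pair with the highest probability."""
    return max(prediction, key=lambda pair: pair[1])


def labels_above(prediction, thresh=0.5):
    """Return every label whose probability exceeds the threshold."""
    return [label for label, prob in prediction if prob > thresh]


# Made-up prediction for a single text.
prediction = [("toxic", 0.91), ("obscene", 0.12), ("insult", 0.64)]

best = top_label(prediction)          # suited to multi-class problems
active = labels_above(prediction)     # suited to multi-label problems
```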

And that's how it works. The library repo contains a sample notebook that demonstrates the usage of the library.

Conclusion and next steps

Hopefully this library will help you build and deploy BERT-based NLU models within minutes. In the next part, I will describe how to build your training workflow using fast-bert and deploy your trained model as an endpoint on AWS SageMaker. Watch this space.

The library is still in the early stages of development, and I have a few more ideas for where to take it next. Some of them are:

  1. Add capability to pre-train a BERT language model for custom text corpus
  2. Add other NLU capabilities such as NER, question answering, and more.
  3. Experiment with and include additional improvements to BERT by incorporating some of the key innovations in fast.ai, such as the learning rate finder, freezing model layers and more.
  4. Add capability for automatic hyper-parameter tuning using AWS SageMaker

As mentioned earlier, this is a community-driven initiative. Any help will be very much appreciated.

I would love to hear back from you all. Please feel free to contact me on LinkedIn or Twitter.


