Pretrain your custom BERT model

Teja Gollapudi
VMware Data & ML Blog
Jul 11, 2022 · 3 min read

BERT has become the go-to language model for many natural language processing (NLP) use cases. You can easily access BERT from the HuggingFace transformers library and fine-tune it for a downstream task. But there are limits in niche domains, where a standard version of BERT yields suboptimal results because it isn't familiar with the problem domain.

At VMware, we deal with many technical terms (e.g., virtualization), product names (e.g., vSphere, vRealize), and abbreviations (e.g., VCSA, which stands for vCenter Server Appliance) that are not part of BERT's vocabulary. BERT uses WordPiece tokenization to deal with Out-Of-Vocabulary (OOV) terms, but WordPiece tokenization alone cannot solve the issue, as the sub-tokens of these words will still lack context and meaning: see Weaknesses of WordPiece Tokenization.
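You can see how BERT's stock WordPiece vocabulary fragments these terms with a few lines of the HuggingFace transformers library (a minimal sketch; the exact sub-tokens depend on the vocabulary of the checkpoint you load):

from transformers import BertTokenizer

# Load the standard BERT vocabulary (bert-base-uncased, purely for illustration)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Domain-specific terms that are unlikely to appear in BERT's WordPiece vocabulary
for term in ["vSphere", "vRealize", "Tanzu", "VCSA"]:
    # OOV words get split into sub-tokens (pieces prefixed with '##'),
    # which carry little domain meaning on their own
    print(term, "->", tokenizer.tokenize(term))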

[Figure: Comparing Masked Token prediction outputs of BERT Large and vBERT Large (custom BERT)]

While a word such as Tanzu has no meaning to the general public, a person familiar with VMware products would know that Tanzu is a suite of products related to Kubernetes. Similarly, BERT trained on standard English datasets (the general public in our analogy) struggles to produce meaningful embeddings for these domain-specific terms.

To adapt BERT to VMware's domain, we perform additional pretraining on VMware content so the model learns the new words and their contexts, making it suitable for VMware-specific downstream tasks.

BERT Pretraining Library

To help pretrain a custom VMware-specific version of BERT (vBERT), we developed a Python library at VMware's R&D AI Lab and observed improved results across multiple internal benchmarks (classification, information retrieval, etc.).

The BERT Pretraining library uses Transformers, PyTorch, and HuggingFace Accelerate to pretrain your custom BERT model on multiple GPUs with a few lines of code. Now all you need to pretrain your own BERT model is a text corpus and some compute!

Note: the authors of BERT were kind enough to release the code for pretraining BERT, but it's written in TensorFlow 1.x and requires you to modify or write a lot of code if you want to customize the pretraining process. We still use their code to create the pretraining TFRecord file for the training task.

The library has been tested with Python 3.7 and 3.8 and supports multi-GPU and TPU training!

To install the library, run:

pip install git+https://github.com/vmware-labs/bert-pretraining

You can run pretraining for your model in 3 simple steps:

  1. Import the run_pretraining function and the Pretraining_Config class
  2. Tailor the Pretraining_Config object to your pretraining needs
  3. Start pretraining! (or additional pretraining)

Let's walk through a code sample for creating a demo version of BERT.
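The sketch below follows the three steps above. The run_pretraining function and Pretraining_Config class come from the library, but the import path and the configuration fields shown here are illustrative assumptions on my part; the repository's README documents the actual parameters.

# Step 1: import the pretraining entry points (module path assumed for illustration)
from bert_pretraining import run_pretraining, Pretraining_Config

# Step 2: tailor the config to your run.
# The attribute names below are placeholders; check the README for the
# fields Pretraining_Config actually exposes.
config = Pretraining_Config()
config.model = "bert-base-uncased"              # checkpoint to continue pretraining from
config.input_file = "vmware_corpus.tf_record"   # TFRecord built with create_pretraining_data.py
config.output_dir = "./vbert-demo"              # where checkpoints get written
config.train_batch_size = 32
config.num_train_steps = 100_000
config.learning_rate = 2e-5

# Step 3: start (additional) pretraining
run_pretraining(config)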

If you want to create a pretraining corpus:

  1. Create a single text file with sentences separated by '\n' and documents separated by '\n\n' (see the sketch after this list).
  2. Use create_pretraining_data.py from https://github.com/google-research/bert to create the input corpus tf_record file.
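Here is a rough sketch of step 1 in Python, with a typical create_pretraining_data.py invocation for step 2 shown as a comment (the file names are placeholders; the script's flags are documented in the google-research/bert repository):

# Step 1: write the raw corpus: one sentence per line, a blank line between documents
documents = [
    ["vSphere is VMware's virtualization platform.",
     "It is managed through the vCenter Server Appliance (VCSA)."],
    ["Tanzu is a suite of products related to Kubernetes."],
]

with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join("\n".join(doc) for doc in documents))

# Step 2: convert the corpus into a TFRecord with Google's script, for example:
#   python create_pretraining_data.py \
#       --input_file=corpus.txt \
#       --output_file=vmware_corpus.tf_record \
#       --vocab_file=vocab.txt \
#       --do_lower_case=True \
#       --max_seq_length=128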

Please go through the repository's README for more details: https://github.com/vmware-labs/bert-pretraining/blob/main/README.md.

Bias and Fairness

If you perform additional pretraining on the out-of-the-box BERT model, you might encode the biases embedded within BERT or within your pretraining corpus. It would be prudent to consider the implications of such biases on downstream tasks: see Measuring Bias in Contextualized Word Representations and Dirty Secret of BookCorpus.

Related Work

Want to learn more about using WordPiece tokenization with real-world data?

Need an OSS tool to help with data annotation for the downstream task after you finish your pretraining?

Thanks to Rick Battle for contributing to the project and creating the initial iterations of vBERT! Thank you to Steve Liang, Julia Li, and Giampiero for editorial help!
