
Maximizing BERT model performance

An approach to evaluating a pre-trained BERT model to increase performance

Ajit Rajasekharan
Published in TDS Archive · 14 min read · Nov 4, 2020


Figure 1. Training pathways to maximize BERT model performance. For application domains where entity types such as person, location, and organization are dominant, training pathways 1a-1d suffice: start with a publicly released BERT model (bert-base/large, cased/uncased, or the tiny BERT variants) and optionally train it further (1c, continual pre-training) before fine-tuning it for a specific task (1d, supervised training with labeled data). For a domain where person, location, organization, etc. are not the dominant entity types, continual pre-training (1c) of the original BERT model on a domain-specific corpus followed by fine-tuning may not boost performance as much as pathway 2a-2d, because the vocabulary in the 1a-1d pathway is still the original BERT vocabulary, with an entity bias towards people, locations, organizations, etc. Pathway 2a-2d instead trains a BERT model from scratch using a vocabulary generated from the domain-specific corpus. Note: any form of model training (pre-training, continual pre-training, or fine-tuning) modifies both the model weights and the vocabulary vectors; the different shades of the same color for the model (shades of beige) and the vocabulary (shades of blue/green) across the training stages, from left to right, illustrate this. The box labeled with a "?" is the focus of this article: evaluating a pre-trained or continually pre-trained model to improve model performance. Image by Author
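To make pathway 1c-1d concrete, here is a minimal sketch of continual pre-training with the masked-language-modeling objective, assuming the Hugging Face transformers library; the corpus path, block size, and training hyperparameters are placeholders rather than values from the article.

```python
# Sketch of pathway 1c (continual pre-training of a released checkpoint);
# file path and hyperparameters below are illustrative assumptions.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

# Pathway 1a/1b: start from a publicly released BERT model and its vocabulary.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Domain-specific corpus, one sentence per line (placeholder path).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="domain_corpus.txt",
                                block_size=128)

# Masked-language-modeling objective used for (continual) pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-continued-pretraining",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)

Trainer(model=model, args=args,
        data_collator=collator,
        train_dataset=dataset).train()
# The saved checkpoint is then fine-tuned on a labeled task (pathway 1d).
```

Pathway 2a-2d differs only in its starting point: the vocabulary and model are built from scratch on the domain corpus instead of being loaded from a released checkpoint.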

TL;DR

Training a BERT model from scratch on a domain-specific corpus, such as a biomedical corpus, with a custom vocabulary generated from that corpus has proven critical to maximizing model performance in the biomedical domain. This is largely because of language characteristics unique to the biomedical space that are insufficiently represented in the original pre-trained models released by Google (self-supervised training of a BERT model is often called pre-training). These domain-specific language characteristics are:

  • Domain-specific vocabulary, e.g. names of drugs, diseases, genes, etc. These terms, or more broadly the entity bias of a biomedical corpus towards diseases, drugs, genes, etc., have insufficient representation, from a performance-maximization perspective, in the original pre-trained models. The original BERT models (bert-large-cased/uncased, bert-base-cased/uncased) were pre-trained with a vocabulary whose entity bias is largely skewed towards people, locations, organizations, etc. (see the tokenization sketch after this list).
  • Examples of this are (1) “<disease name> secondary to <drug name>…”
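The vocabulary bias described above is easy to observe by tokenizing a biomedical term with the stock vocabulary, and pathway 2a begins by learning a new WordPiece vocabulary from the domain corpus. Below is an illustrative sketch (not from the article), assuming the Hugging Face transformers and tokenizers libraries; the corpus file, vocabulary size, and output directory are placeholder assumptions.

```python
# Illustrative sketch: show how the stock BERT vocabulary fragments a
# biomedical term, then train a domain-specific WordPiece vocabulary.
import os
from transformers import BertTokenizerFast
from tokenizers import BertWordPieceTokenizer

stock = BertTokenizerFast.from_pretrained("bert-base-cased")
# A drug name absent from the stock vocabulary is split into several subwords;
# the exact split depends on the vocabulary, e.g. ['ace', '##tam', '##ino', '##phen'].
print(stock.tokenize("acetaminophen"))

# Pathway 2a: build a new WordPiece vocabulary from the domain corpus
# (file path and vocabulary size are placeholders).
os.makedirs("domain-vocab", exist_ok=True)
wp = BertWordPieceTokenizer(lowercase=False)
wp.train(files=["domain_corpus.txt"], vocab_size=30522, min_frequency=2)
wp.save_model("domain-vocab")  # writes the vocab.txt used to pre-train from scratch
```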
