Maximizing BERT model performance
An approach to evaluate a pre-trained BERT model to increase performance
TL;DR
Training a BERT model from scratch on a domain-specific corpus such as biomedical text, with a custom vocabulary generated from that corpus, has proven critical to maximizing model performance in the biomedical domain. This is largely because the biomedical domain has language characteristics that are insufficiently represented in the original pre-trained models released by Google (the self-supervised training of a BERT model is often called pre-training). These domain-specific language characteristics are:
- The biomedical domain has many terms and phrases unique to it — e.g. names of drugs, diseases, and genes. These terms, or more broadly the entity bias of a biomedical corpus toward diseases, drugs, genes, etc., are insufficiently represented, from a model-performance perspective, in the original pre-trained models. The original BERT models (bert-large-cased/uncased, bert-base-cased/uncased) were pre-trained with a vocabulary whose entity bias skews largely toward people, locations, organizations, etc. (see the tokenization sketch after this list).
- Sentence fragments/structures that are unique to the biomedical domain. Examples of this are (1) “<disease name> secondary to <drug name>…”…
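To make the vocabulary point concrete, here is a minimal sketch, using the Hugging Face tokenizers and transformers libraries, that trains a custom WordPiece vocabulary on a domain corpus and compares how a biomedical term is tokenized by the stock bert-base-cased vocabulary versus the custom one. The corpus path, vocabulary size, and example drug name are illustrative assumptions, not details from the article.

```python
# Sketch: train a custom WordPiece vocabulary on a domain corpus and compare
# its tokenization of a biomedical term against the stock bert-base-cased
# vocabulary. "biomedical_corpus.txt" is a hypothetical corpus file, and
# 30,522 is used only because it matches the original BERT vocabulary size.
import os

from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast

# Train a cased WordPiece vocabulary directly on the domain corpus.
custom_tokenizer = BertWordPieceTokenizer(lowercase=False)
custom_tokenizer.train(
    files=["biomedical_corpus.txt"],  # hypothetical domain corpus
    vocab_size=30522,                 # match the original BERT vocab size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
os.makedirs("biomed_vocab", exist_ok=True)
custom_tokenizer.save_model("biomed_vocab")  # writes biomed_vocab/vocab.txt

# Load the stock vocabulary for comparison.
stock_tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

term = "imatinib"  # an example drug name, likely absent from the stock vocab
print("bert-base-cased :", stock_tokenizer.tokenize(term))
# the stock vocabulary typically fragments the drug name into sub-word pieces
print("custom vocab    :", custom_tokenizer.encode(term).tokens)
# with a large enough domain corpus, the term is more likely to stay whole
```

Whether a given term survives as a single vocabulary entry depends on its frequency in the training corpus; the point of the sketch is simply that the vocabulary is learned from the domain text rather than inherited from the original BERT models.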