Training BERT from Scratch on Your Custom Domain Data: A Step-by-Step Guide with Amazon SageMaker

Arun Shankar
25 min read · Jan 18, 2023
My own custom BERT

TLDR: This article provides a comprehensive guide on how to train a language model like BERT for a specific domain using SageMaker on AWS. Readers will learn how to acquire and prepare raw data, create custom vocabularies and tokenizers, perform intermediate training, compare the custom domain-specific model against traditionally fine-tuned BERT models on downstream tasks such as text classification, and evaluate the custom model using tasks like mask filling. Additionally, this article provides readers with a detailed blueprint of a resilient, scalable, and distributed end-to-end architecture for a common modern NLP need, built purely from SageMaker components. The article includes 12 detailed Jupyter notebooks and supporting scripts for readers to follow along and implement the techniques discussed. Key concepts covered include transfer learning, language models, intermediate training, perplexity, distributed training, and catastrophic forgetting.

Natural language understanding (NLU) is a specialized subfield of artificial intelligence (AI) that concentrates on a computer’s capacity to comprehend, interpret, and respond to human language in a manner analogous to human cognition. It is predominantly employed to process and analyze large-scale natural language data, such as spoken or written text, and can be utilized for various applications such as chatbots, language translation, and sentiment analysis. NLU is closely related to natural language processing (NLP), which focuses on a computer’s ability to analyze and manipulate language data; however, NLU extends beyond the mere analysis of words and sentences in a text by taking into account the intended meaning and context of the language. The domain of NLP and NLU has undergone a significant transformation with the advent of deep learning and the incorporation of transformers and BERT (Bidirectional Encoder Representations from Transformers) in recent years.

Transformers are a class of deep neural network architectures introduced in 2017 by researchers at Google. These networks employ attention mechanisms to process input data in a way that enables them to capture long-range dependencies and relationships within the data, which is essential for tasks such as language translation and summarization. BERT, on the other hand, is a specific instantiation of the transformer architecture that is specifically designed for NLP and NLU tasks. BERT has been demonstrated to significantly surpass previous models on a multitude of NLP benchmarks and has been extensively adopted by researchers and industry. One of the major ways in which transformers and BERT have changed the landscape of NLP and NLU is by enabling the training of highly accurate large language models (LLMs) with minimal human intervention.

LLMs are a form of machine learning (ML) models that are typically trained on an extensive amount of text data, such as books, articles, and other written materials, in order to learn the statistical patterns and relationships within the language. LLMs can be employed for a variety of NLP tasks and have exhibited impressive performance on a range of benchmarks. Traditional NLP and NLU systems necessitate a considerable quantity of labeled training data and meticulous feature engineering to achieve desirable performance. However, BERT and other transformer-based models can learn to execute multiple language tasks with minimal or no labeled data and can be fine-tuned for specific tasks with relatively minimal effort. This has facilitated the development and deployment of NLP and NLU systems, and has enabled the creation of a new generation of sophisticated, language-aware applications.

Transfer learning is the methodology that enables the fine-tuning of LLMs for specific tasks or domains using relatively insignificant amounts of labeled training data. This is achieved by commencing with an LLM that has already been trained on a vast amount of general-purpose text data and then adjusting the model’s parameters based on the new training data in order to augment its performance on the specific task or domain. Transfer learning can significantly reduce the quantity of labeled training data and computational resources required to train an LLM for a specific task and can often enhance the model’s performance on that task.

Amazon SageMaker is a fully-managed service that empowers data scientists and developers to swiftly construct and train ML models, and subsequently deploy them into a production-ready hosted environment.

HuggingFace, on the other hand, is a vast open-source community hub for pre-trained deep learning models, primarily aimed at NLP. Their core method of operation for NLP revolves around the utilization of Transformers. SageMaker natively supports processing, training, fine-tuning, and hosting Hugging Face models for NLU. This integration is delivered through the Hugging Face AWS Deep Learning Containers (DLCs). These containers include the Hugging Face Transformers, Tokenizers, and Datasets libraries, which allow you to use these resources for processing datasets, training models, and hosting inference endpoints.

The purpose of this article is to demonstrate the end-to-end workflow of creating an LLM such as BERT from scratch. The post centers on the individual steps that must be undertaken to pre-train a custom BERT model using your own closed-domain, industry-specific data. Here, you will learn how to extract your own custom vocabulary and create custom tokenizers from your data using SageMaker with ease. You will also learn how to perform intermediate pre-training of your LLM by leveraging the custom vocabulary and tokenizer you created previously. Finally, you will learn how to utilize your custom-built BERT to handle an NLP downstream task of text classification.

Intermediate training

When an LLM is trained using a large, generic corpus, it is referred to as “pre-training”. When the same LLM, after pre-training, is adapted to a particular downstream task, it is referred to as “fine-tuning”. During both the pre-training and fine-tuning phases, there are updates to the model weights. If your data is domain-specific (e.g. legal, finance, biomedical, etc.) and substantially differs from the “standard” open-domain corpus that was used to train the LLM, you have two options to incorporate this custom knowledge into the LLM:

i) train your own custom LLM from scratch via “intermediate” pre-training

ii) leverage a domain-specific LLM that is publicly available

Some examples of domain-specific LLMs include LEGAL-BERT, trained on legislation, court cases, and contracts; FinBERT, trained on a financial services corpus; and BioBERT, trained on biomedical literature. All of these models are available for public use in the HuggingFace hub for free.

“Intermediate training” pertains to the process of continuing to train the original pre-trained BERT model on a new, intermediate dataset in order to enhance the model’s performance on a specific domain. This is typically conducted in between the initial training of the model on a large, general-purpose open-domain dataset (Wikipedia, book corpus, etc.) and the final fine-tuning of the model on a small, task-specific dataset (open or closed domain). This further pre-training is useful as it allows the model to learn more about the specific domain that it will be used for.

Intermediate training of BERT on custom domain data usually performs better than just fine-tuning BERT on the same data, for several reasons. One reason is that continuing to train the model on the custom data allows the model to learn the specific patterns and relationships within the data more thoroughly. This can lead to a more accurate and effective model, as it will be better suited to the specific task or domain. Intermediate training is also frequently referred to as “continued pre-training”, “further pre-training”, and “domain adaptation” in the scientific literature. Normally, pre-training from scratch is recommended only if you have access to a large closed-domain corpus along with the substantial compute resources needed for pre-training, mostly in the form of GPUs.

Types of training in transfer learning

Intermediate pre-training is an unsupervised learning task that is akin to the initial pre-training task, where your dataset does not have to be labeled. Intermediate training is typically followed by the standard fine-tuning for a downstream task, which utilizes labeled data, and is a supervised learning task. LLMs can be developed using two modeling strategies:

i) Masked Language Modeling (MLM)

ii) Causal Language Modeling (CLM)

The distinction between MLM and CLM is that MLM is a type of pre-training that is utilized to train transformer-based language models, such as BERT, while CLM is a type of pre-training that is utilized to train autoregressive language models, such as GPT (Generative Pre-trained Transformer). While MLM involves masking a portion of the input text and then training the model to predict the masked words based on the context of the surrounding words, CLM involves training the model to predict the next word in a sequence based on the previous words in the sequence.

For intermediate pre-training, you will be using BERT for the exercises in this article. BERT is a transformer-based model, and thus, you can either use MLM standalone or use MLM in combination with Next Sentence Prediction (NSP) to incorporate knowledge about your custom corpus into the model. NSP is a task specific to BERT and it helps BERT learn about the relationships between sentences by predicting if a given sentence follows the previous sentence or not. Another crucial consideration when embarking on intermediate pre-training is the starting point of the process. One may choose to initiate the process with randomly initialized weights or with the end weights of the initial pre-training phase. Opting for the former approach, where the model parameters of BERT are initialized randomly, resets all the prior learning that BERT has undergone to acquire common language understanding.

Given that this initial pre-training is based on a vast amount of general-purpose open-domain knowledge sources such as Wikipedia and the book corpus, this necessitates that the model must now learn the patterns and relationships of common language from scratch and may require an exorbitant amount of your custom data, equivalent to the original volume of open-domain data that was used for the initial pre-training phase. Furthermore, this also implies the need for significant computational resources (GPUs) to attain optimal performance. Conversely, when continuing to pre-train BERT on custom data initialized with the weights of an already pre-trained BERT model, we preserve the original learning of common language understanding. The model can now focus on learning the specific patterns and relationships within the custom data. This considerably reduces the amount of data and computational resources required to pre-train a custom language model, and often leads to better performance on tasks and queries related to your custom domain knowledge.
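As a brief illustration (a minimal sketch, not taken from the accompanying notebooks, assuming the Hugging Face transformers library), the two starting points differ only in how the model object is constructed:

from transformers import BertConfig, BertForMaskedLM

# Option A: random initialization -- discards all of BERT's general language knowledge,
# so the model must relearn common language patterns entirely from the custom corpus.
scratch_model = BertForMaskedLM(BertConfig())   # bert-base sized, randomly initialized

# Option B: start from the published checkpoint -- preserves general language understanding,
# so intermediate pre-training only has to learn domain-specific patterns on top of it.
continued_model = BertForMaskedLM.from_pretrained("bert-base-uncased")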

In this article, you will delve into the process of creating a custom LLM from scratch via intermediate pre-training. To demonstrate this methodology, you will undertake the following steps:

I. Acquiring a dataset that is pertinent to your custom domain. This dataset should be as extensive and varied as possible, and should comprise examples of the language and concepts that are unique to your custom domain.

II. Pre-processing the dataset to prepare it for intermediate pre-training. This includes extracting a custom vocabulary and creating a custom tokenizer. Once the tokenizer is established, tokenizing the text and converting it to an appropriate input format for BERT MLM.

III. After the dataset is pre-processed, you will commence the intermediate pre-training process for BERT MLM utilizing the end weights of the pre-trained model as a starting point.

IV. Once the custom language model is pre-trained, you will evaluate this model against the original pre-trained BERT. You will also evaluate this custom BERT against the original BERT that is fine-tuned on your custom data, meaning, this fine-tuning employs the default vocabulary and tokenizer that are part of the original BERT.

V. Ultimately, you will fine-tune the custom BERT on a smaller labeled dataset that is related to a specific NLP downstream task, in this case, text classification. You will also learn how to evaluate the performance of this fine-tuned custom BERT against the original BERT model which is also fine-tuned on the same dataset for text classification.

As an example, you will apply the aforementioned steps to create a custom BERT named “CovidBERT” that will be trained on a Kaggle dataset of news articles that relate to the coronavirus and were collected during the global pandemic. The original BERT model was released in late 2018, thus the model was never trained on coronavirus research, nor does it even have vocabulary entries for some of the key terms we would need for our application, like “coronavirus”, “COVID-19”, “COVID”, etc.

In summary, the process of training a custom BERT model from scratch necessitates the acquisition of a substantial corpus of text data that is relevant to the specific domain of interest. This data must be preprocessed, including the extraction of a custom vocabulary and the creation of a bespoke tokenizer, before being utilized for intermediate pre-training of the BERT model. The performance of the custom BERT model must then be evaluated and fine-tuned as necessary to enhance its performance. This process is computationally demanding and may necessitate the utilization of specialized hardware, such as GPUs, to be completed in a reasonable amount of time. Amazon SageMaker is a valuable tool that can be utilized to scale each of these steps and streamline the training process.

I. Acquiring the dataset

For this article, you will utilize a Kaggle dataset of news articles pertaining to the COVID-19 pandemic, which was collected over a period of two years, from the onset of the virus in early 2020 to the decline of average daily cases in the spring of 2022. The dataset comprises approximately half a million articles, encapsulating diverse information pertaining to the virus, including its various waves, the emergence of different variants, and other noteworthy events. The dataset consists of three columns: title, content, and category. The title column denotes the headline of the news article, the content column contains the text of the article, and the category column denotes the overarching context of the news article at a high level, comprising five categories: business; environmental, social, and governance (ESG); general; science; and technology.

To proceed with the subsequent stages, you will need to prepare the dataset in its raw form, making it ready for both BERT intermediate pre-training and the final fine-tuning for the downstream task. For intermediate pre-training, you can combine all the news articles into a single text file named ‘covid_articles.txt’, utilizing only the text content from the ‘content’ column. Each line in this file will represent an article. It is noteworthy that this dataset does not have labels, as intermediate pre-training is an unsupervised task that involves MLM. To derive the dataset for the downstream text classification task, you can select the ‘title’ and ‘category’ columns from the acquired raw dataset and further filter the selected headlines by matching on keywords such as ‘virus’, ‘COVID’, ‘pandemic’, and ‘variant’ to condense it and make it more specific to the custom domain of the COVID-19 pandemic.
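The preparation described above can be sketched in a few lines of pandas; the raw CSV file name ('covid_news.csv') and the output file for the classification split are assumptions, while the column names and keyword filter follow the description in this section:

import pandas as pd

df = pd.read_csv("covid_news.csv")   # hypothetical name for the raw Kaggle export

# Corpus for intermediate pre-training: one article per line, text content only.
with open("covid_articles.txt", "w") as f:
    for text in df["content"].dropna():
        f.write(text.replace("\n", " ").strip() + "\n")

# Labeled dataset for the downstream classifier: headline + category,
# filtered to pandemic-related keywords to keep it specific to the custom domain.
keywords = ["virus", "covid", "pandemic", "variant"]
mask = df["title"].str.lower().str.contains("|".join(keywords), na=False)
clf_df = df.loc[mask, ["title", "category"]]
clf_df.to_csv("covid_headlines_clf.csv", index=False)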

Here, the category column serves as the target variable for the downstream text classification task. The process of acquiring and extracting the relevant dataset for intermediate pre-training and fine-tuning can be efficiently executed using Amazon SageMaker Studio’s web-based Integrated Development Environment (IDE). The IDE includes a number of features that make it easier for data scientists and developers to build, train, and deploy ML models. Among these features is the SageMaker Studio Notebook, a web-based Jupyter notebook that allows users to interactively write and run code, including code for working with ML models. The notebooks come pre-installed with many popular ML libraries, enabling users to get started quickly without the need for additional software installations. The notebook for data extraction can be executed on an ml.m5.2xlarge instance, and is set to use the Python 3 (Data Science) kernel. The notebook 01-prepare-datasets.ipynb in the accompanying code samples can be utilized to download and extract the datasets necessary for the subsequent steps of this blog post.

II. Customizing vocabulary

Let’s say you work in the pharmaceutical industry and wish to develop an NLP application that comprehends and categorizes news content pertaining to COVID-19. An apparent method would be to take the original BERT, available in the HuggingFace hub, and fine-tune it directly using the dataset extracted for classification in the previous step. However, as the original BERT was released prior to the emergence of the pandemic, the model may possess a deficient understanding of the technical nomenclature specific to COVID-19.

To augment the capabilities of your NLP application, two options are available: i) continue pre-training BERT on your domain-specific data with BERT’s default vocabulary and tokenizer, or ii) continue pre-training BERT on your domain-specific data with your own custom vocabulary and tokenizer. When training a BERT model from scratch, it is often necessary to employ a custom vocabulary that is attuned to the domain for which the model is being trained. This is because the original pre-trained BERT model uses a vocabulary based on general-purpose text data, such as Wikipedia and the book corpus, which may not be appropriate for your specific domain. Therefore, it is essential to utilize a custom vocabulary that is more suitable for your domain. In this case, as you will be training a BERT model on a dataset of COVID-19-related news articles, the custom vocabulary you extract will comprise terminology and jargon that are not included in the vocabulary of the original model.

Utilizing a custom vocabulary for BERT intermediate training offers several advantages. Firstly, it enhances the performance of the model on the specific domain it is being trained on by allowing the model to better capture the patterns and relationships within the data and produce more accurate predictions. Secondly, using a custom vocabulary can minimize the size of the model, thereby simplifying the training and deployment process by reducing the amount of computation and memory required. In this exercise, we will not be reducing the size of the vocabulary and will maintain the original count of 30,522 tokens.
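As a rough sketch of what the vocabulary extraction inside the processing script might look like (assuming the Hugging Face tokenizers library; the corpus file name follows the earlier preparation step):

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["covid_articles.txt"],
    vocab_size=30522,                      # keep the original BERT vocabulary size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")                  # writes vocab.txt for the new custom tokenizer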

Extracting custom vocabulary

The above figure illustrates the process of extracting a custom vocabulary using a SageMaker Processing job. Amazon SageMaker Processing is a fully managed service that allows for easy pre- and post-processing of data for ML on SageMaker. This service offers processors for various popular data processing and ML frameworks such as Spark, Dask, Sklearn, and more. Additionally, users are able to utilize their own custom containers with SageMaker Processing. In 2022, SageMaker Processing launched support for HuggingFace, enabling the creation of natural language processing pipelines through the use of pre-installed and optimized processors for common HuggingFace data transformations, including vocabulary extraction and text tokenization. The SageMaker Python SDK in conjunction with SageMaker Studio notebooks can be utilized to create and execute SageMaker Processing jobs, making it easy to integrate ML into existing applications and workflows. In the current use case, the processing job uses a single ml.g4dn.xlarge instance with a single Nvidia GPU and completes the custom vocabulary extraction in approximately 15 minutes.
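A hedged sketch of how such a processing job can be launched from a Studio notebook with the SageMaker Python SDK is shown below; the processing script name, S3 paths, and framework versions are illustrative assumptions rather than the exact values used in the accompanying notebooks:

import sagemaker
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = HuggingFaceProcessor(
    role=sagemaker.get_execution_role(),
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    transformers_version="4.17",           # example DLC versions (assumption)
    pytorch_version="1.10",
    py_version="py38",
)

processor.run(
    code="extract_vocabulary.py",          # hypothetical processing script
    inputs=[ProcessingInput(source="s3://<bucket>/covid_articles.txt",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://<bucket>/vocab/")],
)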

BERT’s default tokenizer, using the default vocabulary, tokenizes the test sentence “covid19 is a virus” to [‘[CLS]’, ‘co’, ‘##vid’, ‘19’, ‘is’, ‘a’, ‘virus’]. With the newly extracted custom vocabulary, the new tokenizer is now able to encode the same sentence differently, as [‘[CLS]’, ‘covid19’, ‘is’, ‘a’, ‘virus’, ‘[SEP]’]. We can also obtain a token ID for the word “covid19” using the newly created custom tokenizer. The notebook 01-extract-vocabulary.ipynb in the accompanying code samples demonstrates the entire custom vocabulary extraction process using a SageMaker Processing job.
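A minimal sketch of this comparison, assuming the extracted custom vocabulary was saved as vocab.txt:

from transformers import BertTokenizerFast

default_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
custom_tok = BertTokenizerFast(vocab_file="vocab.txt", model_max_length=512)

sentence = "covid19 is a virus"
print(default_tok.tokenize(sentence))               # 'covid19' gets split into word pieces
print(custom_tok.tokenize(sentence))                # ['covid19', 'is', 'a', 'virus'] with the custom vocabulary
print(custom_tok.convert_tokens_to_ids("covid19"))  # the new token ID for 'covid19'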

III. Tokenizing datasets

Tokenization is an essential pre-processing step when utilizing BERT or other LLMs. BERT processes text in the form of tokens, which are individual elements of the input text such as words, punctuation marks, or special tokens like [CLS] and [SEP]. Tokenization involves splitting the input text into these individual tokens, which is necessary because BERT models are trained on vast amounts of text data and the data must be divided into individual tokens before it can be input into the model. This enables the model to efficiently process and learn from the text. Additionally, tokenization helps the model understand the structure of the input text, such as sentence boundaries and word relationships, which is crucial for many natural language processing tasks.

A benefit of separating the tokenization step from the training of a BERT model is that it enables the use of the same tokenizer for multiple different training runs. This can be useful if you want to train multiple BERT models with different hyperparameters or on different datasets but want to use the same tokenization scheme for all of them. By pre-processing the data and generating the tokens before training, you can avoid repeating the tokenization step for each training run, thus saving time and computational resources.

Tokenizing your datasets

Having already extracted the custom vocabulary in the previous step, it is possible to use this vocabulary in conjunction with the covid_articles.txt file as inputs for another SageMaker Processing job. This job is tasked with collating, splitting, tokenizing, concatenating, and chunking the 477,537 news articles into data splits that are ready for intermediate training. For this tokenization job, an ml.g4dn.xlarge instance with a single Nvidia GPU is utilized.

As part of this job, the BERT tokenizer is re-created using the extracted custom vocabulary. The vocabulary size is set to 30,522 and the model_max_length parameter is set to 512. The input data is read and collated to create mini-batches for masked language modeling on the full set of news articles. The collated dataset is then split into train and validation sets with a 90:10 ratio. Once the dataset is split, it is tokenized using the re-created custom tokenizer. The tokenized datasets include all the necessary components for the pre-training step, such as input_ids, token_type_ids, attention_mask, and word_ids. As a final step, the tokenized datasets are concatenated and chunked into fixed lengths. After this process, the number of rows in the train set increases to 3,918,934 and the validation set grows to 441,959. The chunked datasets are then saved to Amazon S3. The overall processing job completes in approximately 6 to 7 minutes. The workflow for the tokenization step is depicted in the figure above.
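The core of this tokenize-then-chunk logic can be sketched as follows with the datasets library; the split seed, file paths, and function names are illustrative, while the chunk size and 90:10 split follow the values quoted above:

from datasets import load_dataset
from transformers import BertTokenizerFast

chunk_size = 128
custom_tok = BertTokenizerFast(vocab_file="vocab.txt", model_max_length=512)

# One news article per line; hold out 10% of the collated data for validation.
raw = load_dataset("text", data_files="covid_articles.txt", split="train")
splits = raw.train_test_split(test_size=0.1, seed=42)

def tokenize_fn(batch):
    return custom_tok(batch["text"], return_special_tokens_mask=True)

tokenized = splits.map(tokenize_fn, batched=True, remove_columns=["text"])

def group_texts(examples):
    # Concatenate all token sequences, then slice them into fixed-length chunks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_len = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    return {k: [v[i:i + chunk_size] for i in range(0, total_len, chunk_size)]
            for k, v in concatenated.items()}

lm_datasets = tokenized.map(group_texts, batched=True)
lm_datasets.save_to_disk("tokenized-covid-articles")   # later uploaded to Amazon S3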

Notebook 02-preprocess-mlm-custom.ipynb in the accompanying code samples walks through the process of tokenizing the raw news articles and preparing them for the subsequent step of masked language modeling (MLM). Notebook 03-preprocess-mlm-oob.ipynb utilizes the default vocabulary and tokenizer of the original BERT to tokenize the dataset. This version of the tokenized dataset will be used later for comparative analysis. Similarly, Notebooks 04-preprocess-clf-custom.ipynb and 05-preprocess-clf-oob.ipynb are utilized for tokenizing the classification dataset. The first notebook employs the custom vocabulary, while the second utilizes the default vocabulary of the original BERT. These tokenized datasets will be used later for comparing the original BERT against CovidBERT on the classification task.

IV. Tailoring BERT for Your Domain

Most contemporary NLP systems employ a standard approach for training new models for various use cases, known as “first pre-train then fine-tune.” The objective of pre-training is to leverage large amounts of unlabeled text and construct a general model of language understanding before fine-tuning it for specific NLP tasks such as text classification, summarization, and machine translation, among others. An extension of this standard process is intermediate pre-training, where the model is further pre-trained on a closed domain dataset, thereby teaching the model to understand concepts specific to that domain. This step of intermediate pre-training is a common practice in industrial AI, where one may have a large trove of proprietary data with unique characteristics that necessitate an extended pre-training step.

For intermediate pre-training, we can start with BertForMaskedLM, the masked language modeling (MLM) variant for BERT, which is a model class offered in the transformers library from HuggingFace. Simply put, this is the BERT model with a masked language modeling head on top, allowing us to continue the pre-training procedure and generate the CovidBERT. However, it is important to note that this model class does not include the next sentence prediction (NSP) task, which can be found under the BertForPreTraining class that includes both an NSP head and an MLM head.
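A minimal sketch of what the MLM training script might contain, using the Trainer API; the dataset paths correspond to SageMaker training channels, and the hyperparameter values shown here are placeholders rather than the article's exact configuration:

from datasets import load_from_disk
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast(vocab_file="vocab.txt", model_max_length=512)
model = BertForMaskedLM.from_pretrained("bert-base-uncased")   # MLM head only, no NSP head

# Chunked datasets produced by the processing step, mounted as SageMaker channels.
train_ds = load_from_disk("/opt/ml/input/data/train")
eval_ds = load_from_disk("/opt/ml/input/data/validation")

# Randomly mask 15% of the tokens in each batch -- the standard BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="/opt/ml/model",
                         per_device_train_batch_size=32,
                         num_train_epochs=50,
                         evaluation_strategy="epoch")

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
trainer.save_model("/opt/ml/model")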

Intermediate training

For this step, we will utilize SageMaker Training, which offers efficient and streamlined methods for training large deep learning models in the cloud. SageMaker utilizes partitioning algorithms that automatically split large deep learning models and training datasets across multiple AWS GPU instances, reducing the time required for manual partitioning. This is achieved through two techniques:

i) data parallelism

ii) model parallelism

Model parallelism divides models that are too large to fit on a single GPU into smaller parts, distributing them across multiple GPUs for training, while data parallelism splits large datasets to train concurrently, thereby improving training speed. SageMaker’s distributed training libraries provide both data parallel and model parallel training strategies, utilizing software and hardware technologies to enhance inter-GPU and inter-node communications, and providing built-in options that require minimal changes to training scripts.

For our intermediate training, we will use the data parallelism strategy, utilizing 4 GPU instances of type ml.p4d.24xlarge. Each of these instances comes equipped with 8 Nvidia A100 GPUs and a total of 96 vCPUs. The context size for the BERT tokenizer is set to 512, and the chunk size is set to 128. The entire training process is carried out over a total of 50 epochs, with a batch size of 32. The distribution strategy is set to “data parallelism”. The training completes in approximately 8 hours, and at the end of the last epoch, the perplexity for the model is reduced to 4.82 from its original value of 35145.9.
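Launching this distributed job from a Studio notebook might look roughly like the following sketch, using the SageMaker Hugging Face estimator with the SageMaker data-parallel library enabled; the entry-point script name, S3 paths, and framework versions are assumptions:

import sagemaker
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="pretrain_mlm.py",              # hypothetical training script
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",
    instance_count=4,                            # 4 instances x 8 A100 GPUs each
    transformers_version="4.17",                 # example DLC versions (assumption)
    pytorch_version="1.10",
    py_version="py38",
    hyperparameters={"epochs": 50, "per_device_train_batch_size": 32, "chunk_size": 128},
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"train": "s3://<bucket>/tokenized/train",
               "validation": "s3://<bucket>/tokenized/validation"})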

Perplexity is a measure of how well a probability model, such as a statistical language model, is able to predict a sample. It is often utilized as a metric to evaluate the performance of a language model, where a lower perplexity value signifies that the model is more proficient at generating text that is similar to the sample.
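Concretely, perplexity is the exponential of the average cross-entropy loss on the evaluation set, so it can be computed directly from the evaluation loss (the sketch below assumes the trainer object from the earlier MLM training sketch):

import math

eval_metrics = trainer.evaluate()                 # returns a dict containing 'eval_loss'
perplexity = math.exp(eval_metrics["eval_loss"])  # perplexity = exp(cross-entropy loss)
print(f"Perplexity: {perplexity:.2f}")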

Catastrophic forgetting

Catastrophic forgetting is a phenomenon in which a model that has been trained on one dataset forgets the patterns and relationships it has learned when it is trained on a new dataset.

This can make it challenging or impossible to train a model on multiple datasets of differing distributions, or to fine-tune a pre-trained model on a new dataset that is dissimilar to the original dataset. Catastrophic forgetting occurs in deep learning models when new training data causes the model’s parameters to change, overwriting previously learned patterns and relationships. For example, consider a model that has been trained to recognize dogs in photographs: training it on a new dataset of cat photographs without taking precautions may cause the model to forget its ability to accurately recognize dogs, despite gaining the ability to recognize cats.

Intermediate training can help to prevent catastrophic forgetting by allowing the model to learn about the new dataset without forgetting the patterns and relationships it has learned from the previous dataset. This is possible due to the nature of transfer learning, where the model starts with a set of parameters that have already been trained on a large, general-purpose dataset, and then fine-tunes the parameters on the intermediate dataset. This allows the model to learn from the intermediate dataset without forgetting what it has learned from the previous dataset, and can lead to better performance on the final fine-tuning step. For example, intermediate training would allow the model to learn from the Covid-related concepts and terminologies without forgetting the original open domain learning of the original BERT and could lead to better performance on both tasks.

Comparison of CovidBERT and Original BERT on Masked Language Modeling Task

In order to evaluate the performance of our previously trained CovidBERT, we will compare it against a standard, out-of-the-box (OOB) BERT model as well as an OOB BERT model that will be fine-tuned on a dataset of COVID-related news articles. To perform this comparison, we have chosen the task of mask filling as it is a suitable benchmark for evaluating the performance of language models.

We will fine-tune the OOB BERT model on the COVID-related news articles with fewer epochs (5) compared to the intermediate training (50) for CovidBERT. This is done to observe the difference in performance between the models and also to demonstrate the efficiency of fine-tuning as a means of improving model performance. The perplexity of the out-of-the-box BERT model before fine-tuning is 25.29, and it drops to 4.38 after about 5 epochs of fine-tuning. This is because fine-tuning allows the model to utilize the pre-trained weights and adjust them slightly instead of training from scratch, thus allowing the model to converge faster.

In masked language modeling, filling in the mask refers to the process of predicting the missing or masked tokens in a sentence or text. In this task, a portion of the input text is replaced with a special token, typically “[MASK]” and the model is trained to predict the original token that was replaced. MLM is important because it allows the model to learn the context of words in a sentence and understand the relationships between words. By training on a large corpus of text with masked tokens, the model can learn to predict the missing tokens based on the context of the surrounding words. This improves the model’s ability to understand the meaning of text and perform other NLP tasks such as text classification, machine translation, and question answering. Additionally, MLM can also help to improve the model’s ability to understand idiomatic phrases and figures of speech, which are often difficult for models to learn through traditional supervised learning methods.
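A minimal sketch of how the mask-filling comparison can be run with the fill-mask pipeline; the local path to the trained CovidBERT artifacts is a placeholder:

from transformers import pipeline

covid_fill = pipeline("fill-mask", model="./covidbert", tokenizer="./covidbert")
oob_fill = pipeline("fill-mask", model="bert-base-uncased")

text = "A drug called [MASK] appears to actually work against the coronavirus that causes COVID-19."
for prediction in covid_fill(text, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))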

CovidBERT vs Original BERT on mask filling task

The results of our evaluation (illustrated above) reveal the exceptional performance of CovidBERT as compared to both the original BERT and the fine-tuned BERT when it comes to predicting the names of pharmaceutical companies and recognizing drug names that are contextually related to the coronavirus. This can be observed in instances where CovidBERT accurately predicts the names of pharmaceutical companies, as in the sentence “Pfizer and [MASK] are both testing new formulations of their vaccine tailored to the omicron variant.” Similarly, CovidBERT’s aptitude for recognizing drug names that are contextually related to the coronavirus is highlighted by the input “A drug called [MASK] appears to actually work against the coronavirus that causes COVID-19,” where a majority of the drug names predicted were related to COVID-19. Our analysis reinforces the significance of intermediate pre-training in imparting domain-specific knowledge to the model and the value of extracting and utilizing a custom vocabulary as seed knowledge. Furthermore, it illustrates that CovidBERT is resilient to catastrophic forgetting, retaining knowledge of older real-world information inherited from the original BERT. This can be inferred by examining the table results closely, where the words shown in bold are those predicted by all three models. CovidBERT also enhances this common knowledge with terms more apt for the COVID era: it predicts terminology such as lockdown, pandemic, tele-health, and virus variant names with precision, validating it as the preeminent model of the three.

The intermediate training of CovidBERT and the creation of the other candidate models for the comparative evaluation discussed in this article can be found in the GitHub repository that accompanies this article. Notebook 01-pretrain-from-scratch.ipynb is used to train BERT from scratch (intermediate training), while Notebook 02-oob-with-finetuning.ipynb demonstrates how to fine-tune the original BERT on our dataset. This approach requires fewer epochs and computational resources compared to training from scratch. Lastly, notebook 03-oob-without-finetuning.ipynb showcases how to leverage the OOB BERT model to take on the mask-filling task, which is less optimal compared to intermediate training and fine-tuning, as it does not leverage the specific vocabulary and concepts of the target domain. The notebooks provide clear and cohesive guidance for readers to follow along and understand the process of training domain-specific language models, and of comparing and evaluating them.

V. Original BERT vs CovidBERT for Text Classification

Fine-tuning for downstream tasks is a prevalent method for building a classification model using BERT, as it utilizes a limited labeled dataset that is custom-built for that specific task. However, in many cases, the application (classifier) requires specific keywords, terminologies, and concepts from a specific domain that may not be reflected in the training corpus used by BERT or its variants, resulting in a model that is poorly suited for the application. Despite this, fine-tuning a pre-trained model like BERT requires a relatively small number of epochs, with the authors of BERT recommending 2 to 4 epochs. For our experiment, we set the number of epochs to 2, using 4 instances of ml.p3.16xlarge with a max length of 512, a chunk size of 128, and a batch size of 8. The evaluation results indicate that CovidBERT outperforms the regular BERT on both the validation and holdout datasets. Notebook 01-finetune-for-multiclass-clf-oob.ipynb fine-tunes the original BERT for classification, while Notebook 02-finetune-for-multiclass-clf-custom.ipynb fine-tunes the masked language model trained from scratch for classification.
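A hedged sketch of the classification fine-tuning script; the checkpoint and dataset paths are placeholders, while the label count, epochs, and batch size follow the values above:

from datasets import load_from_disk
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast(vocab_file="vocab.txt", model_max_length=512)
# Load the intermediate-trained CovidBERT checkpoint and add a fresh 5-way classification head.
model = BertForSequenceClassification.from_pretrained("./covidbert", num_labels=5)

args = TrainingArguments(output_dir="/opt/ml/model",
                         num_train_epochs=2,
                         per_device_train_batch_size=8,
                         evaluation_strategy="epoch")

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=load_from_disk("/opt/ml/input/data/train"),
                  eval_dataset=load_from_disk("/opt/ml/input/data/validation"))
trainer.train()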

BERT for text classification

For evaluating the performance of CovidBERT against the original BERT model fine-tuned on COVID-related news articles, we chose to use the macro F1 score. This metric provides a faithful picture of model performance by averaging the per-class F1 scores (the harmonic mean of precision and recall for each class) while treating every class as equally important. As the classification task is to assign the news articles to five distinct categories (business; environmental, social, and governance (ESG); general; science; and technology), the macro F1 score is an appropriate metric to use. It allows us to compare the performance of the two models on each class and make an informed decision on which model performs better.
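A minimal sketch of computing the metric with scikit-learn; the labels shown are purely illustrative:

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 3, 4, 1, 0]           # illustrative labels for the five categories
y_pred = [0, 1, 2, 4, 4, 1, 0]
macro_f1 = f1_score(y_true, y_pred, average="macro")   # per-class F1 averaged with equal weight
print(f"Macro F1: {macro_f1:.3f}")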

CovidBERT vs Original BERT on text classification task

Final Remarks

In conclusion, our experiments demonstrate that the model we have trained, known as CovidBERT, is better suited as a language model for COVID-19-related tasks. This further validates the importance of continued pre-training or domain-specific fine-tuning in industry-specific applications, particularly when there is a vast amount of domain-specific data available. The intermediate pre-training approach that we have adopted allows the model to learn new concepts and terminologies specific to the COVID-19 domain, while not compromising the general language understanding it has acquired from the original BERT model. This enables the model to perform better on the task of predicting missing tokens in a text, and ultimately improves its performance on other NLP tasks.

All the necessary code for the steps covered in this article is consolidated in the public GitHub repository that accompanies this article. Each notebook is clearly explained in the article, detailing its purpose and how it aligns with the overall process. As a culminating step, Notebook 01-deploy-model.ipynb guides users through the deployment of the classifier to a real-time endpoint for inference using Amazon SageMaker Hosting. This SageMaker service allows for the seamless deployment of ML models in a production environment, thus enabling real-time predictions to be made with the classifier we trained as a final step.
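The deployment itself might look roughly like the following sketch with the SageMaker Python SDK; the model artifact path, framework versions, and example payload are assumptions:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

hf_model = HuggingFaceModel(
    model_data="s3://<bucket>/covidbert-clf/model.tar.gz",   # artifact from the fine-tuning job
    role=sagemaker.get_execution_role(),
    transformers_version="4.17",                             # example DLC versions (assumption)
    pytorch_version="1.10",
    py_version="py38",
)

predictor = hf_model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")
print(predictor.predict({"inputs": "New COVID variant drives a surge in hospitalizations"}))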

Thank you for taking the time to read and engage with this article. Your support in the form of following me and clapping for the article is highly valued and appreciated. If you have any queries or doubts about the content of this article or the shared notebooks, please do not hesitate to reach out to me via email at arunprsh@amazon.com or shankar.arunp@gmail.com. You can also connect with me on https://www.linkedin.com/in/arunprasath-shankar/

I welcome any feedback or suggestions you may have. If you are passionate about ML at scale and NLP/NLU and are interested in collaboration, I would be delighted to connect with you. Additionally, if you are an individual, startup, or enterprise looking to gain insights on Amazon SageMaker and its applications in NLP/ML, I would be happy to assist you. Do not hesitate to reach out to me.


Arun Shankar

Global Lead Architect at Google. Ex-AWS. 13+ yrs exp in NLP/NLU, seasoned speaker & author, passionate about tackling intricate NLU challenges at scale.