AWS SageMaker: Train, Deploy and Update a Hugging Face BERT Model

#HuggingFace #AWS #BERT #SageMaker #MLOps

Vinayak Shanawad
Analytics Vidhya
12 min read · Jul 2, 2022


Traditional NLP techniques such as TF-IDF, word2vec, or bag-of-words (BOW) are used to generate text features for training text classification models. They have been very successful in many NLP tasks, but they don’t always capture the meaning of words accurately when those words appear in different contexts.

We achieve better results in text classification tasks with BERT because of its ability to encode the meaning of words more accurately in different contexts.

Amazon SageMaker enables developers to create, train, deploy and monitor machine-learning models in the cloud.

Image from AWS

Table of contents

  1. Problem statement
  2. Goal
  3. Dataset
  4. Amazon SageMaker Training
  5. Types of AWS instances used for model training on SageMaker
  6. Model training using on-demand instances
  7. Model training using spot instances
  8. Model deployment on SageMaker endpoint
  9. Update a SageMaker model endpoint
  10. Conclusion and references

Please note that the objective of this post is not to build a robust model, but rather to show how to train a Hugging Face BERT model on SageMaker. Let’s go through each step in detail.

1. Problem statement

Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

More details here.

2. Goal

We use a Kaggle competition dataset, which consists of fake and real tweets about disasters. The task is to classify whether a given tweet describes a real disaster.

3. Dataset

The dataset consists of labelled fake and real tweets. Labels are provided in the “target” column. There are several features, though for the purpose of this blog post we will use only the “text” field, which contains the text of the tweet.

Image from Kaggle dataset (Screenshot by Author)

Please refer to this Git repo for data preparation and EDA. Let’s focus on training a Hugging Face BERT model using AWS SageMaker.

4. Amazon SageMaker Training

Image from AWS
  • When a SageMaker training job is started, SageMaker prepares the instances for training (an “ml.m5.xlarge” instance in this post).
  • SageMaker downloads or reads the input data from Amazon S3 and uses that data to train a model.
  • SageMaker pulls the model training container (a PyTorch container in this post, though Hugging Face and TensorFlow containers can also be used) from Amazon Elastic Container Registry (ECR), and the training data from S3 is made available to that container; see the prebuilt SageMaker Docker images for deep learning.
  • Once model training is complete, the training job persists the model artifacts to the output S3 location designated in the training job configuration.
  • When we are ready to deploy a model, SageMaker spins up new ML instances (an “ml.m4.xlarge” instance in this post), pulls the model inference container (a PyTorch container) from ECR, and then pulls in the model artifacts from S3 to use for real-time inference.

5. Types of AWS instances used for model training on SageMaker

Let’s understand the types of AWS instances which can be used for Model training on AWS SageMaker.

a. On-demand instances:

  • We tell AWS SageMaker which EC2 instance type we need and how many; the instances are then created on demand, configured, and terminated automatically once the training job is complete.
  • On-demand instances are made available whenever we require them, and we pay for the time we use them.
  • On-demand instances run uninterrupted until the training job completes or we terminate them.

b. Spot instances:

  • It’s a kind of auction/bidding for spare, unused EC2 capacity.
  • If our bid price ≥ the market (spot) price, which changes in real time based on demand and supply, a spot instance is launched.
  • If spare capacity is exhausted or the spot price rises above our bid price, the spot instance is terminated with a two-minute warning.
  • EC2 spot instances help lower machine learning training costs by up to 90% compared to using on-demand instances in AWS SageMaker.

6. Model training using on-demand instances

Let’s focus on training a HuggingFace BERT model using AWS SageMaker on-demand instances.

Model Training script

We use the PyTorch-Transformers library, which contains PyTorch implementations and pre-trained model weights for many NLP models, including BERT.

Our training script should save the model artifacts learned during training to the path provided as model_dir, as stipulated by the SageMaker PyTorch image. Upon completion of training, the artifacts saved in model_dir are uploaded to S3 by SageMaker and become available for deployment.

We save this script in a file named train_deploy.py, and put the file in a directory named code/. The full training script can be viewed under code/.

Let’s look at a few blocks from train_deploy.py.

Import the necessary libraries

Defining the logger and loading a BERT tokenizer
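A minimal sketch of this setup, assuming a standard logging configuration and the bert-base-uncased tokenizer (the exact imports in the author’s train_deploy.py may differ):

import logging
import sys

from transformers import BertTokenizer

# Log to stdout so messages show up in the SageMaker training logs
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

# Load the pre-trained, lowercase BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)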

As we know, PyTorch offers DataLoader to parallelize the data-loading process with automatic batching; this speeds up training and keeps memory usage in check. Let’s get the train and test data loaders.

Getting a train data loader

Getting a test data loader
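A hedged sketch of both data-loader helpers; the helper names, the train.csv/test.csv layout, and the maximum sequence length are assumptions based on the dataset description above (“text” and “target” columns), not the author’s exact code:

import os

import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

MAX_LEN = 64  # assumed maximum sequence length

def _make_loader(csv_path, tokenizer, batch_size, shuffle):
    # Read the CSV, tokenize the tweet text, and wrap everything in a DataLoader
    df = pd.read_csv(csv_path)
    encodings = tokenizer(df["text"].tolist(), truncation=True,
                          padding="max_length", max_length=MAX_LEN,
                          return_tensors="pt")
    dataset = TensorDataset(encodings["input_ids"],
                            encodings["attention_mask"],
                            torch.tensor(df["target"].values))
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

def get_train_data_loader(batch_size, data_dir, tokenizer):
    return _make_loader(os.path.join(data_dir, "train.csv"),
                        tokenizer, batch_size, shuffle=True)

def get_test_data_loader(batch_size, data_dir, tokenizer):
    return _make_loader(os.path.join(data_dir, "test.csv"),
                        tokenizer, batch_size, shuffle=False)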

Define and train a BERT model, and save the fine-tuned model
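A condensed, hedged sketch of the fine-tuning loop and model saving; the argument names (args.epochs, args.lr, args.model_dir) and the save format are assumptions rather than the author’s exact script:

import logging
import os

import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification

logger = logging.getLogger(__name__)

def test(model, test_loader, device):
    # Report accuracy on the held-out test set
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for input_ids, attention_mask, labels in test_loader:
            logits = model(input_ids=input_ids.to(device),
                           attention_mask=attention_mask.to(device)).logits
            preds = torch.argmax(logits, dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    logger.info("Test set: Accuracy: %f", correct / total)

def train(args, train_loader, test_loader, device):
    # Binary classifier: "Not a disaster" vs. "Real disaster"
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2).to(device)
    optimizer = AdamW(model.parameters(), lr=args.lr)

    for epoch in range(1, args.epochs + 1):
        model.train()
        total_loss = 0.0
        for step, (input_ids, attention_mask, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            outputs = model(input_ids=input_ids.to(device),
                            attention_mask=attention_mask.to(device),
                            labels=labels.to(device))
            outputs.loss.backward()
            optimizer.step()
            total_loss += outputs.loss.item()
            if step % 50 == 0:
                logger.info("Train Epoch: %d [%d/%d] Loss: %f", epoch,
                            step * len(input_ids), len(train_loader.dataset),
                            outputs.loss.item())
        logger.info("Average training loss: %f", total_loss / len(train_loader))
        test(model, test_loader, device)

    # Everything written to args.model_dir (SM_MODEL_DIR) is uploaded to S3 by SageMaker
    logger.info("Saving tuned model.")
    model.save_pretrained(args.model_dir)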

Model training using on-demand instances on Amazon SageMaker

The Amazon SageMaker Python SDK makes it easier to run a PyTorch script in Amazon SageMaker using its PyTorch estimator.

To start, we use the PyTorch estimator class to train our model. When creating our estimator, we make sure to specify a few things:

  • entry_point: the name of our PyTorch script. It contains our training script, which loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model. It also contains code to load and run the model during inference.
  • source_dir: the location of our training scripts and requirements.txt file. "requirements.txt" lists packages you want to use with your script.
  • framework_version: the PyTorch version we want to use.

After creating the estimator, we then call fit(), which launches a training job. We use the Amazon S3 URIs where we uploaded the training data earlier.

Image from AWS

At this point SageMaker creates the ml.m5.xlarge training instance using the built-in PyTorch Docker image available in Amazon ECR. It then downloads the data and config files from the S3 bucket to the SageMaker instance and starts the training job.

Create an estimator object and start on-demand instance training
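A hedged sketch of this step; the bucket/prefix names, the hyperparameter names, and the framework version are placeholders rather than the exact values from the original notebook:

import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

estimator = PyTorch(
    entry_point="train_deploy.py",
    source_dir="code",
    role=role,
    framework_version="1.8.1",     # PyTorch version (placeholder)
    py_version="py36",
    instance_count=1,
    instance_type="ml.m5.xlarge",  # on-demand CPU instance used in this post
    hyperparameters={"epochs": 2},
)

# Launch the training job using the S3 prefixes where the data was uploaded earlier
estimator.fit({
    "training": "s3://<bucket>/<prefix>/train",
    "testing": "s3://<bucket>/<prefix>/test",
})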

We can monitor the status of our training job under Amazon SageMaker > Training jobs, as shown below.

Training jobs (Screenshot by Author)

We should see logs similar to the following, which keep us informed of the status of the training job. The logs are displayed in the notebook and are also available in Amazon CloudWatch Logs for future reference.

2022-05-14 09:13:19 Starting - Starting the training job...
2022-05-14 09:13:42 Starting - Preparing the instances for trainingProfilerReport-1652519599: InProgress
......
2022-05-14 09:14:42 Downloading - Downloading input data...
2022-05-14 09:15:02 Training - Downloading the training image...
2022-05-14 09:15:50 Training - Training image download completed. Training in progress...
.........
.........
.........
Train Epoch: 1 [0/5709 (0%)] Loss: 0.752307
INFO:__main__:Train Epoch: 1 [0/5709 (0%)] Loss: 0.752307
Train Epoch: 1 [3200/5709 (56%)] Loss: 0.518363
INFO:__main__:Train Epoch: 1 [3200/5709 (56%)] Loss: 0.518363
Average training loss: 0.473186
INFO:__main__:Average training loss: 0.473186
Test set: Accuracy: 0.825221
INFO:__main__:Test set: Accuracy: 0.825221Train Epoch: 2 [0/5709 (0%)] Loss: 0.367052
INFO:__main__:Train Epoch: 2 [0/5709 (0%)] Loss: 0.367052
Train Epoch: 2 [3200/5709 (56%)] Loss: 0.313773
INFO:__main__:Train Epoch: 2 [3200/5709 (56%)] Loss: 0.313773
Average training loss: 0.356163
INFO:__main__:Average training loss: 0.356163
Test set: Accuracy: 0.824115Saving tuned model.
INFO:__main__:Test set: Accuracy: 0.824115
INFO:__main__:Saving tuned model.
2022-05-14 10:15:36,540 sagemaker-training-toolkit INFO Reporting training SUCCESS

2022-05-14 10:16:16 Uploading - Uploading generated training model
2022-05-14 10:17:16 Completed - Training job completed
ProfilerReport-1652519599: NoIssuesFound
Training seconds: 3755
Billable seconds: 3755

We trained the BERT model for 2 epochs to demonstrate model training on SageMaker. The accuracy on the test set is 82.41%, and training took 3755 seconds, so we pay for only 3755 billable seconds.

We could definitely reduce this time by using GPU-based instances; I used CPU-based instances due to resource limitations on my AWS free-tier account.

7. Model training using spot instances (Managed Spot Training)

  • As discussed earlier, model training using spot instances can reduce training costs by up to 90% compared to on-demand instances, and SageMaker manages the spot interruptions on our behalf.
  • Managed spot training is available in all training configurations: single-instance training, distributed training, and automatic model tuning.
  • EC2 spot instances can be interrupted or reclaimed at any time when AWS needs the capacity back, which can cause training jobs to take longer to start or finish.
  • Don’t worry, AWS SageMaker is a fully managed service and handles the following automatically:
    - Interrupting the training job
    - Obtaining adequate spot capacity again
    - Either restarting or resuming the training job
  • When spot instances are interrupted, we should use checkpointing to avoid restarting a training job from scratch.

Managed spot training parameters:

a. use_spot_instances = True: request spot capacity for the training job.

b. max_run: the maximum time, in seconds, the training job is allowed to run.

c. max_wait: the maximum total time, in seconds, we are willing to wait for the job, i.e. training time plus however long we are prepared to wait for spot capacity to become available (must be greater than or equal to max_run).

Create an estimator object and start managed spot training
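A hedged sketch of the same estimator configured for managed spot training; the checkpoint S3 path and the max_run/max_wait values are placeholders:

import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

spot_estimator = PyTorch(
    entry_point="train_deploy.py",
    source_dir="code",
    role=role,
    framework_version="1.8.1",
    py_version="py36",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"epochs": 2},
    use_spot_instances=True,     # request spot capacity
    max_run=7200,                # maximum training time in seconds
    max_wait=10800,              # training time + time to wait for spot capacity
    checkpoint_s3_uri="s3://<bucket>/<prefix>/checkpoints",  # resume after interruptions
)

spot_estimator.fit({
    "training": "s3://<bucket>/<prefix>/train",
    "testing": "s3://<bucket>/<prefix>/test",
})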

Let’s look at the status of model training.

2022-05-14 12:15:48 Starting - Starting the training job...
2022-05-14 12:15:49 Starting - Launching requested ML instancesProfilerReport-1652530547: InProgress
......
2022-05-14 12:17:14 Starting - Preparing the instances for training.........
2022-05-14 12:18:42 Downloading - Downloading input data...
2022-05-14 12:19:08 Training - Training image download completed. Training in progress.
.........
.........
.........
Train Epoch: 1 [0/5709 (0%)] Loss: 0.752307
INFO:__main__:Train Epoch: 1 [0/5709 (0%)] Loss: 0.752307
Train Epoch: 1 [3200/5709 (56%)] Loss: 0.518363
INFO:__main__:Train Epoch: 1 [3200/5709 (56%)] Loss: 0.518363
Average training loss: 0.473186
INFO:__main__:Average training loss: 0.473186
Test set: Accuracy: 0.825221
INFO:__main__:Test set: Accuracy: 0.825221Train Epoch: 2 [0/5709 (0%)] Loss: 0.367052
INFO:__main__:Train Epoch: 2 [0/5709 (0%)] Loss: 0.367052
Train Epoch: 2 [3200/5709 (56%)] Loss: 0.313773
INFO:__main__:Train Epoch: 2 [3200/5709 (56%)] Loss: 0.313773
Average training loss: 0.356163
INFO:__main__:Average training loss: 0.356163
Test set: Accuracy: 0.824115Saving tuned model.
INFO:__main__:Test set: Accuracy: 0.824115
INFO:__main__:Saving tuned model.
2022-05-14 12:55:39,669 sagemaker-training-toolkit INFO Reporting training SUCCESS

2022-05-14 12:55:53 Uploading - Uploading generated training model
2022-05-14 12:56:54 Completed - Training job completed
ProfilerReport-1652530547: NoIssuesFound
Training seconds: 2282
Billable seconds: 508
Managed Spot Training savings: 77.7%

In the previous step, we trained the BERT model on on-demand instances for 2 epochs; training took 3755 seconds, and billable seconds = 3755.

Looking at the logs above, managed spot training completed in 2282 seconds with billable seconds = 508.

Wow! Managed spot training saved 77.7% of the training cost in this run.

8. Model deployment on SageMaker endpoint

After training our model, we host it on an Amazon SageMaker Endpoint. To make the endpoint load the model and serve predictions, we implement a few methods in train_deploy.py.

  • model_fn(): loads the saved model and returns a model object that can be used for model serving. The SageMaker PyTorch model server loads our model by invoking model_fn().
  • input_fn(): deserializes and prepares the prediction input. In this example, our request body is first serialized to JSON and then sent to the model serving endpoint. Therefore, in input_fn(), we first deserialize the JSON-formatted request body and return the input as a torch.tensor, as required for BERT.
  • predict_fn(): performs the prediction and returns the result.
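Below is a minimal, hedged sketch of what these three functions might look like for this model; the JSON request format (a list of tweet strings) and the tokenizer settings are assumptions, not necessarily the author’s exact implementation:

import json

import torch
from transformers import BertForSequenceClassification, BertTokenizer

MAX_LEN = 64  # assumed; must match training

def model_fn(model_dir):
    # Load the fine-tuned model saved with save_pretrained() during training
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = BertForSequenceClassification.from_pretrained(model_dir)
    return model.to(device)

def input_fn(request_body, request_content_type):
    # Deserialize the JSON list of tweets and tokenize it for BERT
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
    sentences = json.loads(request_body)
    encodings = tokenizer(sentences, truncation=True, padding="max_length",
                          max_length=MAX_LEN, return_tensors="pt")
    return encodings["input_ids"], encodings["attention_mask"]

def predict_fn(input_data, model):
    # Run a forward pass and return the predicted class indices
    device = next(model.parameters()).device
    input_ids, attention_mask = (t.to(device) for t in input_data)
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return torch.argmax(logits, dim=1)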

To deploy our endpoint, we call deploy() on our PyTorch estimator object, passing in our desired number of instances and instance type:

predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

The estimator.deploy() method does three things in the background:

a. Registers a model on SageMaker: defines a name for the model and associates it with the S3 location where the model artifacts are stored.

SageMaker model (Screenshot by Author)

b. Creates an endpoint config: SageMaker creates a model-hosting endpoint configuration, which can be used for different purposes such as A/B testing or separating development and production versions.

SageMaker endpoint config (Screenshot by Author)

c. Creates an endpoint: finally, it creates the endpoint that is used for prediction.

SageMaker endpoint (Screenshot by Author)

If we look at the config details of the above endpoint, we can see that the registered SageMaker model is attached to it, which means the endpoint serves predictions using that registered model.

SageMaker endpoint production variant (Screenshot by Author)

Model Prediction/Inference
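A hedged sketch of calling the deployed endpoint; the serializer/deserializer choice assumes the JSON request format used by input_fn() above, and the label mapping follows the dataset’s target column (1 = real disaster):

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Send and receive JSON to/from the endpoint
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

tweets = [
    "I met my friend today by accident",
    "Frank had a severe head injury after the car accident last month",
    "Just happened a terrible car crash",
]

predictions = predictor.predict(tweets)
labels = ["Not a disaster", "Real disaster"]
predicted_labels = [labels[int(p)] for p in predictions]
print("predicted_labels:", predicted_labels)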

Inference results:

predicted_labels: ['Not a disaster', 'Real disaster', 'Real disaster']
I met my friend today by accident ---> Not a disaster
Frank had a severe head injury after the car accident last month ---> Real disaster
Just happened a terrible car crash ---> Real disaster

9. Update a SageMaker model endpoint

We all know that machine learning is a highly iterative process. When working on a data science project, data scientists and ML engineers often train thousands of different models in search of maximum accuracy. Indeed, the number of combinations of algorithms, datasets, and training parameters (aka hyperparameters) is practically infinite.

For example, let’s train the BERT model with an updated hyperparameter value (epochs = 3) and see whether model performance improves; a sketch of the retraining call follows.
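A hedged sketch of the retraining call, with the same placeholders as before and only the epochs hyperparameter changed:

import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

estimator_v2 = PyTorch(
    entry_point="train_deploy.py",
    source_dir="code",
    role=role,
    framework_version="1.8.1",
    py_version="py36",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"epochs": 3},  # bumped from 2 to 3
)

estimator_v2.fit({
    "training": "s3://<bucket>/<prefix>/train",
    "testing": "s3://<bucket>/<prefix>/test",
})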

If we see an improvement in model performance, we can update the existing SageMaker model endpoint.

Let’s look at the status of model training.

2022-05-14 10:28:20 Starting - Starting the training job...
2022-05-14 10:28:46 Starting - Preparing the instances for trainingProfilerReport-1652524100: InProgress
......
2022-05-14 10:29:46 Downloading - Downloading input data...
2022-05-14 10:30:06 Training - Downloading the training image.....
.........
.........
.........Train Epoch: 1 [0/5709 (0%)] Loss: 0.752307
INFO:__main__:Train Epoch: 1 [0/5709 (0%)] Loss: 0.752307
Train Epoch: 1 [3200/5709 (56%)] Loss: 0.518363
INFO:__main__:Train Epoch: 1 [3200/5709 (56%)] Loss: 0.518363
Average training loss: 0.473186
INFO:__main__:Average training loss: 0.473186
Test set: Accuracy: 0.825221
INFO:__main__:Test set: Accuracy: 0.825221Train Epoch: 2 [0/5709 (0%)] Loss: 0.367052
INFO:__main__:Train Epoch: 2 [0/5709 (0%)] Loss: 0.367052
Train Epoch: 2 [3200/5709 (56%)] Loss: 0.313773
INFO:__main__:Train Epoch: 2 [3200/5709 (56%)] Loss: 0.313773
Average training loss: 0.356163
INFO:__main__:Average training loss: 0.356163
Test set: Accuracy: 0.824115
INFO:__main__:Test set: Accuracy: 0.824115Train Epoch: 3 [0/5709 (0%)] Loss: 0.345569
INFO:__main__:Train Epoch: 3 [0/5709 (0%)] Loss: 0.345569
Train Epoch: 3 [3200/5709 (56%)] Loss: 0.371484
INFO:__main__:Train Epoch: 3 [3200/5709 (56%)] Loss: 0.371484
Average training loss: 0.274791
INFO:__main__:Average training loss: 0.274791

2022-05-14 11:59:06 Uploading - Uploading generated training modelTest set: Accuracy: 0.826327
INFO:__main__:Test set: Accuracy: 0.826327
Saving tuned model.
INFO:__main__:Saving tuned model.
2022-05-14 11:59:03,716 sagemaker-training-toolkit INFO Reporting training SUCCESS

2022-05-14 12:00:12 Completed - Training job completed
Training seconds: 5438
Billable seconds: 5438

Looking at the above results, the model trained for 3 epochs (test accuracy 82.63%) performs slightly better than the model trained for 2 epochs (82.41%). So let’s see how we can update the existing SageMaker model endpoint.

Register a model on SageMaker using the sm_client.create_model() method, then attach it to the existing endpoint; a sketch of these calls follows.
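A hedged sketch of this update flow using the boto3 SageMaker client; the model/config names, the S3 paths, and the container environment values are placeholders, not the author’s exact values:

import boto3
import sagemaker

sm_client = boto3.client("sagemaker")
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# SageMaker PyTorch inference image in ECR for this framework version (placeholder version)
inference_image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch", region=region, version="1.8.1",
    py_version="py36", instance_type="ml.m4.xlarge", image_scope="inference",
)

# 1. Register a new model that points to the new training job's artifacts in S3
sm_client.create_model(
    ModelName="pytorch-bert-model-v2",
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": "s3://<bucket>/<prefix>/model.tar.gz",
        "Environment": {
            "SAGEMAKER_PROGRAM": "train_deploy.py",
            "SAGEMAKER_SUBMIT_DIRECTORY": "s3://<bucket>/<prefix>/sourcedir.tar.gz",
        },
    },
)

# 2. Create an endpoint config that uses the new model
sm_client.create_endpoint_config(
    EndpointConfigName="pytorch-bert-config-v2",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "pytorch-bert-model-v2",
        "InstanceType": "ml.m4.xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 3. Point the existing endpoint at the new config
sm_client.update_endpoint(
    EndpointName="pytorch-training-2022-05-14-10-19-37-926",
    EndpointConfigName="pytorch-bert-config-v2",
)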

SageMaker model (Screenshot by Author)

If we look at the config details of the existing endpoint (pytorch-training-2022-05-14-10-19-37-926), we observe that the latest registered SageMaker model is attached to it.

SageMaker endpoint production variant (Screenshot by Author)

Let’s look at the inference results.

Inference results:

predicted_labels: ['Not a disaster', 'Real disaster', 'Real disaster']
I met my friend today by accident ---> Not a disaster
Frank had a severe head injury after the car accident last month ---> Real disaster
Just happened a terrible car crash ---> Real disaster

10. Conclusion

  • Hopefully this post helps you leverage the power of Amazon SageMaker to train, deploy, and update Hugging Face BERT models on your own data using AWS’s prebuilt deep learning containers.
  • We walked through the Amazon SageMaker training process and saw how to train a model using both on-demand instances and spot instances.
  • Model training using spot instances can reduce training costs by up to 90% compared to on-demand instances.
  • Machine learning is a highly iterative process; we saw how to update hyperparameters and then update a SageMaker model endpoint when model performance improves.
  • Finally, we must clean up the SageMaker endpoint to avoid ongoing charges; see the sketch after this list.
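A small cleanup sketch, assuming the predictor object returned by estimator.deploy() earlier is still available:

# Delete the endpoint and its endpoint config so the hosting instance stops incurring charges
predictor.delete_endpoint(delete_endpoint_config=True)

# Equivalent boto3 calls if the predictor object is no longer in scope:
# sm_client.delete_endpoint(EndpointName="<endpoint-name>")
# sm_client.delete_endpoint_config(EndpointConfigName="<endpoint-config-name>")
# sm_client.delete_model(ModelName="<model-name>")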

The complete source code for this post is available in the GitHub repo.

References

Thanks for reading!! If you have any questions, feel free to contact me.
