Run inference with your own trained NLP model on AWS SageMaker using PyTorchModel or HuggingFaceModel
Welcome to a quick tutorial on creating two real-time inference endpoints, using the AWS PyTorch Inference Deep Learning Containers (DLCs) and the Hugging Face Inference DLCs with the AWS SageMaker Python SDK. We will deploy an NLP model that classifies publication abstracts into one or more classes (multi-label classification). Each abstract gets an independent binary decision for each of six classes: ‘Machine Learning’, ‘Computer Science’, ‘Physics’, ‘Mathematics’, ‘Biology’, ‘Finance-Economics’.
The main scope of this post is to deploy your NLP model using the AWS PyTorch or Hugging Face Inference Deep Learning Container.
If you want to create an endpoint for an object detection model you can visit this post by George Bakas.
Table of contents
- Saved model
- Custom inference.py script
- Create and upload the model.tar.gz file
- Deploy an endpoint with AWS SageMaker Python SDK
1. Saved model
Suppose we have trained our model, locally or on SageMaker, and saved it. You can save it in any format you want; below is an example of saving it as a .pth file:
import os
import torch

# ... trained `model`, then save it to `model_dir`
with open(os.path.join(args.model_dir, 'model.pth'), 'wb') as f:
    torch.save(model.state_dict(), f)
2. Custom inference.py script
Create a custom inference.py script by overriding the default handler functions (the AWS docs and the Hugging Face docs are great sources to learn more about the inference.py script):
- model_fn: loads the model; its return value is passed to predict_fn.
- input_fn: deserializes the request and pre-processes the input data; its return value is passed to predict_fn.
- predict_fn: makes the prediction; its return value is passed to output_fn.
- output_fn: post-processes the prediction and returns the response.
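To make the hand-off between the four functions concrete, here is a minimal sketch of how they chain together for one request. It uses a stand-in model (a plain Python function returning made-up scores) instead of a real network, and the `handle_request` driver is a hypothetical illustration rather than the actual SageMaker serving code. Only the four handler names and their argument order follow the SageMaker inference toolkit.

```python
import json

LABELS = ["Machine Learning", "Computer Science", "Physics",
          "Mathematics", "Biology", "Finance-Economics"]


def model_fn(model_dir):
    """Load the model once at start-up (stand-in: a dummy scorer)."""
    # A real script would load weights from `model_dir`; this placeholder
    # ignores it and returns a fixed score per label so the flow runs anywhere.
    return lambda text: [0.5 for _ in LABELS]


def input_fn(request_body, request_content_type):
    """Deserialize the request and extract the text to classify."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)["text"]


def predict_fn(input_data, model):
    """Run the model on whatever input_fn returned."""
    return dict(zip(LABELS, model(input_data)))


def output_fn(prediction, accept):
    """Serialize whatever predict_fn returned into the response body."""
    return json.dumps(prediction)


def handle_request(model, body, content_type="application/json",
                   accept="application/json"):
    """Hypothetical driver: shows the order in which the handlers run."""
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept)


model = model_fn("/opt/ml/model")  # called once, not per request
response = handle_request(model, json.dumps({"text": "An abstract about superfluids."}))
print(response)
```

The point of the split is that the model is loaded once by model_fn and reused, while the other three functions run on every request.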
First, we will show the inference.py script for the PyTorchModel class of the AWS SageMaker Python SDK, and then the one for HuggingFaceModel.
2a. for PyTorchModel
We need to override the model_fn function; its model_dir argument is the path to the unzipped model.tar.gz. Additionally, we override input_fn to take a paper’s abstract (text), tokenize it, and return the torch tensors to be fed into the model. We also modify predict_fn, which takes as arguments input_data (the tensors returned by input_fn) and model (the loaded model returned by model_fn). The predict_fn function evaluates the model on the input data and returns a dictionary with the predictions.
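The script itself is not reproduced here, so the following is a minimal sketch of what such an inference.py could look like, not the original one. The base checkpoint name (`bert-base-uncased`), the use of `AutoModelForSequenceClassification`, the 512-token limit, and the sigmoid over the logits are assumptions chosen for illustration; swap in the architecture that your model.pth was actually trained with. The tokenizer is fetched from the Hugging Face Hub on first use, which assumes the endpoint has network access. The heavy imports sit inside the handlers only so that the module can be imported without them; top-level imports work just as well inside the container.

```python
import json
import os

LABELS = ["Machine Learning", "Computer Science", "Physics",
          "Mathematics", "Biology", "Finance-Economics"]
BASE_MODEL = "bert-base-uncased"  # assumption: replace with your own checkpoint
MAX_LENGTH = 512                  # assumption: truncation length used in training

_tokenizer = None  # created on first use, then reused across requests


def _get_tokenizer():
    global _tokenizer
    if _tokenizer is None:
        from transformers import AutoTokenizer
        _tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    return _tokenizer


def model_fn(model_dir):
    """Rebuild the architecture and load the weights saved in model.pth."""
    import torch
    from transformers import AutoConfig, AutoModelForSequenceClassification

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    config = AutoConfig.from_pretrained(BASE_MODEL, num_labels=len(LABELS))
    model = AutoModelForSequenceClassification.from_config(config)
    state_dict = torch.load(os.path.join(model_dir, "model.pth"),
                            map_location=device)
    model.load_state_dict(state_dict)
    return model.to(device).eval()


def input_fn(request_body, request_content_type):
    """Parse the JSON request and tokenize the abstract into torch tensors."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    text = json.loads(request_body)["text"]
    return _get_tokenizer()(text, truncation=True, max_length=MAX_LENGTH,
                            return_tensors="pt")


def predict_fn(input_data, model):
    """Score the tokenized abstract and return one probability per label."""
    import torch

    device = next(model.parameters()).device
    inputs = {name: tensor.to(device) for name, tensor in input_data.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = torch.sigmoid(logits).squeeze(0).tolist()
    return {label: round(p, 3) for label, p in zip(LABELS, probabilities)}
```

A sigmoid (rather than a softmax) is used in this sketch because each class is an independent yes/no decision, so the scores do not have to sum to one.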
2b. for HuggingFaceModel
Let’s see how the inference.py script changes when we use the HuggingFaceModel class of the AWS SageMaker Python SDK.
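As a hedged sketch (not the original script), the Hugging Face variant could look like the code below. It relies on the Hugging Face inference toolkit decoding the JSON request and encoding the JSON response by default, so only model_fn and predict_fn are overridden here. The checkpoint name `bert-base-uncased`, the `AutoModelForSequenceClassification` architecture, the 512-token limit, and the helper `to_label_scores` are assumptions for illustration.

```python
import os

LABELS = ["Machine Learning", "Computer Science", "Physics",
          "Mathematics", "Biology", "Finance-Economics"]
BASE_MODEL = "bert-base-uncased"  # assumption: replace with your own checkpoint


def model_fn(model_dir):
    """Return everything predict_fn needs: the model and its tokenizer."""
    import torch
    from transformers import (AutoConfig, AutoModelForSequenceClassification,
                              AutoTokenizer)

    config = AutoConfig.from_pretrained(BASE_MODEL, num_labels=len(LABELS))
    model = AutoModelForSequenceClassification.from_config(config)
    state_dict = torch.load(os.path.join(model_dir, "model.pth"),
                            map_location="cpu")
    model.load_state_dict(state_dict)
    model.eval()
    return model, AutoTokenizer.from_pretrained(BASE_MODEL)


def to_label_scores(probabilities):
    """Pair each probability with its class name (hypothetical helper)."""
    return {label: round(float(p), 3) for label, p in zip(LABELS, probabilities)}


def predict_fn(data, model_and_tokenizer):
    """`data` is the already-decoded JSON request, e.g. {"text": "..."}."""
    import torch

    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["text"], truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return to_label_scores(torch.sigmoid(logits)[0].tolist())
```

Compared with the PyTorch container, there is no JSON parsing in the script: predict_fn receives a Python dictionary and returns one.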
3. Create and upload the model.tar.gz file
Arrange the files in the layout SageMaker expects inside the model.tar.gz file. We can create the archive locally and then upload it to an AWS S3 bucket (more on this in a second).
The structure of the model.tar.gz file should be as follows:
model.tar.gz/
├── model.pth
└── code/
    ├── inference.py
    └── requirements.txt
Create a requirements.txt file for the extra packages that are needed. We just need the transformers library; all the other required libraries are already in the PyTorch container by AWS.
Now that we have all the components, let’s create the .tar.gz file with the model.pth file and the code directory, as shown above. We can use the Linux command:
tar zcvf model.tar.gz model.pth ./code
Before AWS SageMaker hosting services can serve our model, we have to upload the model artifacts (model.tar.gz) to an S3 bucket where SageMaker can access them.
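If you prefer to script the packaging and upload steps, the sketch below does the same thing from Python. The helper names (`build_model_archive`, `list_archive`, `upload_archive`) are hypothetical, and the upload assumes boto3 is installed and AWS credentials with write access to the bucket are configured.

```python
import os
import tarfile


def build_model_archive(model_path, code_dir, archive_path="model.tar.gz"):
    """Create the archive with model.pth at the root and scripts under code/."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_path, arcname="model.pth")
        tar.add(code_dir, arcname="code")  # adds the directory recursively
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names, handy for double-checking the layout."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


def upload_archive(archive_path, bucket, key="model.tar.gz"):
    """Upload the archive to S3 and return its S3 URI."""
    import boto3  # imported here so the packaging helpers work without boto3

    boto3.client("s3").upload_file(archive_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

For example, `build_model_archive("model.pth", "code")` followed by `upload_archive("model.tar.gz", "<your-bucket>")` yields the S3 URI to pass as the model location when deploying.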
4. Deploy an endpoint with AWS SageMaker SDK
Create a notebook instance on AWS SageMaker, then open it.
4a. using PyTorchModel (and the related 2a. inference.py script)
Get an IAM role with permission to create an endpoint, and the S3 location of your trained model archive.
Note: the S3 bucket used here does not exist; it is just for demonstration.
import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
s3_location = 's3://sdim-nlp/model.tar.gz'
We create a PyTorchModel object, passing the location of the model weights and the inference script. We also select a PyTorch framework version, which should match the one we used to train the model.
pytorch_model = PyTorchModel(
    model_data=s3_location,
    role=role,
    framework_version='1.10',
    py_version="py38",
    entry_point='inference.py'
)
Check the pricing on https://aws.amazon.com/sagemaker/pricing/ to choose the instance_type you need; here we used ‘ml.m4.xlarge’ for real-time inference.
We also pass the number of instances we need, the endpoint name, the serializer (which encodes the data we pass to predictor.predict into the request body, here JSON), and the deserializer (which decodes the endpoint's response back into a Python object).
predictor = pytorch_model.deploy(
    instance_type='ml.m4.xlarge',
    initial_instance_count=1,
    endpoint_name='sdim-tagger',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)
4b. using HuggingFaceModel (and the related 2b. inference.py script)
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()
s3_location = 's3://sdim-nlp/model.tar.gz'
Here we chose a specific container image by passing the image_uri argument; you can find other container images on AWS here.
huggingface_model = HuggingFaceModel(
    model_data=s3_location,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10.2",
    py_version="py38",
    image_uri='763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04'
)

predictor = huggingface_model.deploy(
    instance_type="ml.m4.xlarge",
    initial_instance_count=1,
    endpoint_name='sdim-tagger-hf'
)
Let’s see an example of real-time inference: we create a sample and send it to the endpoint.
Note: this is the same with either PyTorchModel or HuggingFaceModel
data = {'text': "Although holographic duality has been regarded as a complementary tool in helping understand the non-equilibrium dynamics of strongly coupled many-body systems, it still remains an open question how to confront its predictions quantitatively with the real experimental scenarios. By taking a right evolution scheme for the holographic superfluid model and matching the holographic data with the phenomenological dissipative Gross-Pitaeviskii models, we find that the holographic dissipation mechanism can be well captured by the Landau form, which is expected to greatly facilitate the quantitative test of the holographic predictions against the upcoming experimental data. Our result also provides a prime example how holographic duality can help select proper phenomenological models by invalidating the claim made in the previous literature that the Keldysh self energy can serve as an effective description of the holographic dissipation in superfluids."
}

predictor.predict(data)
output: {'Machine Learning': 0.005,
         'Computer Science': 0.003,
         'Physics': 0.997,
         'Mathematics': 0.007,
         'Biology': 0.226,
         'Finance-Economics': 0.002}
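The endpoint returns one independent score per class, so turning the response into a set of tags means applying a cut-off. The sketch below assumes the scores are probabilities and uses a 0.5 threshold by default; the helper name `predicted_labels` and the threshold values are illustrative choices, not part of the deployed model.

```python
def predicted_labels(scores, threshold=0.5):
    """Return the classes whose score reaches the threshold, highest first."""
    chosen = [(label, score) for label, score in scores.items()
              if score >= threshold]
    chosen.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in chosen]


# The scores returned by the endpoint for the sample abstract above.
scores = {"Machine Learning": 0.005, "Computer Science": 0.003,
          "Physics": 0.997, "Mathematics": 0.007,
          "Biology": 0.226, "Finance-Economics": 0.002}

print(predicted_labels(scores))       # ['Physics']
print(predicted_labels(scores, 0.2))  # ['Physics', 'Biology']
```

Lowering the threshold trades precision for recall, so it is worth tuning it on a validation set rather than keeping the default.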
Note: if you do not need the model and endpoint, do not forget to delete them.
predictor.delete_model()
predictor.delete_endpoint()
References:
[1] AWS SageMaker Python SDK
[2] Custom inference script. HuggingFace on AWS SageMaker. by Philipp Schmid
[3] Fine tune a PyTorch BERT model and deploy it with Elastic Inference on Amazon SageMaker