Run inference on your own trained NLP model on AWS SageMaker with PyTorchModel or HuggingFaceModel

Spyros Dimitriadis
Published in Innovation-res
5 min read · Jul 11, 2022

Welcome to a quick tutorial on creating two real-time inference endpoints, using the AWS PyTorch Inference Deep Learning Containers (DLCs) and the Hugging Face Inference DLCs with the AWS SageMaker Python SDK. We will deploy an NLP model that classifies publication abstracts into one or more classes (multi-target classification). The classes an abstract can belong to are ‘Machine Learning’, ‘Computer Science’, ‘Physics’, ‘Mathematics’, ‘Biology’ and ‘Finance-Economics’.

The main scope of this post is to deploy your NLP model using the AWS PyTorch or Hugging Face Inference Deep Learning Container.

If you want to create an endpoint for an object detection model you can visit this post by George Bakas.

Table of contents

  1. Saved model
  2. Custom inference.py script
  3. Create and upload the model.tar.gz file
  4. Deploy an endpoint with AWS SageMaker Python SDK

1. Saved model

Suppose we have trained our model locally or on SageMaker and saved it. You can save it in any format you want; below is an example of how to save it as a .pth file:

import os
import torch
# ... after training `model`, save its weights to `args.model_dir`
with open(os.path.join(args.model_dir, 'model.pth'), 'wb') as f:
    torch.save(model.state_dict(), f)

2. Custom inference.py script

Create a custom inference.py script by overriding the default handler functions (some good sources to learn more about the inference.py script: the AWS docs and Hugging Face):

  • model_fn loads the model; its return value is passed to predict_fn
  • input_fn deserializes the request and pre-processes the input data; its return value is passed to predict_fn
  • predict_fn makes the prediction; its return value is passed to output_fn
  • output_fn post-processes the prediction and returns the response

First, we will show the inference.py script for PyTorchModel AWS SageMaker Python SDK and then with HuggingFaceModel.

2a. for PyTorchModel

We need to overwrite the model_fn function; its argument model_dir is the path to the unzipped model.tar.gz. Additionally, we will overwrite input_fn to take a paper’s abstract (text), tokenize it, and return the torch tensors to be fed into the model. We will also modify predict_fn, which takes as arguments the input_data (the tensors returned by input_fn) and the model (the loaded model returned by model_fn). The predict_fn function evaluates the model on the input data and returns a dictionary with the predictions.

PyTorchModel inference.py script
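
Since the full script is embedded as a gist in the original post, here is a minimal sketch of what such a script could look like. The model class (AutoModelForSequenceClassification), the ‘bert-base-uncased’ checkpoint, the label order and the 512-token limit are assumptions for illustration only; adapt them to whatever you actually trained.

import json
import os

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumptions: label order used during training and the pretrained checkpoint
LABELS = ['Machine Learning', 'Computer Science', 'Physics',
          'Mathematics', 'Biology', 'Finance-Economics']
CHECKPOINT = 'bert-base-uncased'
TOKENIZER = AutoTokenizer.from_pretrained(CHECKPOINT)


def model_fn(model_dir):
    # model_dir is the path to the unzipped model.tar.gz
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=len(LABELS),
        problem_type='multi_label_classification')
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f, map_location=device))
    return model.to(device).eval()


def input_fn(request_body, request_content_type):
    # deserialize the JSON request and tokenize the abstract
    data = json.loads(request_body)
    return TOKENIZER(data['text'], truncation=True, padding='max_length',
                     max_length=512, return_tensors='pt')


def predict_fn(input_data, model):
    # run the model and return one probability per class
    device = next(model.parameters()).device
    input_data = {k: v.to(device) for k, v in input_data.items()}
    with torch.no_grad():
        logits = model(**input_data).logits
    probs = torch.sigmoid(logits).squeeze().tolist()
    return {label: round(p, 3) for label, p in zip(LABELS, probs)}


def output_fn(prediction, accept):
    # serialize the prediction dictionary to JSON
    return json.dumps(prediction)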

2b. for HuggingFaceModel

Let’s see how the inference.py script is modified when using the HuggingFaceModel class from the AWS SageMaker Python SDK.
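
As with 2a, the full script lives in a gist; the sketch below shows one possible adaptation, under the same assumptions about the checkpoint and label order. With the Hugging Face Inference Toolkit it is usually enough to override model_fn and predict_fn and keep the default JSON (de)serialization, so predict_fn receives the already-deserialized request dictionary.

import os

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumptions: same label order and checkpoint as in the PyTorch sketch above
LABELS = ['Machine Learning', 'Computer Science', 'Physics',
          'Mathematics', 'Biology', 'Finance-Economics']
CHECKPOINT = 'bert-base-uncased'


def model_fn(model_dir):
    # load model and tokenizer; the returned tuple is passed to predict_fn
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=len(LABELS),
        problem_type='multi_label_classification')
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f, map_location=device))
    return model.to(device).eval(), tokenizer


def predict_fn(data, model_and_tokenizer):
    # data is the request body already deserialized by the default input_fn
    model, tokenizer = model_and_tokenizer
    device = next(model.parameters()).device
    inputs = tokenizer(data['text'], truncation=True, padding='max_length',
                       max_length=512, return_tensors='pt').to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze().tolist()
    return {label: round(p, 3) for label, p in zip(LABELS, probs)}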

3. Create and upload the model.tar.gz file

Construct the necessary format inside the model.tar.gz file. We can create this locally and then upload it to an AWS S3 bucket (more on this in a second).

The structure of the model.tar.gz file should be as follows:

model.tar.gz/
├── model.pth
└── code/
    ├── inference.py
    └── requirements.txt

Create a requirements.txt file for the extra packages needed. We just need the transformers library; all the other required libraries are already included in the AWS PyTorch container.
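
A minimal requirements.txt can therefore contain a single pinned line; the exact version is an assumption here and should match the one used during training:

transformers==4.17.0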

Now that we have all the components, let’s create the .tar.gz file with the model.pth file and the code directory, as shown above. We can use the Linux command:

tar zcvf model.tar.gz model.pth ./code

Before AWS SageMaker hosting services can serve our model, we have to upload the model artifacts (model.tar.gz) to an S3 bucket where SageMaker can access them.

upload model.tar.gz to an AWS S3 bucket
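
For example, the upload can be done with the AWS CLI, using the demonstration bucket that also appears in the next section:

aws s3 cp model.tar.gz s3://sdim-nlp/model.tar.gz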

4. Deploy an endpoint with the AWS SageMaker Python SDK

Create a notebook instance on AWS SageMaker

an AWS SageMaker notebook instance

Open the created notebook instance.

4a. using PyTorchModel (and the related 2a. inference.py script)

Get an IAM role with permissions to create an Endpoint and an S3 location with the path to your trained SageMaker model.

Note: the S3 bucket used here does not exist; it is just for demonstration.

import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
s3_location = 's3://sdim-nlp/model.tar.gz'

We create a PyTorchModel object, passing the location of the model weights and the inference script. We also select a PyTorch framework version, which should match the one used to train the model.

pytorch_model = PyTorchModel(
    model_data=s3_location,
    role=role,
    framework_version='1.10',
    py_version="py38",
    entry_point='inference.py'
)

Check the pricing at https://aws.amazon.com/sagemaker/pricing/ to choose the instance_type you need; here we use ‘ml.m4.xlarge’ for real-time inference.

We also pass the number of instances we need, the endpoint name, the deserializer (to deserialize the Invoke request body into an object we can perform prediction on), and the serializer (to serialize the prediction result into the desired response content type).

predictor = pytorch_model.deploy(
    instance_type='ml.m4.xlarge',
    initial_instance_count=1,
    endpoint_name='sdim-tagger',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

4b. using HuggingFaceModel (and the related 2b. inference.py script)

from sagemaker.huggingface import HuggingFaceModel
import sagemaker
role = sagemaker.get_execution_role()
s3_location = 's3://sdim-nlp/model.tar.gz'

Here we choose a specific container image by passing the image_uri argument; you can find other container images provided by AWS here.

huggingface_model = HuggingFaceModel(
    model_data=s3_location,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10.2",
    py_version="py38",
    image_uri='763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04'
)

predictor = huggingface_model.deploy(
    instance_type="ml.m4.xlarge",
    initial_instance_count=1,
    endpoint_name='sdim-tagger-hf'
)

Let’s see an example of real-time inference: we create a sample and send it to the endpoint.

Note: this is the same for either PyTorchModel or HuggingFaceModel.

data = {'text': "Although holographic duality has been regarded as a complementary tool in helping understand the non-equilibrium dynamics of strongly coupled many-body systems, it still remains an open question how to confront its predictions quantitatively with the real experimental scenarios. By taking a right evolution scheme for the holographic superfluid model and matching the holographic data with the phenomenological dissipative Gross-Pitaeviskii models, we find that the holographic dissipation mechanism can be well captured by the Landau form, which is expected to greatly facilitate the quantitative test of the holographic predictions against the upcoming experimental data. Our result also provides a prime example how holographic duality can help select proper phenomenological models by invalidating the claim made in the previous literature that the Keldysh self energy can serve as an effective description of the holographic dissipation in superfluids."
}
predictor.predict(data)

output:
{'Machine Learning': 0.005,
 'Computer Science': 0.003,
 'Physics': 0.997,
 'Mathematics': 0.007,
 'Biology': 0.226,
 'Finance-Economics': 0.002}

Note: if you do not need the model and endpoint, do not forget to delete them.

predictor.delete_model()
predictor.delete_endpoint()

References:

[1] AWS SageMaker Python SDK
[2] Custom inference script, HuggingFace on AWS SageMaker, by Philipp Schmid
[3] Fine-tune a PyTorch BERT model and deploy it with Elastic Inference on Amazon SageMaker
