Leveraging AWS SageMaker Serverless Inference for Customized Model Serving

Mesut GÜRLEK
Picus Security Engineering
Jun 19, 2023

Machine learning has become a driving force in modern applications, enabling data-driven decision-making and automation in the cyber security domain. However, deploying and managing machine learning models at scale can be challenging. To simplify this process, at Picus we have built an end-to-end MLOps infrastructure on AWS SageMaker, leveraging its tools and features for building, training, and deploying machine learning models.

In the world of MLOps, where every stage of the pipeline is crucial, model serving plays a significant role. It is the stage at which your trained model becomes accessible to clients or applications, transforming raw input data into actionable predictions. This process is often referred to as “inference”, as it is where the model derives results from the input data. Selecting the appropriate method and location for serving models is vital for achieving efficiency and scalability in your MLOps pipelines.

In this blog post, we will explore the advantages of using serverless inference in AWS SageMaker for model serving. We will also discuss how it fits into the larger MLOps landscape and walk through a real-world implementation of model deployment with a serverless endpoint.

SageMaker Inference Types

In SageMaker Inference, you can either set up an endpoint that returns inferences or run batch inference on your model. Whether or not the model was developed in SageMaker, as long as the model artifacts are available, it can be served through SageMaker Endpoints.

Currently, AWS SageMaker provides four different inference types:

  • Real Time Inference: Dedicated EC2 instances of your choice with a persistent, fully managed endpoint (REST API). Suitable for low-latency or high-throughput requirements.
  • Serverless Inference: Suitable for intermittent or unpredictable traffic patterns. Serverless endpoint (REST API) with no instance selection. Supports memory up to 6 GB, payload sizes up to 4 MB, and processing times up to 60 seconds.
  • Batch Transform: Suitable for offline processing of large amounts of data. No need for a persistent endpoint.
  • Asynchronous Inference: Suitable when there is a need for queuing requests with large payloads. Supports payloads up to 1 GB and long processing times of up to one hour.

For more information about SageMaker Inference, you can refer to the related documentation.

In our case, due to unpredictable request traffic, we opted for serverless inference, which offers several advantages:

  • Cost Efficiency: You only pay for the compute resources consumed during the actual inference process, avoiding costs associated with idle resources.
  • Automatic Scaling: Serverless inference automatically scales with the number of incoming requests, ensuring optimal resource utilization without manual intervention.
  • Simplified Management: The underlying infrastructure is abstracted away, allowing you to focus on your model and its performance rather than managing servers.
  • Flexibility: Serverless inference supports custom pre-processing and post-processing logic as well as additional third-party libraries, enabling tailored solutions for specific use cases.

However, serverless inference does have a few limitations that are tolerable in our case:

  • Maximum memory is limited to 6 GB; since our models require less memory, this does not affect our decision.
  • Payloads are supported up to 4 MB, which fits our case, and the 60-second processing time is sufficient.
  • There is a cold start, since the endpoint is serverless. If the service is called after a long idle period, the artifacts and additional libraries needed for inference take time to load. This is tolerable in our case.
  • Total concurrent invocations are limited to 1,000, which is enough to handle our requests (the sketch after this list shows where the memory and concurrency settings appear in the endpoint configuration).
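These limits map directly onto the serverless settings passed when creating an endpoint configuration (covered in detail later). A minimal sketch with illustrative values:

# Illustrative serverless settings; MemorySizeInMB can go up to 6144 (6 GB)
# and MaxConcurrency caps concurrent invocations for the endpoint.
serverless_config = {
    "MemorySizeInMB": 4096,
    "MaxConcurrency": 20,
}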

SageMaker Components

Since our choice is Serverless Inference, the other options are out of the scope of this post. Now, let's go through the components involved in a SageMaker inference deployment. The SageMaker console provides an Inference menu for model deployment operations.

  • The ‘Models’ section contains the registered models with their configurations. The model artifact path and the environment variables used by the model in the endpoint are stored here.
  • Before an endpoint can be created, a configuration needs to be defined under the ‘Endpoint Configurations’ section. The configuration holds one or more models along with the resources they require.
  • In the ‘Endpoints’ section, the generated endpoints can be viewed. Monitoring options, such as invocation metrics, endpoint metrics, and logs, are available within the endpoint pages.

All of these operations can be performed both from the console and programmatically. Before delving into the implementation details of our structure, it is worth mentioning another SageMaker component called the SageMaker Model Registry, located in SageMaker Studio.

The Model Registry keeps versions of the models that are either developed in our ML pipeline or provided off-the-shelf. When we have a model to deploy to an endpoint, it is registered in both the Model Registry and the Inference Models. This is because:

  • The serverless endpoint loads the model using the models registered under the SageMaker Inference section.
  • The Model Registry keeps track of model versions along with their configurations. Additionally, by updating a model's status in the Model Registry, event-based model deployment can be triggered (a minimal approval call is sketched after this list).
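A minimal sketch of such a status change, assuming model_package_arn refers to the model package version to promote (the same change can also be made from the SageMaker Studio UI):

import boto3

# Approving a model package version; this emits the state-change event
# that our deployment flow reacts to (model_package_arn is a placeholder).
sagemaker_client = boto3.client("sagemaker")

sagemaker_client.update_model_package(
    ModelPackageArn=model_package_arn,
    ModelApprovalStatus="Approved",
)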

Now we have an understanding of which SageMaker components help us serve our models through serverless endpoints. Let's talk about our model deployment process and how we handle it.

Model Preparation and Packaging

We might have off-the-shelf pre-trained models or SageMaker-trained models. In either case, we need a model artifact stored in S3; using this artifact, the model can be registered in both the SageMaker Model Registry and the SageMaker Inference Models.

Let’s have a look at a simple case for SageMaker-trained models:

import sagemaker
from sagemaker.estimator import Estimator

# Amazon S3 path for model output
model_path = f"s3://{default_bucket}/{s3_prefix}/xgb_model"

# Retrieve the xgboost image
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.7-1",
    py_version="py3",
    instance_type=training_instance_type,
)

# Configure the training estimator
xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=model_path,
    sagemaker_session=sagemaker_session,
    role=role,
)

# Set hyperparameters
xgb_train.set_hyperparameters(
    objective="reg:linear",
    num_round=100,
    max_depth=4,
    eta=0.01
)

# Fit the model
xgb_train.fit({"train": train_input})

# Retrieve the model data path from the training job
model_artifacts = xgb_train.model_data

  • During training, model_path is provided to decide the location of the artifact.
  • After training, the artifact path can be retrieved from model_data.
  • The artifact will be a tar.gz file stored on S3.

Since our use cases require custom operations before or after the model call in the endpoint, we do not use the model artifact directly. Instead, we employ a custom inference layer, defined by a script following a specific convention.

In the inference script, the model is loaded, the relevant operations are applied to the input data, and the result is returned by the overridden functions.

If we have a pre-trained model (e.g., from HuggingFace), it can be packaged following the same convention. The code samples below show how to download a HuggingFace Transformer model and a Sentence Transformer model.

# Fetch and save a pre-trained Transformer model
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModel.from_pretrained("facebook/bart-large")

# Output model path
model_path = "model/"

model.save_pretrained(save_directory=model_path)
tokenizer.save_pretrained(save_directory=model_path)

# Fetch and save a pre-trained Sentence Transformer model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Output model path
model_path = "model/"

model.save(model_path)

Custom Inference Implementation

After downloading the model, a custom inference script is implemented. The custom inference script is a crucial component in the deployment process of a model on AWS SageMaker, particularly when using serverless inference. It defines how the input data should be processed, how the model should be used for generating predictions, and how the output should be formatted before being returned to the client. In the context of SageMaker, the serverless infrastructure uses the custom inference script to execute the model when handling incoming requests.

import joblib
import os
import json
from transformers import AutoTokenizer, AutoModel


# Deserializes the fitted model or loads the model from open-source resources.
def model_fn(model_dir):
    """
    Args:
        model_dir: the directory where the model is saved.
    Returns:
        The model
    """
    # Load the tokenizer from disk.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    # Load the model from disk.
    model = AutoModel.from_pretrained(model_dir)

    return model


# Takes the request body, checks the content type, and returns the input data
def input_fn(request_body, request_content_type):
    """
    Args:
        request_body: The body of the request sent to the model.
        request_content_type: (string) specifies the format/variable type of the request
    Returns:
        Input data parsed from JSON.
    """
    if request_content_type == 'application/json':
        request_body = json.loads(request_body)
        inp_var = request_body['Input']
        return inp_var
    else:
        raise ValueError("This model only supports application/json input")


# Generates a prediction from the input data
def predict_fn(input_data, model):
    """
    Args:
        input_data: Returned input data from input_fn
        model: Returned model from model_fn
    Returns:
        The predictions
    """
    return model.predict(input_data)


# Generates the response in the intended format
def output_fn(prediction, content_type):
    """
    Args:
        prediction: The returned value from predict_fn
        content_type: Content type that the endpoint expects, e.g. JSON, string
    Returns:
        Output data as a dict
    """
    res = int(prediction[0])
    resp_json = {'Output': res}
    return resp_json

By providing a custom inference script, you can customize the behavior of the model to suit your specific use case. This ensures that the serverless endpoint can effectively serve your model while accommodating various data formats and processing requirements.

When deploying a model using a SageMaker serverless endpoint, the custom inference script is bundled with the model and utilized by the serverless infrastructure to handle incoming requests. To package the model, you can use the following shell command to create a tar.gz file. Afterward, the file should be uploaded to an S3 path.

tar -cvpzf model.tar.gz model inference.py
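The upload itself can be done with the AWS CLI or boto3; a minimal boto3 sketch, with placeholder bucket and key names:

import boto3

# Upload the packaged model to S3 (bucket and key are placeholders).
s3_client = boto3.client("s3")
s3_client.upload_file(
    Filename="model.tar.gz",
    Bucket="<your-bucket>",
    Key="models/custom-model/model.tar.gz",
)

# The resulting S3 URI is what we later pass as the model data path.
model_path = "s3://<your-bucket>/models/custom-model/model.tar.gz"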

Third Party Libraries for Inference Environment

In another case, the model may need additional libraries beyond those provided in the endpoint environment. We handle these additional libraries in the model packaging phase: a requirements.txt file should be included alongside your model artifacts and custom inference script in model.tar.gz, under a code folder together with inference.py. This way, when the serverless endpoint is created, SageMaker automatically installs the specified dependencies in the environment before running the custom inference script. The proper folder convention looks like this:

├── code
│ ├── inference.py
│ └── requirements.txt
└── model
└── custom_model.bin

We went a step further with the customization of the model estimator and created a custom one.

  • Implemented a custom class that inherits from the BaseEstimator and TransformerMixin classes to wrap the model call with input preprocessing and output post-processing.
  • Generated a Scikit-Learn Pipeline object with the custom class and pickled it into a serialized file.
  • Added the custom class to the model package so that, while loading the model, the serialized class can be resolved.

The final folder tree structure looks like this:

├── code
│ ├── custom_estimator.py
│ ├── inference.py
│ └── requirements.txt
└── model
└── custom_model.bin

A sample custom estimator class implementation is shown below:

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import re


class CustomEstimator(BaseEstimator, TransformerMixin):
    def __init__(self, model, tokenizer=None):
        self.tokenizer = tokenizer
        self.model = model

    def _preprocess(self, input_data):
        content = pd.Series(input_data)
        content = content.apply(lambda x: x.replace('.', ' '))
        content = content.apply(lambda x: x.replace('-', ' '))
        content = content.apply(lambda x: re.sub(' +', ' ', x))
        return content.tolist()

    def fit(self, X, y=None):
        return self

    def transform(self, inputs):
        preprocessed_inputs = self._preprocess(inputs)
        embeddings = self.model.encode(preprocessed_inputs)

        return embeddings.tolist()

Creating a custom estimator class for model prediction allowed us to encapsulate the model, preprocessing, and postprocessing logic within a self-contained structure, promoting maintainability and reusability. Also, it abstracts the underlying model implementation from the custom inference script, simplifying the process of updating or adapting the model to new requirements.

The final model packaging using a Scikit-Learn Pipeline looks like this:

import pickle

from sklearn.pipeline import Pipeline

from custom_estimator import CustomEstimator

with open('custom_model.pickle', 'wb') as f:
    # Initialize CustomEstimator with the model object
    custom_estimator = CustomEstimator(model)

    # Create a pipeline
    sk_pipe = Pipeline([("estimator", custom_estimator)])

    # Output the pipeline as pickle
    pickle.dump(sk_pipe, f)

The packaged model can then be loaded in the inference script's model_fn function using the model directory argument.

import pickle
import os


def model_fn(model_dir):
    # Build the path
    model_path = os.path.join(model_dir, 'model/')

    # Load the model from disk.
    with open(os.path.join(model_path, "model.pickle"), 'rb') as f:
        model = pickle.load(f)

    return model
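With this packaging, predict_fn simply delegates to the unpickled pipeline; a minimal sketch, assuming the pickled object is the Scikit-Learn Pipeline built above:

# Minimal sketch: the unpickled object is the Scikit-Learn Pipeline,
# so transform() runs the CustomEstimator's preprocessing and encoding steps.
def predict_fn(input_data, model):
    return model.transform(input_data)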

Model Registering

Now we have a model package stored in S3 and ready for registration. As mentioned at the beginning, the model is registered in both the Model Registry and the SageMaker Inference Models. By registering models in SageMaker, your deployments follow a consistent process. This approach also makes it easier to manage multiple models, keep track of versions, and update or roll back deployments as needed.

Model registration (a.k.a. model creation) can be done using the SageMaker SDK as well as the Boto3 SageMaker client. In this section, we will cover the Boto3 way.

Create Model Package in Model Registry

We create a new model package in the Model Registry using the Boto3 client to keep the model versions and history. The deployment process is also handled via a model status change in the Model Registry, which will be explained later.

  1. Pick an image using the SageMaker SDK or an image URI from the provided pre-built AWS Docker images.
import sagemaker

# From the SageMaker SDK
image_uri = sagemaker.image_uris.retrieve(
    framework="sklearn",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type="ml.m5.xlarge",
)

# From the pre-built AWS Docker images
image_uri = '<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-cpu-py39-ubuntu20.04'

2. Define the model specifications and create the model in the Model Registry

import boto3
import sagemaker

# Initialize the SageMaker session
sess = sagemaker.Session()
sagemaker_role = sagemaker.get_execution_role(sagemaker_session=sess)

# Initialize the Boto3 SageMaker client
region = sess.boto_region_name
sagemaker_client = boto3.client("sagemaker", region_name=region)

# Define the model specifications
modelpackage_inference_specification = {
    "InferenceSpecification": {
        "Containers": [
            {
                "Image": image_uri,
                "ModelDataUrl": model_path
            }
        ],
        "SupportedRealtimeInferenceInstanceTypes": ["ml.t2.medium", "ml.m5.large"],
        "SupportedTransformInstanceTypes": ["ml.m5.large"],
        "SupportedContentTypes": ["application/json"],
        "SupportedResponseMIMETypes": ["application/json"],
    }
}

create_model_package_input_dict = {
    "ModelPackageGroupName": model_package_group_name,
    "ModelApprovalStatus": "PendingManualApproval"
}
create_model_package_input_dict.update(modelpackage_inference_specification)

# Create the model in the SageMaker Studio Model Registry
create_model_package_response = sagemaker_client.create_model_package(**create_model_package_input_dict)

Create Model in SageMaker Inference Models

# Checks the model versions in the Model Registry and returns the latest one
def get_latest_model_package_version(model_package_group_name, sagemaker_client):
    response = sagemaker_client.list_model_packages(ModelPackageGroupName=model_package_group_name)
    model_versions = []
    for package in response['ModelPackageSummaryList']:
        model_versions.append(package['ModelPackageVersion'])
    return max(model_versions)


model_version = get_latest_model_package_version(model_package_group_name, sagemaker_client)

# Create the model in SageMaker Inference Models
model_package_arn = create_model_package_response["ModelPackageArn"]

model_name = model_package_group_name + str(model_version)

create_model_response = sagemaker_client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": image_uri,
            "Mode": "SingleModel",
            "ModelDataUrl": model_path,
            "Environment": {
                'SAGEMAKER_SUBMIT_DIRECTORY': model_path,
                'SAGEMAKER_PROGRAM': 'inference.py',
                'SAGEMAKER_REGION': region,
                'LOG_LOCATION': "/tmp",
                'METRICS_LOCATION': "/tmp"
            },
        }
    ],
    ExecutionRoleArn=sagemaker_role,
)

SAGEMAKER_SUBMIT_DIRECTORY: The directory within the container in which the Python script for inference is located.

SAGEMAKER_PROGRAM: The Python script that should be invoked and used as the entry point for inference.

The LOG_LOCATION and METRICS_LOCATION environment variables are set for the case where MMS (Multi Model Server), the model serving architecture in the pre-built AWS image, cannot write logs to its default folder and throws exceptions in the endpoint.

Register Model using Custom Docker Image

If you need modifications to the endpoint Docker image that are not provided by the pre-built AWS images and cannot be added by extending them, you can build your own custom image and create the model with that image URI.

Below is an example Docker image folder structure based on the extended model package structure presented earlier. More details can be found in the AWS documentation.

├── Dockerfile
├── app.py
├── requirements.txt
├── .env
└── src
├── __init__.py
├── custom_estimator.py
└── inference.py
  • Dockerfile: The Dockerfile.
  • app.py: A Python web application server that serves GET /ping and POST /invocations on port 8080 (a minimal sketch follows this list).
  • requirements.txt: Related third-party libraries.
  • .env: Contains the environment variables (ECR repository etc.) used in the GitHub Action.
  • /src/custom_estimator.py: Custom estimator class implementation.
  • /src/inference.py: Custom inference implementation.
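A minimal sketch of such an app.py, assuming Flask; any web framework that answers GET /ping and POST /invocations on port 8080 works the same way:

from flask import Flask, request, jsonify

from src.inference import model_fn, input_fn, predict_fn, output_fn

app = Flask(__name__)
model = model_fn("/opt/ml/model")  # SageMaker mounts the model artifact here


@app.route("/ping", methods=["GET"])
def ping():
    # Health check: a 200 response tells SageMaker the container is ready.
    return "", 200


@app.route("/invocations", methods=["POST"])
def invocations():
    data = input_fn(request.data, request.content_type)
    prediction = predict_fn(data, model)
    return jsonify(output_fn(prediction, "application/json"))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)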

The custom Docker image should be pushed to an ECR repository, and the repository URI is used as the image URI.

One disadvantage of this approach is that all related inference files need to be included in the image. As a result, if any modification is required in the inference code, the image must be rebuilt and redeployed to the endpoint, which makes the deployment process longer. However, if frequent implementation changes are not necessary, opting for a custom Docker image is still a viable option.

Model Deployment

Model deployment consists of three parts:

  • A registered model in SageMaker Inference
  • An endpoint configuration
  • An endpoint

Endpoint configurations and endpoints can be created using the Boto3 SageMaker client. In our infrastructure, we designed an event-based deployment. Since models are additionally registered in the Model Registry, we are able to track model versions and statuses. The model status can be PendingManualApproval, Approved, or Rejected. When a model version's status changes, an event is generated in EventBridge. We have implemented a Lambda job that listens for the SageMaker Model Package State Change event and, when the event arrives with Approved status, creates or updates the serverless inference endpoint.
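A simplified sketch of that Lambda handler is shown below; deploy_serverless_endpoint is a placeholder for the create/update logic covered in the next sections:

import boto3

sagemaker_client = boto3.client("sagemaker")


# Triggered by the EventBridge rule for "SageMaker Model Package State Change"
def lambda_handler(event, context):
    detail = event["detail"]

    if detail.get("ModelApprovalStatus") != "Approved":
        return  # Only approved model versions are deployed

    model_package_group_name = detail["ModelPackageGroupName"]

    # Placeholder for the endpoint config / endpoint creation or update logic
    deploy_serverless_endpoint(sagemaker_client, model_package_group_name)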

Generate Endpoint Configuration

Endpoint Configuration contains the necessary settings and information required to create and manage an endpoint for serving predictions.

An endpoint configuration specifies details such as:

  • Model(s) to deploy: The model(s) you want to deploy to the endpoint. This includes the reference to the model stored in the SageMaker Inference Model menu.
  • Initial variant weight: In the case of deploying multiple models or model versions (A/B testing or multi-model endpoints), the initial variant weight determines how the traffic is distributed among the different models or model versions.
  • Additional settings: Any other settings that might be required for your specific use case, such as serverless configurations, custom environment variables, network settings, or encryption settings. Custom environment variables can also be provided in the registered model itself.

# Generates the SageMaker endpoint config
def generate_serverless_endpoint_config(client, model_arn, memory_in_mb, max_concurrency):
    """
    Args:
        client: SageMaker boto3 client
        model_arn: Related SageMaker model arn
        memory_in_mb: Memory to be allocated by the endpoint
        max_concurrency: Max concurrency provided by the endpoint
    Returns:
        Endpoint config name
    """

    model_name = generate_model_name(model_arn)
    endpoint_config_name = f"{model_name}-serverless-endpoint-config"

    # Check if the endpoint config exists
    response = client.list_endpoint_configs()
    endpoint_config_list = [e['EndpointConfigName'] for e in response['EndpointConfigs']]

    if endpoint_config_name in endpoint_config_list:
        logger.info(f'Endpoint config {endpoint_config_name} exists. Deleting ...')
        # Delete the existing config
        client.delete_endpoint_config(
            EndpointConfigName=endpoint_config_name
        )

    logger.info(f'Creating endpoint config {endpoint_config_name} with model {model_name}...')
    create_endpoint_config_response = client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "ServerlessConfig": {"MemorySizeInMB": memory_in_mb, "MaxConcurrency": max_concurrency},
                "ModelName": model_name,
                "VariantName": "AllTraffic",
            }
        ],
    )
    logger.info(f"{endpoint_config_name} is generated with model {model_name} ...")
    logger.info(
        f"{endpoint_config_name} Endpoint Configuration Arn: {create_endpoint_config_response['EndpointConfigArn']}")
    return endpoint_config_name
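As an illustration, the helper can be called as follows; the memory and concurrency values are placeholders that depend on the model being served:

# Illustrative call; values depend on the model being served.
endpoint_config_name = generate_serverless_endpoint_config(
    client=sagemaker_client,
    model_arn=create_model_response["ModelArn"],
    memory_in_mb=4096,
    max_concurrency=20,
)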

Generate Endpoint

Using the created endpoint configuration, the endpoint can be generated through the create_endpoint method of the Boto3 client.

import time

# Generates the SageMaker serverless endpoint
def generate_serverless_endpoint(client, endpoint_name, endpoint_config_name):
    """
    Args:
        client: SageMaker boto3 client
        endpoint_name: Specified endpoint name to generate
        endpoint_config_name: Endpoint config name bound to the endpoint
    """

    create_endpoint_response = client.create_endpoint(
        EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
    )

    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)

    while describe_endpoint_response["EndpointStatus"] == "Creating":
        describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
        logger.info(f"Creation status of {endpoint_name} is {describe_endpoint_response['EndpointStatus']}")
        time.sleep(15)

    logger.info(f"{endpoint_name} is generated with arn {create_endpoint_response['EndpointArn']}")

Update Endpoint

If an existing endpoint needs to be updated, the update_endpoint method can be used.

# Updates the SageMaker endpoint using a new endpoint config
def update_serverless_endpoint(client, endpoint_name, endpoint_config_name):
    """
    Args:
        client: SageMaker boto3 client
        endpoint_name: Specified endpoint name to update
        endpoint_config_name: Endpoint config name bound to the endpoint
    """
    update_endpoint_response = client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name)

    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)

    while describe_endpoint_response["EndpointStatus"] == "Updating":
        describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
        logger.info(f"Update status of {endpoint_name} is {describe_endpoint_response['EndpointStatus']}")
        time.sleep(15)

    logger.info(f"{endpoint_name} is updated with arn {update_endpoint_response['EndpointArn']}")

Model Inference Request

The SageMaker endpoint is available through REST requests using the public URI provided in the SageMaker Console, or through the SageMaker Boto3 runtime client.

import boto3

# SageMaker Boto3 runtime client
sm_client = boto3.client(service_name="sagemaker-runtime")

response = sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body="this is a test",
    ContentType="text/csv",
)

print(response["Body"].read())

import requests
import json
from requests_auth_aws_sigv4 import AWSSigV4

# Through REST requests
aws_auth = AWSSigV4('sagemaker',
                    aws_access_key_id=<AWS_ACCESS_KEY>,
                    aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>,
                    aws_session_token=<AWS_SESSION_TOKEN>,
                    )

headers = {'Content-Type': 'application/json'}
payload = {'input': 'this is a test'}

r = requests.request('POST', 'https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/<serverless_endpoint_name>/invocations',
                     data=json.dumps(payload),
                     headers=headers,
                     auth=aws_auth)

print(json.loads(r.content))

Conclusion

In this post, I am glad to share our SageMaker Serverless Inference experience and practices at Picus. AWS SageMaker Serverless Inference provides an efficient, scalable, and easy-to-manage solution for model serving, forming an integral part of modern MLOps pipelines. By understanding its benefits and following the outlined implementation steps, you can harness the power of serverless inference in SageMaker to accelerate your MLOps workflows and enhance the overall efficiency of your machine learning projects. It is sometimes hard to find relevant documentation with detailed explanations; in those cases, we believe that sharing experience with the community is important for further developments.
