How to Create a SageMaker Multi Model Endpoint With a Custom Docker Container and Model

Martin Ye · Published in Loopio Tech · 6 min read · Apr 28, 2023

Machine Learning (ML) models can be deployed individually to single endpoints. At scale, however, managing what could be hundreds or thousands of different endpoints becomes complex and time-consuming. Instead, we can leverage AWS SageMaker's ability to host multiple models on a single endpoint. There are various benefits to using multi-model endpoints over having an individual endpoint for each model. These include:

  • Reduced costs, since serving multiple models from a single instance helps prevent over-provisioning.
  • Improved scalability, by reducing the overhead of managing many separate endpoints.
  • Support for auto-scaling policies and batch processing across multiple models simultaneously.

Multi-model endpoints also allow models to be added or updated seamlessly, without disrupting existing services.

What is AWS SageMaker

AWS SageMaker is a fully managed ML service that provides a complete end-to-end solution for building, training, and deploying models at scale. One of the benefits of using AWS SageMaker is the ability to deploy models to endpoints, which allows for easier integration of models into applications and services.

In order to use SageMaker for this purpose, the models need to be packaged and deployed as endpoints using Docker images.

What is Docker

Docker images are lightweight, portable, and self-contained environments with all the dependencies and configurations needed to run an application. SageMaker already provides a range of pre-built Docker containers for specific deep learning frameworks such as TensorFlow, PyTorch, and MXNet. A list of these can be found here. However, sometimes we may want to deploy custom models that are not built on frameworks natively managed by AWS. In this case, custom Docker images need to be created to meet specific requirements or use cases.

Flowchart for deciding whether a custom container is needed. Image from Amazon.

Setting Up Custom Models in SageMaker

We will go through the process of setting up multiple text classification FastText models for SageMaker deployment. FastText models are not directly supported by SageMaker, which means we will need to build a custom Docker image and publish it to AWS Elastic Container Registry (ECR). For some background, FastText is an open-source, free, lightweight library used for text representation and classification. It was created by Facebook's AI Research (FAIR) team and can be used as a standalone library or with other deep learning frameworks.

In this article, we will be covering the following steps of this process:

  1. Creating a Custom Docker container
  2. Creating and uploading the model artifacts
  3. Creating the Endpoint
  4. Invoking the Endpoint

Note that we will be working with FastText models stored in FastText's quantized, compressed format (.ftz).

An example of how real-time inference endpoints work with AWS SageMaker. Image from Amazon.

Creating the Custom Docker Container

Since our models are not directly supported by AWS SageMaker, we will need to create a custom Docker container to host them. This Docker container will handle loading our models as well as serving inference requests. To do this we will be using the open-source Multi Model Server (MMS) provided by AWSLabs: https://github.com/awslabs/multi-model-server

Our directory will look something like this:

/
|-Dockerfile
|-dockerd-entrypoint.py
|-model_handler.py

Dockerfile

For the Dockerfile, we can use one of the many prebuilt ones provided by MMS, which can be found here. Just make sure that you update the Dockerfile to also install the FastText library and MXNet, as well as any other libraries you may need.

We will also include sagemaker-inference, since we will need a method from that library. SageMaker periodically pings your model container's /ping endpoint to check whether it is healthy and able to serve requests, and the container must respond to these pings with a 200 status. We can implement this ourselves, or we can use the model_server provided by the AWS SageMaker Inference Toolkit, which handles it for us. The source code can be found here. In our case we will be using the provided model_server.

For example, if you go to the link provided above and choose the first option (Dockerfile.cpu), we would make the following changes to line 23:

RUN pip install --no-cache-dir multi-model-server \
    && pip install --no-cache-dir mxnet \
    && pip install --no-cache-dir sagemaker-inference \
    && pip install --no-cache-dir fasttext

Docker Entrypoint

In the Dockerfile we set the entry point to be the dockerd-entrypoint.py file. In this file we do two things, sketched in the example after this list:

  1. start our model server
  2. set the service handler function which will be the handle method in the model_handler.py file.
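
A minimal dockerd-entrypoint.py might look like the following sketch, which assumes model_handler.py was copied to /home/model-server/ in the Dockerfile:

import shlex
import subprocess
import sys

from sagemaker_inference import model_server

if sys.argv[1] == "serve":
    # start MMS and register our handle function as the inference service
    model_server.start_model_server(
        handler_service="/home/model-server/model_handler.py:handle"
    )
else:
    subprocess.check_call(shlex.split(" ".join(sys.argv[1:])))

# keep the container running so the endpoint stays alive
subprocess.call(["tail", "-f", "/dev/null"])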

Model handler

When we start the model server we have the ability to specify our request handler service. In our case we have it set to be the handle function inside the model_handler file. This means that whenever we invoke the endpoint, this handle function will be called. It will also be called when we first start up the endpoint.

Our model_handler.py file will look something like this:
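
Here is a minimal sketch, assuming each model tarball contains a single .ftz file and that requests arrive as a JSON-encoded list of strings; your handler may differ:

import json
import os

import fasttext


class ModelHandler:
    def __init__(self):
        self.initialized = False
        self.model = None

    def initialize(self, context):
        # SageMaker extracts the model tarball into this directory for us
        model_dir = context.system_properties.get("model_dir")
        # assume the tarball contains exactly one .ftz model file
        ftz_file = next(f for f in os.listdir(model_dir) if f.endswith(".ftz"))
        self.model = fasttext.load_model(os.path.join(model_dir, ftz_file))
        self.initialized = True

    def predict_label(self, data):
        # assume the request body is a JSON-encoded list of strings
        body = data[0].get("body")
        if isinstance(body, (bytes, bytearray)):
            body = body.decode("utf-8")
        texts = json.loads(body)
        labels, probabilities = self.model.predict(texts)
        return [
            {"labels": list(l), "probabilities": [float(p) for p in probs]}
            for l, probs in zip(labels, probabilities)
        ]


_service = ModelHandler()


def handle(data, context):
    # called with data=None at model load, then on every invocation
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    # MMS expects one response element per request in the batch
    return [_service.predict_label(data)]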

The handle function is responsible for both initializing the model and running inference. When we first start up the endpoint, it will check whether the model is initialized. If it is not, the handler will initialize the model, where SageMaker will automatically:

  1. unzip the tarballs in the S3 bucket that you specify
  2. load the model artifacts into the /opt/ml/model/ directory, which is the path that context.system_properties.get("model_dir") gives us.

In the predict_label method we assume that requests come in as a list, which allows us to predict multiple entries at once. Your predict function may look different depending on your use case and requirements.

Now we can build our image and push it to AWS ECR. Make sure you have a repository created in ECR; this can be done through the ECR console. Once the push is complete, copy the image URI, which we will need later.

Creating the Model Artifacts

AWS SageMaker expects model artifacts to be uploaded to an S3 bucket as tarballs, one per model. This example was executed in a Databricks notebook, but any environment of your choice should be fine. Note that in the example code below, we have two models and each model is in its own folder.

/
|- model1
| |- 1.ftz

We do not need any other files in the tarball except for the .ftz file which is the model itself.
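
A minimal sketch of packaging and uploading the artifacts with boto3, using illustrative names (my-model-bucket, model1, model2):

import tarfile

import boto3

bucket = "my-model-bucket"  # illustrative bucket name
s3 = boto3.client("s3")

for model_name in ["model1", "model2"]:
    tar_path = f"{model_name}.tar.gz"
    # the tarball only needs the .ftz model file itself
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(f"{model_name}/1.ftz", arcname="1.ftz")
    # all model tarballs live under a common S3 prefix for the endpoint
    s3.upload_file(tar_path, bucket, f"models/{tar_path}")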

Creating the Endpoint Config and Endpoint

Create SageMaker Model

Now we move on to actually creating the multi-model endpoint. In order to do this we will first need to create a SageMaker model.
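
A sketch of the create_model call, with illustrative names for the ECR image URI, S3 prefix, and IAM role; the key detail is setting Mode to MultiModel:

import boto3

sm = boto3.client("sagemaker")

sm.create_model(
    ModelName="fasttext-multi-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    PrimaryContainer={
        # URI of the custom image we pushed to ECR earlier
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/fasttext-mms:latest",
        # tells SageMaker to treat this as a multi-model endpoint
        "Mode": "MultiModel",
        # S3 prefix that holds the model tarballs
        "ModelDataUrl": "s3://my-model-bucket/models/",
    },
)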

Create Endpoint Config

Make sure your role has adequate permissions to create models and endpoints in SageMaker. After the model has been created, we create the endpoint config, which we will then use to create the actual endpoint.
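
A sketch, reusing the sm client and model name from above; the instance type is an illustrative choice:

sm.create_endpoint_config(
    EndpointConfigName="fasttext-multi-model-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "fasttext-multi-model",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }
    ],
)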

Create Endpoint

Now that the endpoint config has been created, we can go ahead and create the actual endpoint. It can take some time for the endpoint to be created, so we use a SageMaker waiter to tell us when endpoint creation is complete.
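
Again with illustrative names:

sm.create_endpoint(
    EndpointName="fasttext-multi-model-endpoint",
    EndpointConfigName="fasttext-multi-model-config",
)

# block until the endpoint is fully created (or creation fails)
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName="fasttext-multi-model-endpoint")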

Check for Completion

To check that the endpoint has been created successfully, you can either go into the SageMaker console and check the dashboard, or you can query the endpoint status programmatically.
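
For example, reusing the sm client and endpoint name from the previous steps:

status = sm.describe_endpoint(EndpointName="fasttext-multi-model-endpoint")
print(status["EndpointStatus"])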

The endpoint should say that it is “InService” once creation is successfully complete.

Invoking the Endpoint

Now all that is left is to invoke the endpoint. When invoking a multi-model endpoint, we have to specify which model we want to invoke via the TargetModel parameter.

Remember that our docker container expects the request data to be in a list, which is why the request_data is a list.
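
A sketch of an invocation, where TargetModel names the tarball (relative to the S3 prefix) to route the request to; names are illustrative:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# our container expects a JSON-encoded list of texts to classify
request_data = ["first document to classify", "and here is another one"]

response = runtime.invoke_endpoint(
    EndpointName="fasttext-multi-model-endpoint",
    TargetModel="model1.tar.gz",  # which model tarball to load and invoke
    ContentType="application/json",
    Body=json.dumps(request_data),
)
print(json.loads(response["Body"].read()))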

Conclusion

AWS SageMaker's multi-model endpoints provide a powerful and flexible way to deploy ML models at scale, and the ability to create custom Docker containers lets you tailor the deployment environment to specific requirements and use cases. In this article we demonstrated how to use a custom Docker container to host FastText models in a multi-model endpoint. Good luck setting up your custom SageMaker endpoint! Comment below with any questions or feedback, and check out some of the other articles from Loopio.
