Create Your Own Docker Container for Model Serving in Sagemaker
In our previous post, we created our own custom Docker image for model training on AWS Sagemaker. To serve a custom model with our own code logic, we need to introduce some additional settings and modules to our Docker image so that Sagemaker can use it to serve the model. Sagemaker provides two main modes of model serving: (1) Batch Transform and (2) Hosting Services (such as Realtime Inference Endpoints). There are also Serverless Inference and Asynchronous Inference, which sit somewhere between those two main modes and cover more specific use-cases. Batch Transform, as its name implies, is suitable for making predictions in bulk, in large batches, when needed, without much focus on latency. A Realtime Inference Endpoint, on the other hand, is more appropriate for serving real-time or near-real-time models with constant availability and low latency. In this post, we will cover how to prepare our code and Docker container to serve a trained model on Sagemaker. In the next post, we will cover how to perform Batch Transform as well as how to deploy a Realtime Inference Endpoint.
General Overview of the Interaction Between Sagemaker and the Docker Container
Figure 1 shows the general overview of how Sagemaker will consume our Docker container for model serving purposes. The figure shows only one EC2 instance, but Sagemaker can spin up multiple instances if we configure it to do so.
Step 1:
We provide our own Docker container to Sagemaker for it to use while setting up a special EC2 instance for model serving.
Step 2:
Sagemaker uses that Docker container to spin up an EC2 instance for serving the model.
Step 3:
After spinning up the instance, Sagemaker sends a ping request with the GET method to the instance to check whether everything is OK.
Step 4:
Sagemaker will wait for a response to the ping request, so the instance must return a success message if everything is OK; otherwise, it should return a failure message. If Sagemaker receives a success message, it will proceed to Step 5; otherwise, it will halt the instance and return a proper failure message.
Step 5:
Sagemaker will send invocation requests with the input data to the “/invocations” endpoint of the instance using the POST method.
Step 6:
Sagemaker will wait for a response from the instance that includes proper predictions. In case of any error, the endpoint should return a proper error message to Sagemaker.
For Batch Transform option:
When all the input data is processed and all the predictions are retrieved, Sagemaker will halt the instance, save the predictions to a specified S3 folder, and then return a proper message.
For Realtime Inference Endpoint option:
Sagemaker will accept invocation requests continuously, keeping the inference instance up at all times.
As you may notice, our Docker container must serve a /ping endpoint with the GET method and an /invocations endpoint with the POST method, both specifically on port 8080. For the Batch Transform option, it may additionally serve an /execution-parameters endpoint with the GET method to provide some special parameters for batch transform. In other words, we must have a web server running inside our Docker container that serves those endpoints. The simplest way to achieve this is by using Flask, Gunicorn and Nginx. Let’s see how we can configure our container with those.
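To make this contract concrete, the traffic between Sagemaker and a running container looks roughly like the following sketch (it assumes the requests package and a container already running locally on port 8080; the CSV payload is the same example row we will use later in this post):
import requests

# Health check: Sagemaker sends GET /ping and expects HTTP 200 if the container is healthy.
ping = requests.get("http://localhost:8080/ping")
print(ping.status_code)

# Inference: Sagemaker POSTs the input payload to /invocations and reads the predictions back.
response = requests.post(
    "http://localhost:8080/invocations",
    data="5.1,3.5,1.4,0.2",
    headers={"Content-Type": "text/csv"},
)
print(response.status_code, response.text)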
1 - Create One Docker Image For Both Training and Model Serving On Sagemaker
In our previous post, we created our Docker container only for model training in Sagemaker. Now, besides training, we will see how to modify the same Dockerfile so that the container can also serve models. So, we will have one Docker container for both training and model serving. To continue with the next steps in this post, we must have the setup from our previous post, where we created our Dockerfile, Pipfile and Pipfile.lock in the project root, and a /src folder in the project root with the train file in it:
├── Dockerfile
├── Pipfile
├── Pipfile.lock
└── src
    └── train
1.1 — Installing Relevant Packages Using Pipenv
We need to install the following packages to be able to start a web server with Python in the Docker container. Go to the root of the project folder that we created in our previous post, and activate the virtual environment using:
# In the root of project folder where our Pipfile and Pipfile.lock resides
pipenv shell
Then, in the activated virtual environment, install the following packages with:
# Within activated virtual environment
pipenv install gunicorn gevent flask
It will install the packages, and update Pipfile and Pipfile.lock files accordingly.
A brief info about those packages:
Gunicorn is a production-grade WSGI HTTP server for Python that runs multiple worker processes of a web application.
Gevent is a coroutine-based Python networking library that lets us write simple, sequential code while still getting the scalability of asynchronous IO and lightweight multi-threading; it is commonly used as a Gunicorn worker class.
Flask is a popular, extensible web microframework for building web applications with Python.
1.2 — Creating Necessary Files That Don’t Require Many Changes
For model serving, we need to add four extra files into our Docker container: nginx.conf, predictor.py, serve and wsgi.py.
The explanation of those files from the AWS documentation is as follows:
nginx.conf is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.
predictor.py is the program that actually implements the Flask web server and the prediction codes for this app. You’ll want to customize the actual prediction parts to your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.
serve is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in predictor.py. You should be able to take this file as-is.
wsgi.py is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.
All of the files except predictor.py are pretty standard and require no or very few changes:
- The file nginx.conf can be downloaded from HERE.
- The file serve can be downloaded from HERE.
- The file wsgi.py can be downloaded from HERE.
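In case the link is not accessible, wsgi.py is essentially a one-line wrapper; a minimal sketch, assuming the Flask app object is named app in predictor.py (as it is in the code later in this post), looks like this:
# wsgi.py - a tiny wrapper so that gunicorn can find the Flask app.
# It only re-exports the app object defined in predictor.py.
import predictor as myapp

app = myapp.app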
⚠️ In serve file, you may need to change the following code:
nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf'])
to:
nginx = subprocess.Popen(['nginx', '-c', '/opt/app/nginx.conf'])
because in our Dockerfile, we defined /opt/app as the app folder, not /opt/program.
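If you cannot access the link, here is a condensed sketch of what a typical serve program looks like, based on Sagemaker's standard bring-your-own-container example, with /opt/app already substituted as described above. The socket path, worker count and timeout are illustrative values and must match your nginx.conf:
#!/usr/bin/env python
# A condensed sketch of what serve does: start nginx and gunicorn, then exit when either dies.
import multiprocessing
import os
import subprocess
import sys

model_server_timeout = os.environ.get("MODEL_SERVER_TIMEOUT", 60)
model_server_workers = int(os.environ.get("MODEL_SERVER_WORKERS", multiprocessing.cpu_count()))


def start_server():
    print("Starting the inference server with {} workers.".format(model_server_workers))

    # Nginx listens on port 8080 and proxies requests to gunicorn through a unix socket (see nginx.conf).
    nginx = subprocess.Popen(["nginx", "-c", "/opt/app/nginx.conf"])
    gunicorn = subprocess.Popen(
        [
            "gunicorn",
            "--timeout", str(model_server_timeout),
            "-k", "gevent",
            "-b", "unix:/tmp/gunicorn.sock",
            "-w", str(model_server_workers),
            "wsgi:app",
        ]
    )

    # If either process exits, terminate the other one and exit as well.
    pids = {nginx.pid, gunicorn.pid}
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    for process in (nginx, gunicorn):
        try:
            process.terminate()
        except OSError:
            pass

    print("Inference server exiting")
    sys.exit(0)


if __name__ == "__main__":
    start_server()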
Now, we need to implement our own predictor.py code.
1.3 — Creating predictor.py File
This is the critical file in which we will define our endpoints and code logic for inference. This file must have certain implementations:
- /ping (required) endpoint with GET method to return a proper message to Sagemaker
- /invocations (required) endpoint with POST method to accept incoming payloads from Sagemaker and to return the predictions/outputs to Sagemaker
- /execution-parameters (only for batch transform, optional) — Allows the algorithm to provide the optimal tuning parameters for a job during runtime. Based on the memory and CPUs available for a container, the algorithm chooses the appropriate MaxConcurrentTransforms, BatchStrategy, and MaxPayloadInMB values for the job.
# This is the file that implements a flask server to do inferences. It's the file that you will modify to
# implement the scoring for your own algorithm.
from __future__ import print_function

import io
import json
import os
import pickle
import signal
import sys
import traceback

import flask
import pandas as pd

prefix = os.environ.get("ARTEFACT_PATH", "/opt/ml/")
model_path = os.path.join(prefix, "model")

FEATURES = os.environ.get("FEATURES", "")
FEATURES = FEATURES.split(",")


# A singleton for holding the model. This simply loads the model and holds it.
# It has a predict function that does a prediction based on the model and the input data.
class ScoringService(object):
    model = None  # Where we keep the model when it's loaded

    @classmethod
    def get_model(cls, model_path):
        """Get the model object for this instance, loading it if it's not already loaded."""
        if cls.model is None:
            with open(os.path.join(model_path, "model.pckl"), "rb") as inp:
                cls.model = pickle.load(inp)
        return cls.model

    @classmethod
    def predict(cls, data):
        """For the input, do the predictions and return them.

        Args:
            data (a pandas dataframe): The data on which to do the predictions. There will be
                one prediction per row in the dataframe"""
        clf = cls.get_model(model_path=model_path)
        if hasattr(clf, "predict_proba"):
            return clf.predict_proba(data)[:, 1]
        if hasattr(clf, "predict"):
            return clf.predict(data)
        raise AttributeError("Model does not have predict_proba or predict methods")


# The flask app for serving predictions
app = flask.Flask(__name__)


@app.route("/ping", methods=["GET"])
def ping():
    """Determine if the container is working and healthy. In this sample container, we declare
    it healthy if we can load the model successfully."""
    print(model_path)
    health = ScoringService.get_model(model_path) is not None  # You can insert a health check here

    status = 200 if health else 404
    return flask.Response(response="\n", status=status, mimetype="application/json")


@app.route("/invocations", methods=["POST"])
def transformation():
    """Do an inference on a single batch of data. In this sample server, we take data as CSV, convert
    it to a pandas data frame for internal use and then convert the predictions back to CSV (which really
    just means one prediction per line, since there's a single column).
    """
    data = None

    # Convert from CSV to pandas
    if flask.request.content_type == "text/csv":
        data = flask.request.data.decode("utf-8")
        s = io.StringIO(data)
        data = pd.read_csv(s, header=None)
        data.columns = FEATURES
        print('Columns:', data.columns)
        print(data)
    else:
        return flask.Response(
            response="This predictor only supports CSV data", status=415, mimetype="text/plain"
        )

    print("Invoked with {} records".format(data.shape[0]))

    # Do the prediction
    predictions = ScoringService.predict(data)

    # Convert from numpy back to CSV
    out = io.StringIO()
    pd.DataFrame({"results": predictions}).to_csv(out, header=False, index=False)
    result = out.getvalue()

    return flask.Response(response=result, status=200, mimetype="text/csv")
We can change the insides of the ping and transformation functions as we wish, as long as they handle the input properly and return the kind of response that Sagemaker requires.
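If you plan to use Batch Transform and want to support the optional /execution-parameters endpoint described earlier, a minimal sketch could be added to predictor.py like this (the returned values are illustrative placeholders; choose them based on your container's memory and CPUs):
@app.route("/execution-parameters", methods=["GET"])
def execution_parameters():
    """Suggest Batch Transform tuning parameters to Sagemaker. The values below are
    illustrative placeholders, not tuned recommendations."""
    params = {
        "MaxConcurrentTransforms": 1,
        "BatchStrategy": "MULTI_RECORD",
        "MaxPayloadInMB": 6,
    }
    return flask.Response(response=json.dumps(params), status=200, mimetype="application/json")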
1.4 — Updating Our Dockerfile to Install Nginx
The only additions to the Dockerfile from our previous post are adding nginx to the apt-get install list and adding execution permission to the serve file:
FROM --platform=linux/x86-64 python:3.8

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    libusb-1.0-0-dev \
    libudev-dev \
    build-essential \
    ca-certificates \
    # --------------------> HERE IS THE CHANGE
    nginx && \
    rm -fr /var/lib/apt/lists/*

# Keep python from buffering the stdout - so the logs flushed quickly
ENV PYTHONUNBUFFERED=TRUE

# Don't compile bytecode
ENV PYTHONDONTWRITEBYTECODE=TRUE

ENV PATH="/opt/app:${PATH}"
ENV PYTHONPATH=.

RUN pip3 install pipenv==2022.7.4

# Install packages
WORKDIR /opt/app
COPY Pipfile Pipfile.lock ./
RUN pipenv install --deploy --system --dev

# Add src code
COPY src ./
RUN chmod +x train
RUN chmod +x serve # --------------------> HERE IS THE CHANGE
A brief info about nginx: Nginx is a lightweight, high-performance web server and reverse proxy. In our container it sits in front of Gunicorn, listens on port 8080 and forwards the /ping and /invocations requests to the Flask app.
2 - Testing The Docker Container Locally
We have made all the changes we need in our code base to train and serve the model using a Docker container with Sagemaker. We can now test it locally before actually testing it on the Sagemaker side.
2.1 — Testing the Docker Container Locally For Training
First, let’s create a docker-compose.yml file with the following content in the root folder:
---
version: "3.3"

services:
  training:
    build: .
    container_name: byoc_training
    command: train
    volumes:
      - ./ml_data:/opt/ml/
    env_file:
      - .env
In that docker-compose.yml file, we mount our local ./ml_data folder to the container’s /opt/ml/ path. So when we run the container using docker compose, any files/folders under ./ml_data will be accessible under the container’s /opt/ml/ path, and any changes under the container’s /opt/ml/ path will be reflected back to our local ./ml_data folder. For this purpose, let’s create an ml_data folder in the project root with the proper folders/files in it:
├── Dockerfile
├── Pipfile
├── Pipfile.lock
├── build_and_push.sh
├── docker-compose.yml
├── ml_data
│ ├── input
│ │ ├── config
│ │ │ └── hyperparameters.json
│ │ └── data
│ │ └── train
│ │ └── master_df.csv
│ ├── model
│ └── output
└── src
    ├── nginx.conf
    ├── predictor.py
    ├── serve
    ├── train
    └── wsgi.py
- Since we are reading the hyperparameters in the train file from /opt/ml/input/config/hyperparameters.json, we must place that file under ml_data/input/config/hyperparameters.json.
- Since we are reading the training data in the train file from /opt/ml/input/data/train/master_df.csv, we must place that file under ml_data/input/data/train/master_df.csv.
In our train file, we read TRAINING_FILE_NAME and FEATURES as environment variables. So when we run our Docker container for local training, we must pass those environment variables into the container. The easiest way to do this is to create a .env file, with the environment variables defined in it, in the same folder as the docker-compose.yml file, and then pass that file to the container. The last two lines in the docker-compose.yml file do this. Our .env file looks like this:
TRAINING_FILE_NAME=master_df.csv
FEATURES="sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)"
ARTEFACT_PATH=/Users/mertatli/medium/ml_data #Change it to your local ml_data path
Then let’s create a Makefile with the following targets in the root folder:
SHELL := /bin/bash

train:
	docker-compose build training \
	&& docker-compose run training

inference:
	cd src/ \
	&& pipenv run flask run # if a .env file is present, pipenv run will automatically load it

ping:
	curl --location --request GET 'http://127.0.0.1:5000/ping'
Up to this point, we must have the following folder structure:
├── .env
├── Dockerfile
├── Makefile
├── Pipfile
├── Pipfile.lock
├── build_and_push.sh
├── docker-compose.yml
├── ml_data
│ ├── input
│ │ ├── config
│ │ │ └── hyperparameters.json
│ │ └── data
│ │ └── train
│ │ └── master_df.csv
│ ├── model
│ └── output
└── src
    ├── nginx.conf
    ├── predictor.py
    ├── serve
    ├── train
    └── wsgi.py
Then open up a shell in the root folder and run the following command:
make train
If it runs successfully, we should see feature_importance.pckl and model.pckl under the ml_data/model folder:
ml_data
├── input
│ ├── config
│ │ └── hyperparameters.json
│ └── data
│ └── train
│ └── master_df.csv
├── model
│ ├── feature_importance.pckl
│ └── model.pckl
└── output
Congrats, we successfully trained our model and produced the model artefacts. Next, we will run the container for model serving.
2.2 — Testing the Docker Container Locally For Model Serving
Open up a shell in the root folder, then run the following command:
make inference
It should start a local Flask server that is ready to accept calls to our endpoints.
Then open the Postman application and send a ping request with the GET method to http://127.0.0.1:5000/ping.
Then send an /invocations request with the POST method to http://127.0.0.1:5000/invocations with the text body ‘5.1,3.5,1.4,0.2’.
Note that you must send the request with the Content-Type header set to text/csv, because our API only accepts that content type; if you want to support other content types, you need to change the following check in predictor.py:
if flask.request.content_type == "text/csv":
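If you prefer the command line to Postman, the same local test can be done with a short Python snippet (a sketch assuming the requests package is installed; the local Flask dev server listens on port 5000, whereas the container behind nginx listens on 8080):
import requests

# Send one CSV row to the locally running Flask app with the required content type.
response = requests.post(
    "http://127.0.0.1:5000/invocations",
    data="5.1,3.5,1.4,0.2",
    headers={"Content-Type": "text/csv"},
)
print(response.status_code, response.text)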
3 - Conclusion
In this post, we covered how to prepare our code for training and serving purposes in Sagemaker using Docker. In the next post, we will cover how to run Batch Transform jobs on Sagemaker using our Docker container, as well as how to deploy a Realtime Inference Endpoint using the same Docker container. While serving the model on the Sagemaker side, we may need to introduce some minor changes/settings in the predictor.py file, as we will see shortly.