Deploy a Scalable ML Model Using AWS SageMaker

Madhur Dheer · Published in TheTeamMavericks · Apr 8, 2020

Figure 1. AWS SageMaker

A while ago I was struggling to deploy a scalable ML model that could handle parallel requests. I searched through many techniques for scaling a model to handle parallel requests, but most of them either had problems of their own or required tedious, unnecessary work. Some of the issues I faced were:

  1. Load Balancing
  2. Concurrent Request handling
  3. Latency

After pondering over various technologies, I came across AWS SageMaker. Within a week, I fell in love with the enormous support it provides. It is the most feasible way to meet all the infrastructural needs for scaling an ML model so that it can be deployed to a larger base of customers.

Why Is the Scalability of ML Models Required?

Figure 2. A GIF demonstration of sentiment analysis with AWS Comprehend

Let’s take a look at this sample GIF. It is an implementation of sentiment analysis using AWS Comprehend, which internally uses a machine-learning-based model to process the input and produce the required results. (If you want to give it a try, use this link: https://aws.amazon.com/getting-started/tutorials/analyze-sentiment-comprehend/.)
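If you would rather call Comprehend from code than from the console demo, here is a minimal boto3 sketch; the sample sentence, and the assumption that your AWS credentials and region are already configured, are mine and not part of the tutorial:

import boto3

# Comprehend client (assumes AWS credentials and a region are configured)
comprehend = boto3.client('comprehend')

# Run sentiment analysis on a sample sentence
response = comprehend.detect_sentiment(
    Text="The delivery was quick and the product works great!",
    LanguageCode='en'
)
print(response['Sentiment'])       # e.g. POSITIVE
print(response['SentimentScore'])  # confidence scores per sentiment class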

Have you ever wondered whether an ML model you have prepared can be scaled in such a way that it becomes available to customers? I have personally seen a lot of fascinating machine learning, deep learning and natural language processing projects being developed out there. The end goal is to deploy such a model so that it is available to customers. Several requirements should be met:

  1. The endpoint deployed should handle concurrent requests.
  2. The endpoint deployed should ensure that latency for every request made is not high.
  3. The endpoint deployed should also have a load-balancing technique so that if network ingestion (i.e. use of the endpoint) is too high, it handles the requests without crashing.

With these points in mind, it is important to develop the project in such a way that it is capable of handling the aforementioned requirements. This is where the power of AWS SageMaker comes in: it can help you package your custom models so that these scalability requirements are met.

Prerequisites

This article will walk through a simple project to show you how ML models can be made to meet these scalability requirements. Nothing to worry about: I will explain all the necessary concepts here, but I will not be focusing on the machine learning itself. It would therefore be great if you have some machine learning background before reading this article. A basic understanding of Python will also help.

Some of the technologies used here (I will explain the first two briefly, but not in detail) are:

  1. Flask: Flask is a web application framework written in Python.
  2. Docker: It is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and deploy it as one package.
  3. AWS ECR: Amazon Elastic Container Registry (ECR) is a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images.
  4. AWS Sagemaker: Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models.

Also, I’ll be using this tutorial/guide as the frame of reference for this post. It’s a really good tutorial. There are a few reasons I still decided to write my own version:

  1. I feel that some details and key points are missing from the SageMaker documentation.
  2. It was difficult for me to find a step-by-step guide for building a project that meets all the scaling needs.
  3. Autoscaling a SageMaker endpoint also doesn’t have very detailed documentation, so I decided to cover that technique in this article.

Overview of SageMaker-Compatible Docker Containers

SageMaker expects the Docker image to have a specific folder structure. There are mainly two parent folders in which SageMaker expects the code and artifacts to be organized: /opt/program is the folder where the code resides, and /opt/ml is the folder where the artifacts reside.
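As a rough sketch (based on the SageMaker documentation and the awslabs example; the exact contents depend on your algorithm), the layout inside the container looks roughly like this:

/opt/program                # your code: train, serve, predictor.py, wsgi.py, nginx.conf
/opt/ml
    input/config            # hyperparameters.json and other job configuration
    input/data/training     # training data SageMaker copies in from S3 ("training" is the channel name used later in this article)
    model                   # model artifacts written by train and read by serve
    output/failure          # a description of the error if training fails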

Figure 3. Adapted from the awslabs/amazon-sagemaker-examples GitHub repository

Based on Figure 3, there are a few advantages to developing a SageMaker-based Docker model endpoint. These advantages are:

  1. Version handling: While developing ML models at a larger scale, it is sometimes important to store different versions of the artifacts, so that if one version fails you can easily revert to previous model artifacts.
  2. Failure handling: One of the best things about deploying endpoints in SageMaker is that failures are easy to capture and get properly logged. Failure logs can be captured in two ways: through CloudWatch logs, or via SageMaker’s built-in folder structure for storing failure logs.
  3. Autoscaling of the SageMaker endpoint: Scaling the endpoint with AWS SageMaker is very easy; it automatically load-balances incoming traffic across instances. What makes SageMaker endpoint autoscaling elegant is its on-demand scaling: based on the traffic received, it scales out when traffic is high and scales in when it is low, which keeps the solution cost-effective.

Note: To know more about autoscaling and its features, you can visit this page: https://aws.amazon.com/blogs/aws/auto-scaling-is-now-available-for-amazon-sagemaker/

A Sample Dockerized Flask Application

Before going ahead and developing the project, I think it is important to understand how a Flask application can run inside Docker. So we are going to develop a very simple Hello World Flask web app which will be exposed through a Docker container.

Let’s get started:-

We will start by creating a directory named dockerize_flask:

mkdir dockerize_flask

You can use this basic app.py file for your application (the Flask version used here is 0.10.1):

# flask_web/app.py
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hey, We have successfully Dockerized a Flask Web-app'

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0')

Now the next step is to create a simple Dockerfile. We will use an Ubuntu image and install Python 3 to run the Flask application. Let’s create a file named Dockerfile and add the commands mentioned below.

FROM ubuntu:16.04
LABEL maintainer="<your-email-id>"

RUN apt-get update -y && \
    apt-get install -y python3-pip python3-dev

EXPOSE 5000
WORKDIR /app

# Upgrade pip3
RUN pip3 install --upgrade pip

# Install Flask version 0.10.1
RUN pip3 install flask==0.10.1

COPY . /app

ENTRYPOINT [ "python3" ]
CMD [ "app.py" ]

Note: The EXPOSE command documents that the container listens on port 5000. Since the Flask application listens on port 5000 by default, we map the container’s port 5000 to port 5000 on the host to test whether the Flask application works locally.

To build and test the docker image follow these steps:-

Step 1: Build the Dockerfile you created.

docker build -t dockerize-flask:latest .

Step 2: Run the Docker image, mapping port 5000 on the host to port 5000 of the container, where the Flask application listens.

docker run -d -p 5000:5000 dockerize-flask

Step 3: Test the output by checking whether the web page renders at localhost:5000; you can also test it from code, as shown after Figure 4. You can see the output in Figure 4.

Figure 4. Testing the dockerized flask application
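If you prefer to test from code rather than the browser, here is a minimal sketch using the Python requests library (any HTTP client works), hitting the container started with the docker run command above:

import requests

# Hit the Flask app published on port 5000 of the running container
response = requests.get("http://localhost:5000/")
print(response.status_code)  # expect 200
print(response.text)         # expect the greeting returned by hello_world()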

Flow of the Project

Figure 5. Basic Flow of the Model

Let us break down the flow of the project into four steps:

  1. Dockerized Flask application: A Docker-based web application that will be used to get predictions for input coming from the customer.
  2. Deploy the Docker image to ECR: Docker images can be stored in the Elastic Container Registry, from where the Docker-based models can be deployed on bigger EC2 machines.
  3. AWS SageMaker endpoint deployment: The Docker-based web app will be deployed as an endpoint so that your project can use it and autoscale it. Another advantage of deploying the model in SageMaker is that you control the machine on which the Docker application runs, so based on your requirements you can use an EC2 instance with either high or low RAM.
  4. Autoscale the SageMaker endpoint: Once the SageMaker endpoint is deployed, we will autoscale it to handle concurrent requests. Believe me, autoscaling usually takes a lot of time; this method won’t take more than five minutes.

Build and deploy a custom ML model

For simplicity, let’s divide the project into four parts based on the aforementioned flow of the project. Let’s get started.

Note: The focus of this article is on how to scale an ML model so that it can be deployed to the market. Hence, I will be using this project for demonstration purposes. It uses a decision tree classifier for training and prediction.

Part A: Packaging and Uploading your Algorithm for use with Amazon SageMaker

Amazon SageMaker uses Docker to allow users to train and deploy their own custom algorithms. In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting, which means we can use the same image for both training and hosting.

Figure 6. Python-based serving stack

To build a production-grade inference server into the container, we use the following stack to make the implementer’s job simple:

  1. nginx is a light-weight layer that handles the incoming HTTP requests and manages the I/O in and out of the container efficiently.
  2. gunicorn is a WSGI pre-forking worker server that runs multiple copies of your application and load balances between them.
  3. flask is a simple web framework used in the inference app that you write. It lets you respond to calls on the /ping and /invocations endpoints without having to write much code (a minimal sketch follows this list).
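To make the /ping and /invocations contract concrete, here is a heavily simplified sketch of the Flask app inside predictor.py. The real file in the example repository also loads the decision-tree model and parses the CSV payload; the placeholder prediction below is mine:

import flask

app = flask.Flask(__name__)

@app.route('/ping', methods=['GET'])
def ping():
    # Health check: SageMaker calls this to verify the container is up.
    # A real implementation should also verify that the model can be loaded.
    return flask.Response(response='\n', status=200, mimetype='application/json')

@app.route('/invocations', methods=['POST'])
def invocations():
    # Inference: SageMaker forwards the body of each endpoint request here.
    data = flask.request.data.decode('utf-8')
    # ...parse the CSV, run the model, serialize the predictions...
    result = 'placeholder-prediction\n'  # replace with real predictions
    return flask.Response(response=result, status=200, mimetype='text/csv')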

The Structure of the Sample Code

The components are as follows:

A. Dockerfile: This installs all the required dependencies in the Docker image and copies five application files into it. These five files are:

  1. train: The main program for training the model. When you build your own algorithm, you’ll edit this to include your training code.
  2. serve: The wrapper that starts the inference server. In most cases, you can use this file as-is.
  3. wsgi.py: The start-up shell for the individual server workers. This only needs to be changed if you change where predictor.py is located or what it is named.
  4. predictor.py: The algorithm-specific inference server. This is the file that you modify with your own algorithm’s code.
  5. nginx.conf: The configuration for the nginx master server that manages the multiple workers. You can change the parameters (number of workers and timeout) based on your requirements; see the sketch after this list for how these values are typically picked up. The parameters and their details are:

Parameter            Environment Variable     Default Value
Number of workers    MODEL_SERVER_WORKERS     Number of CPU cores
Timeout              MODEL_SERVER_TIMEOUT     60 seconds

B. decision-trees: The directory that contains the application to run in the container.

C. local-test: A directory containing scripts and setup for running simple training and inference jobs locally, so that you can check that everything is set up correctly.
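For reference, here is a minimal sketch of how the serve wrapper mentioned above typically reads the two nginx/gunicorn parameters from the environment before launching the servers (mirroring the awslabs example; the print statement is just for illustration):

import multiprocessing
import os

# Number of gunicorn workers: defaults to the number of CPU cores
cpu_count = multiprocessing.cpu_count()
model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count))

# Request timeout in seconds: defaults to 60
model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60)

print(f"Starting the inference server with {model_server_workers} workers "
      f"and a {model_server_timeout}s timeout.")
# The real serve script then substitutes these values into the nginx and
# gunicorn start-up commands and launches both processes.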

Part B: Build and Deploy the Algorithm to ECR (Elastic Container Registry)

Follow these steps to complete this part.

Step 1: Build the Docker file locally.

docker build -t tree-model .

Step 2: Run the Docker image and perform training.

docker run --rm -v $(pwd)/local_test/test_dir:/opt/ml tree-model train

Step 3: Test the container locally. Run the first command in one terminal and the second command in another.

docker run --rm -p 127.0.0.1:8080:8080 -v $(pwd)/local_test/test_dir:/opt/ml tree-model serve
./predict.sh payload.csv

Step 4: Create a repository in ECR named “tree-model”. Inside the repository, under “View push commands”, follow all the necessary commands step by step (or create the repository programmatically, as sketched after the figures). You can refer to Figure 7 & Figure 8.

Figure 7. Creating a Repository in ECR
Figure 8. Click “View push commands” and follow step-by-step all of the operations
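If you prefer to create the repository from code instead of the console, here is a minimal boto3 sketch; the repository name matches the one above, and the image push itself still happens with the Docker CLI commands listed under “View push commands”:

import boto3

ecr = boto3.client('ecr')

# Create the "tree-model" repository and print the URI you will push the image to
response = ecr.create_repository(repositoryName='tree-model')
print(response['repository']['repositoryUri'])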

Part C: Train and Deploy your container in AWS Sagemaker

This part mainly comprises two subsections: training and endpoint deployment. The training job trains on the dataset and creates the model artifacts; the endpoint then uses these artifacts to produce results. Follow these steps to complete both subsections.

Step 1: Create a bucket named “awssagemakermodels” and, inside the bucket, create two folders, “training” and “models”. Inside the training folder, upload the iris.csv file, which will be used for training and creating the artifacts. The artifacts will be stored in the “models” folder.

Step 2: Create a training job in Amazon SageMaker and fill in the details shown in the figures below (a boto3 sketch of the same job follows the figures).

Figure 9. Set the IAM role which will provide you S3 bucket access
Figure 10. Import the ECR path
Figure 11. Set the resource configuration based on the complexity of the model
Figure 12. Setting up hyperparameter
Figure 13. Set the Input Data Configuration
Figure 14. Set the Output Configuration
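The training job configured in the figures above can also be created programmatically. Below is a minimal boto3 sketch; the job name, instance type, volume size, runtime limit and all placeholder ARNs/URIs are assumptions you must replace with your own values:

import boto3

sm = boto3.client('sagemaker')

sm.create_training_job(
    TrainingJobName='tree-model-training-job',
    AlgorithmSpecification={
        'TrainingImage': '<account-id>.dkr.ecr.<region>.amazonaws.com/tree-model:latest',
        'TrainingInputMode': 'File'
    },
    RoleArn='<sagemaker-execution-role-arn>',
    # HyperParameters={...},  # optional, as in Figure 12
    InputDataConfig=[{
        'ChannelName': 'training',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://awssagemakermodels/training/',
                'S3DataDistributionType': 'FullyReplicated'
            }
        },
        'ContentType': 'text/csv'
    }],
    OutputDataConfig={'S3OutputPath': 's3://awssagemakermodels/models/'},
    ResourceConfig={
        'InstanceType': 'ml.m4.xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 10
    },
    StoppingCondition={'MaxRuntimeInSeconds': 3600}
)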

Step 3: Create an endpoint in SageMaker using the artifacts stored in the S3 bucket. I will also include a script you can use to test the endpoint from your desktop.

Step 4: Create a model and fill in the details shown in the figures below.

Important Note: “Enable Network Isolation” blocks the container from making outbound network calls. Leave it disabled (the default) if you want your container to access the internet; enable it only if your container does not need network access.

Step 5: Create an endpoint configuration and fill in the details shown in the figures below.

Step 6: Create an endpoint and point it at the endpoint configuration you made in Step 5.
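Steps 4 to 6 can also be scripted. Here is a minimal boto3 sketch; the image URI, artifact path, role ARN and instance type are placeholders/assumptions to replace with the values from your own account:

import boto3

sm = boto3.client('sagemaker')

# Step 4: create the model from the ECR image and the S3 model artifacts
sm.create_model(
    ModelName='tree-model',
    PrimaryContainer={
        'Image': '<account-id>.dkr.ecr.<region>.amazonaws.com/tree-model:latest',
        'ModelDataUrl': 's3://awssagemakermodels/models/<training-job-name>/output/model.tar.gz'
    },
    ExecutionRoleArn='<sagemaker-execution-role-arn>'
)

# Step 5: endpoint configuration describing which instances serve the model
sm.create_endpoint_config(
    EndpointConfigName='testendpoint-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'tree-model',
        'InstanceType': 'ml.m4.xlarge',
        'InitialInstanceCount': 1
    }]
)

# Step 6: create the endpoint itself (it takes a few minutes to reach InService)
sm.create_endpoint(
    EndpointName='testendpoint',
    EndpointConfigName='testendpoint-config'
)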

Step 7: Test the endpoint using a local Python file named “Testing.py”. If it successfully prints the response body, voilà, you have a working endpoint.

import boto3
import io
import pandas as pd
import itertools

# Set the parameters below
bucket = '<bucket name>'
key = '<folder-location>/iris.csv'
endpointName = 'testendpoint'

# Pull our data from S3
s3 = boto3.client('s3')
f = s3.get_object(Bucket=bucket, Key=key)

# Make a dataframe
shape = pd.read_csv(io.BytesIO(f['Body'].read()), header=None)

# Take a random sample
a = [50 * i for i in range(3)]
b = [40 + i for i in range(10)]
indices = [i + j for i, j in itertools.product(a, b)]
test_data = shape.iloc[indices[:-1]]
test_X = test_data.iloc[:, 1:]
test_y = test_data.iloc[:, 0]

# Convert the dataframe to csv data
test_file = io.StringIO()
test_X.to_csv(test_file, header=None, index=None)

# Talk to SageMaker
client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName=endpointName,
    Body=test_file.getvalue(),
    ContentType='text/csv',
    Accept='Accept'
)
print(response['Body'].read().decode('ascii'))

Part D: Autoscale endpoint to achieve concurrency

This is the final part of the project where we are going to autoscale the endpoint to achieve parallelism.

Recap: Autoscaling is a feature that provides parallelism on a SageMaker endpoint. With this parallelism, multiple customers can access the endpoint with low latency.

Step 1: Theoretical Calculations

For autoscaling, it is important to decide what kind of traffic you expect on your model. Based on that traffic, we decide the target value.

Let us assume the total traffic is 25,000 requests/minute.
Thus the requests per second are 25,000 / 60 ≈ 417 RPS.
Hence, the target value is roughly 417 RPS.

Step 2: Let us implement autoscaling on our endpoint based on this calculation. Follow the steps shown in the figures below.
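If you prefer code over the console, the same configuration can be applied through the Application Auto Scaling API. A minimal boto3 sketch follows, assuming the endpoint name testendpoint and variant name AllTraffic used earlier; the capacities, cooldowns and target value are illustrative and should be tuned to your own traffic estimate and to how the chosen metric is measured:

import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = 'endpoint/testendpoint/variant/AllTraffic'

# Register the endpoint variant as a scalable target (1 to 4 instances here)
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4
)

# Target-tracking policy on the invocations-per-instance metric
autoscaling.put_scaling_policy(
    PolicyName='tree-model-invocation-scaling',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 417.0,  # from the Step 1 estimate; adjust for your traffic
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 300
    }
)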

Voilà, it’s done. We have successfully scaled our project to accept around 25,000 requests per minute.

Some things you can KEEP IN MIND!!

There are a few configurations and steps you can keep in mind while monitoring the SageMaker endpoint.

  1. Failure handling: If results fail for a given customer request, you can check directly under the failure folder inside your S3 bucket.
  2. CloudWatch monitoring: You can monitor latency and see the average as well as the highest/lowest peaks reached by your models. CloudWatch logs are also available to help you monitor the health of the endpoint, as sketched below.
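As an example of the second point, here is a minimal boto3 sketch that pulls the average and maximum model latency of the endpoint over the last hour (endpoint and variant names are the ones used earlier; the one-hour window and five-minute period are arbitrary choices):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# ModelLatency is reported by SageMaker in microseconds
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/SageMaker',
    MetricName='ModelLatency',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': 'testendpoint'},
        {'Name': 'VariantName', 'Value': 'AllTraffic'}
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Average', 'Maximum']
)

for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'], point['Maximum'])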

Conclusion

It was a long journey, but hopefully fruitful enough to give you an architectural understanding of how to scale ML models, with an implementation using AWS SageMaker and Docker.

To summarize, these are the key points you have learned from this tutorial:

  1. Why the scalability of ML models is important for meeting customer requirements.
  2. A basic understanding of Docker and how to dockerize a Flask application.
  3. An overview of how to build a SageMaker-compatible Docker application.
  4. The overall flow of the project.
  5. How to build the Dockerfile and its required application files, and how to push the image to ECR.
  6. How to run a training job using AWS SageMaker.
  7. How to keep different versions of the model artifacts and handle them through a single endpoint.
  8. How to create a SageMaker endpoint.
  9. What autoscaling is, with both a theoretical calculation and a practical implementation of it.
  10. Finally, a few monitoring steps and policies you can keep in mind if you plan to use the endpoint to meet customer-end requirements in the longer run.
