What will be covered in this article:
- How SageMaker works
- How to prepare a model for SageMaker
- How to use AWS Lambda to trigger model training and deployment automatically
Source code for this article can be found at https://github.com/xg1990/aws-sagemaker-demo.
1. SageMaker Introduction
SageMaker has three levels of API we can work with:
- High Level API:
All you need to do is define the model training, prediction and data input/output functions, and then submit or deploy the source code with the necessary configuration (e.g. instance types). The SDK takes care of the rest (e.g. loading data from S3, creating the training job, publishing the model endpoint).
- Mid-Level API:
Besides defining the source code, you also need to upload it to S3 yourself, specify its S3 URL, and explicitly set up all other configuration.
- Low-Level API:
Essentially, SageMaker does everything within a container. Users can create their own Docker container and make it do whatever they want. These containers are called Algorithms in SageMaker.
In this article, I will cover the usage of boto3. Defining your own Docker container (Low-Level API) is only necessary when your ML model is not based on any of the SageMaker-supported frameworks: Scikit-Learn, TensorFlow, PyTorch, ...
There are many modules provided by SageMaker. In this article, the following modules will be used:
- Notebook instances: fully managed Jupyter notebook instances where you can test your machine learning code with access to all other AWS services (e.g. S3)
- Training jobs: the place to manage model training jobs
- Models: the place to manage trained models
- Endpoints: fully managed web services that take requests (HTTP or others) as input and return predictions as responses
- Endpoint configurations: configuration for endpoints
2. Prepare SageMaker Model
To train and deploy a machine learning model on SageMaker, we need to prepare a Python script that defines the behaviour of our model. The training part of the script works as follows:
- It is defined within if __name__ == '__main__'.
- It will be run by SageMaker with several command line arguments and environment variables passed into it.
- It should load training data from the --train directory and output the trained model (usually a binary file) into the model directory.
- Example code looks like this:
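As a minimal sketch of such a training entry point, the script below reads CSV files from the train directory, fits a model and writes it to the model directory. To stay dependency-free, the "model" is a hand-written least-squares line stored with pickle; a real script would fit a Scikit-learn estimator instead. The two-column CSV layout, the file name model.pkl, the --model-dir argument name and the helper names are assumptions made for illustration, while SM_CHANNEL_TRAIN and SM_MODEL_DIR are the environment variables SageMaker sets inside a training container.

```python
import argparse
import csv
import glob
import os
import pickle


def load_rows(train_dir):
    """Read every CSV file in train_dir as (x, y) pairs of floats."""
    rows = []
    for path in sorted(glob.glob(os.path.join(train_dir, "*.csv"))):
        with open(path, newline="") as f:
            for record in csv.reader(f):
                if record:  # skip blank lines
                    rows.append((float(record[0]), float(record[1])))
    return rows


def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept (stand-in for estimator.fit)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def main(argv=None):
    parser = argparse.ArgumentParser()
    # Defaults come from the SageMaker environment; the fallbacks are for local runs.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    # Hyperparameters arrive as extra command line arguments, so ignore unknown ones.
    args, _ = parser.parse_known_args(argv)

    model = fit_line(load_rows(args.train))
    os.makedirs(args.model_dir, exist_ok=True)
    model_path = os.path.join(args.model_dir, "model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model_path


# Runs automatically only inside a SageMaker training job; elsewhere call
# main(["--train", ..., "--model-dir", ...]) explicitly.
if __name__ == "__main__" and "SM_CHANNEL_TRAIN" in os.environ:
    main()
```

For a quick local check, main(["--train", "./data", "--model-dir", "./model"]) can be called directly with a folder of CSV files.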
Besides the main script, the following functions are defined for model deployment:
- model_fn: loads the saved model binary file(s)
- input_fn: parses input data from different interfaces (an HTTP request or a Python function call)
- predict_fn: takes the parsed input and makes predictions with the model loaded by model_fn
- output_fn: encodes the prediction made by predict_fn and returns it to the corresponding interface (an HTTP response or a Python function return value)
The model_fn function should look into model_dir and load the saved model binary file into memory for prediction tasks. Example code looks like this:
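A minimal sketch of model_fn, assuming the training step saved a pickled object under the file name model.pkl (both the file name and the use of pickle are assumptions; Scikit-learn models are often saved and loaded with joblib instead):

```python
import os
import pickle


def model_fn(model_dir):
    """Load the artefact written by the training step from model_dir."""
    # "model.pkl" is an assumed file name; it must match whatever the
    # training code saved into the model directory.
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)
```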
input_fn takes two arguments: the request body and its content type (request_content_type).
input_fn should check the type of the input data (request_content_type) before parsing it. You can also do data pre-processing here. Example code looks like this:
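As a sketch, an input_fn that accepts JSON or CSV payloads could look like the following. The two supported content types and the conversion of every value to float are choices made for this illustration, not requirements of SageMaker.

```python
import csv
import io
import json


def input_fn(request_body, request_content_type):
    """Turn the raw request payload into a list of numeric rows."""
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")

    if request_content_type == "application/json":
        rows = json.loads(request_body)
    elif request_content_type == "text/csv":
        rows = [row for row in csv.reader(io.StringIO(request_body)) if row]
    else:
        raise ValueError(f"Unsupported content type: {request_content_type}")

    # Simple pre-processing: make sure every value is a float.
    return [[float(value) for value in row] for row in rows]
```

Rejecting unknown content types with an explicit error makes a misconfigured client easier to diagnose than a parsing failure further down the line.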
predict_fn takes two arguments:
- input_data is the output of input_fn
- model is the output of model_fn
Usually, we may need to do some data transformation before prediction. Example code looks like this:
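A possible predict_fn is sketched below. It only assumes that the model exposes a predict method, as Scikit-learn estimators do; MeanModel is a hypothetical stand-in used so the example runs without any ML library installed.

```python
class MeanModel:
    """Stand-in for a fitted estimator; predicts the mean of each row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def predict_fn(input_data, model):
    """Reshape the parsed input into rows of floats, then call model.predict."""
    # Accept a single flat sample such as [1, 3] as well as a batch of rows.
    if input_data and not isinstance(input_data[0], (list, tuple)):
        input_data = [input_data]
    features = [[float(value) for value in row] for row in input_data]
    # Convert the result to a plain list so it is easy to encode later.
    return list(model.predict(features))
```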
output_fn also takes two arguments:
- prediction is the output of predict_fn
- accept is the expected output format (e.g. application/json) from the client side.
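A sketch of output_fn that honours the accept argument is shown below. The response layout (a "predictions" key for JSON, one value per line for CSV) is an assumption, and the exact return type a given framework container expects may differ between versions, so treat this as an illustration only.

```python
import json


def output_fn(prediction, accept):
    """Encode predictions in the format requested by the client."""
    values = [float(value) for value in prediction]
    if accept == "application/json":
        return json.dumps({"predictions": values})
    if accept == "text/csv":
        return "\n".join(str(value) for value in values)
    raise ValueError(f"Unsupported accept type: {accept}")
```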
All of the above functions should be put into one Python script (let's say it is train_and_deploy.py). We can then use python-sagemaker-sdk to test our model for SageMaker in our local environment.
3. Model development with python-sagemaker-sdk
Besides the train_and_deploy.py script, we need to prepare another Python script to run python-sagemaker-sdk, which looks like this:
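The sketch below shows one way such a driver script might be written. It assumes version 2 of the SageMaker Python SDK (older releases used train_instance_type instead of instance_type), and the framework_version value, the role ARN and the data location are placeholders to replace with your own.

```python
def estimator_settings(role, instance_type="local"):
    """Keyword arguments for the SKLearn estimator (SDK v2 parameter names)."""
    return {
        "entry_point": "train_and_deploy.py",
        "role": role,
        "instance_type": instance_type,  # "local" runs the container on this machine
        "instance_count": 1,
        "framework_version": "0.23-1",  # assumed; pick a version your SDK supports
        "py_version": "py3",
    }


def train_and_deploy(role, train_data_uri, instance_type="local"):
    """Train on the data at train_data_uri, then publish an endpoint."""
    # Imported here so the settings above can be inspected without the SDK installed.
    from sagemaker.sklearn import SKLearn

    sklearn_estimator = SKLearn(**estimator_settings(role, instance_type))
    # The "train" channel is what the training script sees as its train directory.
    sklearn_estimator.fit({"train": train_data_uri})
    return sklearn_estimator.deploy(
        initial_instance_count=1, instance_type=instance_type
    )
```

Calling train_and_deploy(role, "s3://your-bucket/train") returns a predictor object whose predict method sends data to the new endpoint.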
We use from sagemaker.sklearn import SKLearn if our model is based on Scikit-learn. If the model is based on TensorFlow, we can use from sagemaker.tensorflow import TensorFlow instead.
One thing to pay attention to: every time we call sklearn_estimator.deploy, the SageMaker SDK starts a new Docker instance and runs the corresponding job, which is very slow if it is done on the server side just for debugging. In this case, we can set instance_type='local' to test locally, which is much faster (make sure Docker is set up on the local machine).
Test Published Endpoint
Once the trained model is published as an endpoint service, we can test the endpoint with new data:
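One way to do this is with the boto3 sagemaker-runtime client, as sketched below. The endpoint name and the JSON request and response formats are assumptions; they must match what input_fn and output_fn in your own script accept and return. The client is passed in as a parameter so the function can be exercised with a stub.

```python
import json


def query_endpoint(endpoint_name, rows, runtime_client=None):
    """Send rows to a SageMaker endpoint as JSON and decode the JSON reply."""
    if runtime_client is None:
        import boto3  # imported lazily; only needed for real calls

        runtime_client = boto3.client("sagemaker-runtime")
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",  # handed to input_fn as the content type
        Accept="application/json",  # handed to output_fn as accept
        Body=json.dumps(rows),
    )
    # The response body is a stream that has to be read before decoding.
    return json.loads(response["Body"].read())
```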
4. Using Lambda Function to control SageMaker
Now we have successfully trained and deployed a model on SageMaker, but that is not enough.
In the real world, we receive new data every day and need to retrain the machine learning model periodically.
Moreover, model training usually takes a long time, and we need to make sure that once the training job is done, the trained model is automatically deployed to the existing endpoint (SageMaker does not do this automatically).
There are two solutions for this: Step Functions and Lambda.
- Solution 1: Step Functions. We can define a Lambda function to check the status of a training job. Then we use a step function to call that Lambda function periodically (e.g. every hour) and, once training is done, call another Lambda function to deploy the model. This solution is well documented at https://github.com/aws-samples/serverless-sagemaker-orchestration
- Solution 2: Lambda. Once a model training job is finished, the trained model is written to an S3 bucket. The S3 PUT event can be associated with a Lambda function to trigger model deployment.
This article demonstrates solution 2: how to use a Lambda function and S3 events to manage SageMaker.
lambda_handler takes the argument event, which contains information about how the Lambda function was triggered. By interpreting the event, we can carry out different actions within one Lambda handler. The following is example code for the Lambda function we use:
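A sketch of such a dispatching handler is shown below. The S3 branch relies on the standard S3 notification layout (a Records list whose entries carry eventSource and s3 fields). The "task" field used for our own events and the function name retrain_model are hypothetical conventions, and the three helper functions are placeholders that only report what they would do.

```python
def deploy_model(bucket, key):
    """Placeholder: create a model from the new artefact and update the endpoint."""
    return {"action": "deploy", "bucket": bucket, "key": key}


def retrain_model():
    """Placeholder (hypothetical name): start a new training job."""
    return {"action": "retrain"}


def make_prediction(data):
    """Placeholder: forward data to the endpoint and return its answer."""
    return {"action": "predict", "data": data}


def lambda_handler(event, context):
    records = event.get("Records", [])
    # S3 notifications arrive as a list of records tagged with their source.
    if records and records[0].get("eventSource") == "aws:s3":
        s3_info = records[0]["s3"]
        return deploy_model(s3_info["bucket"]["name"], s3_info["object"]["key"])
    # The "task" field is an assumed convention for our own events.
    if event.get("task") == "retrain":
        return retrain_model()
    if event.get("task") == "predict":
        return make_prediction(event.get("data"))
    raise ValueError("Unrecognised event")
```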
Three different events are handled here:
- If the input event is an S3 event, it will deploy the newly trained model
- Otherwise, if event is a model re-train task, it will start a new training job
- And if event is a prediction task, it will call make_prediction() and pass the result back as the response
Trigger model retrain task periodically
A Lambda function can be triggered periodically by a CloudWatch event.
We can add a CloudWatch Event trigger and set up a Rule:
- Event Source should be Schedule, with a custom cron expression (very similar to crontab on Linux)
- Set Targets to the Lambda function we use to control SageMaker.
- Configure input can be Constant (JSON text), so that lambda_handler can understand what to do with it.
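The same rule can also be created from code rather than the console. The sketch below uses the boto3 events client (passed in as a parameter); the rule name, target Id, schedule and the {"task": "retrain"} payload are all assumptions chosen for illustration.

```python
import json


def schedule_retraining(events_client, lambda_arn, rule_name="retrain-model-daily"):
    """Create a scheduled rule that sends a constant JSON input to the Lambda."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression="cron(0 2 * * ? *)",  # every day at 02:00 UTC
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[
            {
                "Id": "sagemaker-control-lambda",
                "Arn": lambda_arn,
                # Constant (JSON text) input; the "task" field is an assumed convention.
                "Input": json.dumps({"task": "retrain"}),
            }
        ],
    )
    return rule["RuleArn"]
```

When the rule is created this way, the Lambda function also needs a resource-based permission that allows events.amazonaws.com to invoke it; the console adds that permission for you.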
This is an example of a model re-training function. It creates a model training job.
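A sketch of such a function using boto3's create_training_job is shown below. The instance type, volume size, time limit and job-name prefix are placeholder values, and the container image URI, role ARN and S3 locations are left as parameters because they depend on your account and region. The sagemaker_program and sagemaker_submit_directory hyperparameters are how framework containers are told which script to run; they are JSON-encoded here on the assumption that the container decodes them the way it does for jobs started by the SDK.

```python
import json
import time


def build_training_job_request(job_name, image_uri, role_arn, source_uri,
                               train_uri, output_uri):
    """Assemble the arguments for create_training_job."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "HyperParameters": {
            "sagemaker_program": json.dumps("train_and_deploy.py"),
            "sagemaker_submit_directory": json.dumps(source_uri),
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def retrain_model(sagemaker_client, **settings):
    """Start a training job with a timestamped, unique name and return that name."""
    job_name = "demo-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    request = build_training_job_request(job_name=job_name, **settings)
    sagemaker_client.create_training_job(**request)
    return job_name
```

Here sagemaker_client would be boto3.client("sagemaker"); passing it in keeps the request-building logic easy to check without touching AWS.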
Please note that we cannot use python-sagemaker-sdk within the Lambda environment, so the best solution is to use boto3. That means we need to upload train_and_deploy.py to S3 as a gzipped tar package ourselves, and set up everything else as done in the code above.
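Packaging and uploading the script can be sketched with the standard tarfile module and the boto3 S3 client. The archive name sourcedir.tar.gz and the bucket and key are placeholders; the S3 client is passed in as a parameter.

```python
import os
import tarfile


def package_source(script_path, archive_path="sourcedir.tar.gz"):
    """Pack the training script into a gzipped tar with the script at its root."""
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(script_path, arcname=os.path.basename(script_path))
    return archive_path


def upload_source(s3_client, archive_path, bucket, key):
    """Upload the archive and return the S3 URI to hand to the training job."""
    s3_client.upload_file(archive_path, bucket, key)
    return f"s3://{bucket}/{key}"
```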
Once a model training job is done, we need to deploy the trained model and update the existing endpoint. SageMaker doesn't do this for you automatically, and we cannot ask Lambda to wait for the training job, as training may take several hours.
Once a training job is done, the trained model is written to S3. This triggers an S3 PUT event, which notifies our Lambda function that training is done and deployment can start.
Within an S3 PUT event, the key of the S3 object is provided, which is usually related to the unique ID of our training job, as is done in the following code:
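A sketch of that parsing step follows. It relies on SageMaker writing model artefacts to the job's output path under <training-job-name>/output/model.tar.gz, and on S3 delivering object keys URL-encoded in event notifications; the function name is hypothetical.

```python
from urllib.parse import unquote_plus


def training_job_from_event(event):
    """Return (bucket, key, job_name) for the first record of an S3 PUT event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded
    parts = key.split("/")
    # Expect .../<training-job-name>/output/model.tar.gz
    if len(parts) < 3 or parts[-2:] != ["output", "model.tar.gz"]:
        raise ValueError(f"Not a model artefact: {key}")
    return bucket, key, parts[-3]
```

Checking the key layout also protects against deploying when some other object is written to the same bucket.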
Once the model training job ID is known, we can call the deploy_model function to deploy our model, which looks like this:
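A sketch of deploy_model with boto3 is shown below. It reuses the training job's container image, role and model artefact, registers a new model and endpoint configuration named after the job, and points the existing endpoint at them. The parameter list, the instance type, the variant name and the reuse of the job name for the model and configuration are assumptions; SAGEMAKER_PROGRAM and SAGEMAKER_SUBMIT_DIRECTORY tell the framework container which script holds the serving functions.

```python
def deploy_model(sagemaker_client, job_name, endpoint_name, source_uri,
                 instance_type="ml.t2.medium"):
    """Create a model from a finished training job and update the endpoint."""
    job = sagemaker_client.describe_training_job(TrainingJobName=job_name)

    sagemaker_client.create_model(
        ModelName=job_name,
        ExecutionRoleArn=job["RoleArn"],
        PrimaryContainer={
            "Image": job["AlgorithmSpecification"]["TrainingImage"],
            "ModelDataUrl": job["ModelArtifacts"]["S3ModelArtifacts"],
            "Environment": {
                "SAGEMAKER_PROGRAM": "train_and_deploy.py",
                "SAGEMAKER_SUBMIT_DIRECTORY": source_uri,
            },
        },
    )
    sagemaker_client.create_endpoint_config(
        EndpointConfigName=job_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": job_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    )
    # The endpoint must already exist; this swaps in the new configuration.
    sagemaker_client.update_endpoint(
        EndpointName=endpoint_name, EndpointConfigName=job_name
    )
    return job_name
```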
With that, everything is set up: our model is retrained periodically and each newly trained model is deployed to the endpoint automatically.
SageMaker is especially helpful when you need to retrain your model periodically and serve your model as a web service. For training, SageMaker can automatically start a high-performance EC2 instance and finish model training within a short time at the minimum cost. For web serving, SageMaker can take care of auto-scaling and make sure your endpoint is always available.
To use SageMaker for Machine Learning, the most important step is to prepare a script that defines the behaviours of your model. And you also have full control of the whole system by creating your own docker container.
Other things to be aware of
- We have not included model performance testing in this workflow, which is also important in a machine learning pipeline;
- S3 events don't guarantee 100% delivery. If model training is critical, Step Functions is a better choice
- Make sure your AWS role has enough permissions to control the necessary resources
- SageMaker can also run batch prediction jobs, and many other functions remain to be explored.