Build a Streaming Inference Pipeline by Deploying Apache MXNet on AWS Lambda

WayTrue · Published in Apache MXNet · 12 min read · Jul 11, 2020

Thanks to Sandeep Krishnamurthy, Olivier Cruchant, Thom Lane and Kevin Mould for their feedback

Introduction

AWS Lambda is a compute service that allows you to run code without provisioning or managing servers. It also scales automatically, with the ability to run up to 3,000 concurrent executions in a burst (depending on the region). More importantly, you only pay for the compute time you consume, which can lead to significant cost savings in certain situations. Combining deep learning models with AWS Lambda gives you strong model performance with that same convenience. In this post, we will walk through how to build a streaming inference pipeline using Apache MXNet and AWS Lambda. In particular, we will discuss how to deploy a package that exceeds the usual Lambda upload limit. Finally, we will analyze the pipeline's performance and cost.

Use case scenario

A deep learning streaming inference pipeline is a perfect use case for MXNet on AWS Lambda. Let's assume an e-commerce company wants to extract metadata from their product images. For example, Airbnb uses image classification models to identify what type of room (kitchen, pool, garden) is displayed in a picture. This is where the streaming inference pipeline comes in: it continuously monitors the data flow, performs image classification whenever an image arrives, and saves the results to long-term storage (such as Amazon S3). AWS Lambda manages everything for you, from provisioning to scaling. Additionally, the company only pays for the compute time that AWS Lambda actually uses, which makes it very cost-efficient for intermittent workloads.

Inference pipeline workflow

The above data extraction use case boils down to the following workflow:

  1. An image is uploaded to an Amazon S3 bucket (input bucket).
  2. The image triggers the AWS Lambda function to retrieve resources from the resource bucket and perform image classification using the MXNet model.
  3. The inference result is stored in another Amazon S3 bucket (output bucket).

Build the pipeline

This section will show you how to build the above pipeline step by step. Before we start, please don’t forget to configure your AWS Command Line Interface (CLI) if you haven’t already done so:

$ aws configure

Amazon S3 buckets

The first step is to create the following storage components of the pipeline:

  • input_bucket: An S3 bucket to receive the input images.
  • resource_bucket: An S3 bucket to host the resource files for the Lambda function during runtime.
  • output_bucket: An S3 bucket to store the inference results.

You can either use the Amazon S3 management console or the following CLI commands to create the S3 buckets:

$ aws s3api create-bucket --bucket your-input-bucket-name \
--region your-region \
--create-bucket-configuration \
LocationConstraint=your-region
$ aws s3api create-bucket --bucket your-resource-bucket-name \
--region your-region \
--create-bucket-configuration \
LocationConstraint=your-region
$ aws s3api create-bucket --bucket your-output-bucket-name \
--region your-region \
--create-bucket-configuration \
LocationConstraint=your-region

Prepare the deployment package

The next step is to prepare the deployment package for AWS Lambda. A deployment package is a compressed archive that contains the function code and its dependencies; AWS Lambda runs this code when the function is triggered. In our case, it should contain the following items:

  • lambda_function.py — the main function that performs inference during runtime.
  • resnet50_v2.params — the model parameter file. In this case we use a pre-trained ResNet50_v2 model (download here). You can also use your own model. More details on saving/loading models can be found here.
  • synset.txt — the label file for the ImageNet dataset. It maps the model output (an integer) to an object class (a string). For instance, ‘559’ is mapped to ‘folding chair’. You can find it here.
  • dependencies — the libraries that the Lambda function depends on, such as MXNet, numpy, etc.

Dependencies
The following command downloads the currently released version of MXNet (1.6.0) and all of its dependencies into the package folder in the current directory.

$ pip install mxnet -t ./package

Note: please make sure that you download the dependencies within an operating system that is compatible with Amazon Linux (AL/AL2), since the Lambda function runs on Amazon Linux. The easiest way to do this is to run the command on an Amazon EC2 instance running Amazon Linux, or to use Docker to set up an Amazon Linux environment.

Lambda upload limit
Adding everything to the package folder, we end up with the following file structure:

package
----lambda_function.py
----resnet50_v2.params
----synset.txt
----mxnet
----numpy
----numpy.libs
...
----urllib3

Our current package exceeds 370 MB (mxnet 190 MB, resnet50_v2.params 100 MB, numpy 50 MB, numpy.libs 30 MB), but the deployment package limit for Lambda is 250 MB. This makes it impossible to upload the whole package as is. However, AWS Lambda provides an additional 512 MB of storage in the “/tmp” directory while the Lambda function runs. Files written to this directory persist for the life cycle of the Lambda instance. Thus, we can split the package into smaller parts: one part is uploaded to AWS Lambda directly, and the rest is downloaded into the “/tmp” directory at runtime. By splitting the package and exploiting the “/tmp” storage, we increase the effective size limit from 250 MB to more than 750 MB.
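To make this concrete, below is a minimal sketch (with hypothetical bucket and key names, not the exact implementation) of the module-level code that pulls the second half of the package from S3 into “/tmp” and makes it importable. Because it lives at module level, it only runs on a cold start; boto3 is already available in the Lambda Python runtime, so it does not need to be bundled.

# Sketch only: bucket and key names are assumptions.
import os
import sys
import tarfile

import boto3  # provided by the Lambda Python runtime, no need to bundle it

RESOURCE_BUCKET = "your-resource-bucket-name"
TMP_PKG_DIR = "/tmp/pkg_tmp"

s3 = boto3.client("s3")

if not os.path.isdir(TMP_PKG_DIR):
    # Cold start: download and extract the dependencies that did not fit
    # into the 250 MB deployment package.
    s3.download_file(RESOURCE_BUCKET, "pkg_tmp.tar.gz", "/tmp/pkg_tmp.tar.gz")
    os.makedirs(TMP_PKG_DIR)
    with tarfile.open("/tmp/pkg_tmp.tar.gz") as tar:
        tar.extractall(TMP_PKG_DIR)
    os.remove("/tmp/pkg_tmp.tar.gz")  # reclaim /tmp space for the model file

# Make the extracted libraries importable; this must happen before `import mxnet`.
sys.path.insert(0, TMP_PKG_DIR)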

In this implementation, we split the package as seen below. You can split this in other ways, as long as it fits within the AWS Lambda constraints.

pkg_lambda
----lambda_function.py
----synset.txt
----mxnet
pkg_tmp
----numpy
----numpy.libs
...
----urllib3
resnet50_v2.params

pkg_lambda
This package contains all the files that will be uploaded to AWS Lambda directly. Its size is reduced to around 200 MB (from 370 MB before). Use the following command to compress it and get pkg_lambda.zip:

~$ cd pkg_lambda
~/pkg_lambda$ zip -r9 ${OLDPWD}/pkg_lambda.zip .

We also provide our pkg_lambda.zip here.

You can either use the Amazon S3 management console or the following CLI command to upload it to the resource_bucket:

$ aws s3 cp pkg_lambda.zip s3://your-resource-bucket-name

pkg_tmp and model file
These files will be stored in the resource_bucket and downloaded into the “/tmp” directory at runtime. The total size is around 85 MB. Use the following command to create pkg_tmp.tar.gz:

~$ cd pkg_tmp
~/pkg_tmp$ tar -czvf ${OLDPWD}/pkg_tmp.tar.gz .

We also provide our pkg_tmp.tar.gz here.

You can either use the Amazon S3 management console or the following CLI command to upload it to the resource_bucket:

$ aws s3 cp pkg_tmp.tar.gz s3://your-resource-bucket-name

The last thing to upload is the model file:

$ aws s3 cp resnet50_v2.params s3://your-resource-bucket-name

Note: here we choose to keep the model file separate from pkg_tmp. Uploading it on its own, without compressing it into the package, leaves more room in the “/tmp” directory when it is downloaded. It also makes updating the model file easier, since we don’t need to rebuild the whole pkg_tmp.

lambda_function.py
lambda_function.py contains a function called lambda_handler which takes the trigger event (an image in our case) as an input and performs the model inference.

In this file we:

  • Instruct the AWS Lambda runtime to download pkg_tmp and the model file from the resource_bucket into “/tmp”.
  • Transform the input images into the required format for the model.
  • Perform image classification inference.
  • Send the results to output_bucket.

Note: lambda_function.py contains several tasks that are not specific to a single event, such as downloading pkg_tmp, loading the model parameters, and generating the label list. We should place these code blocks above the lambda_handler function, at module scope, so they run only once per instance and do not add to the per-request inference latency. More detail can be found in our implementation of lambda_function.py, and a condensed sketch follows.
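For reference, here is what that structure can look like. It is a sketch rather than the exact implementation: it assumes the Gluon model zoo ResNet50_v2 architecture, the “/tmp” setup sketched earlier (which must run before mxnet is imported), and placeholder bucket names.

import os

import boto3
import mxnet as mx
from mxnet.gluon.model_zoo import vision

s3 = boto3.client("s3")
RESOURCE_BUCKET = "your-resource-bucket-name"
OUTPUT_BUCKET = "your-output-bucket-name"

# Module level (runs once per instance): fetch model parameters and labels.
if not os.path.exists("/tmp/resnet50_v2.params"):
    s3.download_file(RESOURCE_BUCKET, "resnet50_v2.params", "/tmp/resnet50_v2.params")
net = vision.resnet50_v2(pretrained=False, classes=1000)
net.load_parameters("/tmp/resnet50_v2.params")
with open("synset.txt") as f:          # shipped inside pkg_lambda
    labels = [line.strip() for line in f]

def lambda_handler(event, context):
    # The S3 trigger event tells us which image was just uploaded.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    s3.download_file(bucket, key, "/tmp/input.jpg")

    # Preprocess to the 1x3x224x224 float input the model expects.
    img = mx.image.imread("/tmp/input.jpg")
    img = mx.image.imresize(img, 224, 224).astype("float32") / 255.0
    img = mx.nd.transpose(img, (2, 0, 1))
    mean = mx.nd.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
    std = mx.nd.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
    img = ((img - mean) / std).expand_dims(axis=0)

    # Run inference and write the top-1 class to the output bucket.
    top = int(net(img).softmax().argmax(axis=1).asscalar())
    result = "{}: {}".format(key, labels[top])
    s3.put_object(Bucket=OUTPUT_BUCKET, Key=key + ".txt", Body=result.encode("utf-8"))
    return result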

Creating an AWS Lambda function

Once the deployment packages are ready, the next step is to create the AWS Lambda function.

Set an IAM role
First, we need to create an IAM role that permits AWS Lambda to communicate with the other AWS components in the pipeline. Go to the AWS IAM management console: Roles → Create role → Lambda, and attach the following policies:

  • AmazonS3FullAccess
  • AWSLambdaBasicExecutionRole
  • CloudWatchEventsFullAccess

More details on IAM roles can be found here.

Create the Lambda function
The next step is to create the AWS Lambda function, attach the IAM role to it and upload the pkg_lambda. You can either use the AWS Lambda management console or the following CLI command:

$ aws lambda create-function --function-name your-function-name \
--code S3Bucket=your-resource-bucket-name,S3Key=pkg_lambda.zip \
--handler lambda_function.lambda_handler \
--runtime python3.7 \
--role arn:aws:iam::your-aws-account-id:role/your-role-name \
--timeout 30 \
--memory-size 1024

Add a trigger to the Lambda function
The last step is to add an S3 “All object create events” trigger to the Lambda function. Go to the AWS Lambda management console → Functions → select the function we just created → Add trigger → select S3 → specify input_bucket for the bucket name → specify “All object create events” for the event type.
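If you prefer to script this step instead of using the console, a rough boto3 equivalent looks like the sketch below (the function ARN, account id, and statement id are placeholders):

import boto3

FUNCTION_NAME = "your-function-name"
FUNCTION_ARN = "arn:aws:lambda:your-region:your-aws-account-id:function:your-function-name"
INPUT_BUCKET = "your-input-bucket-name"

# First allow the input bucket to invoke the function...
boto3.client("lambda").add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="s3-invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::" + INPUT_BUCKET,
)

# ...then register the "All object create events" notification on the bucket.
boto3.client("s3").put_bucket_notification_configuration(
    Bucket=INPUT_BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {"LambdaFunctionArn": FUNCTION_ARN, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)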

Test
At this point, we’ve successfully built an inference pipeline. To test it, upload an image to the input_bucket. A new file that contains the predicted object class should appear in the output_bucket.

MXNet with MKL-DNN

Intel MKL-DNN provides various highly vectorized and threaded operators to accelerate deep learning frameworks. MXNet supports MKL-DNN to achieve better training and inference performance. In this section, we will demonstrate how to enable MKL-DNN in our inference pipeline.

The only difference is the MXNet library. Use the following command to download the MKL-DNN enabled MXNet (version 1.6.0):

$ pip install mxnet-mkl -t ./package

Note that the MXNet library itself reaches 280 MB, and therefore no longer fits within the 250 MB limitation. We must rearrange the deployment package as follows to satisfy the limitation:

pkg_lambda_mkl
----lambda_function.py
----synset.txt
----numpy
----numpy.libs
...
----urllib3
pkg_tmp_mkl
----mxnet
resnet50_v2.params

You can prepare your own packages or download our pkg_lambda_mkl.zip and pkg_tmp_mkl.tar.gz here. Once the two packages are uploaded to the resource_bucket, you can either use the AWS Lambda management console or run the following command to update the pipeline and enable MKL-DNN:

$ aws lambda update-function-code \
--function-name your-function-name \
--s3-bucket your-resource-bucket-name \
--s3-key pkg_lambda_mkl.zip

Inference latency

Latency is a key factor in our inference pipeline. With AWS Lambda functions, the latency depends on whether the instance is “cold” or “warm”. Cold start inference is much slower due to the initialization tasks. Latency also depends on CPU resources: Lambda allocates CPU power in proportion to the configured memory, so a larger memory allocation typically means lower latency, but at a higher cost. In this section, we summarize the latency for both mxnet and mxnet-mkl inference.

The figure above shows the cold start inference latency. Each value is measured by averaging 5 cold start inference latencies.

The figure above shows the warm start inference latency. Each value is measured by averaging 20 consecutive warm start inference latencies.

We observe that enabling MKL-DNN increases the cold start latency due to its larger package size. The overall cold start latency is around 6 seconds, which is a huge time cost for deep learning inference. In production, we should try to avoid cold start. On the other hand, the warm start takes around 400 ms in general, which is acceptable in many use cases. The figure also indicates that enabling MKL-DNN improves the inference latency by around 15%.

Inference cost

Cost is another important factor for the pipeline. Here we mainly focus on the warm start cost, since it applies to most inference requests. AWS Lambda offers a free tier of 1M requests and 400,000 GB-seconds per month; beyond that, it charges per request and per GB-second of compute time. The following figure shows the cost per million requests after the free tier, calculated from AWS Lambda pricing; a back-of-the-envelope version of this calculation is sketched after the observations below.

We see that:

  • A smaller memory allocation lowers the cost, although it increases latency.
  • Enabling MKL-DNN not only speeds up the inference but also lowers the cost.
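As a rough sanity check on these numbers, here is a back-of-the-envelope estimate of the warm-start cost per million requests. The latency value and the prices (about $0.20 per million requests plus $0.0000166667 per GB-second at the time of writing, ignoring the free tier) are assumptions, so treat the output as an approximation:

# Back-of-the-envelope Lambda cost per million warm inferences (assumed prices).
PRICE_PER_REQUEST = 0.20 / 1e6        # USD per request
PRICE_PER_GB_SECOND = 0.0000166667    # USD per GB-second of compute

def cost_per_million(memory_mb, latency_ms):
    gb_seconds = (memory_mb / 1024.0) * (latency_ms / 1000.0)
    return (PRICE_PER_REQUEST + gb_seconds * PRICE_PER_GB_SECOND) * 1e6

print(cost_per_million(1024, 400))    # roughly $7 per million requests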

Cost efficiency

Traditional deep learning servers (like Amazon EC2 instances) have a fixed charge per month, whereas the cost of Lambda is proportional to the number of inference requests. We benchmark the cost of mxnet-mkl on AWS Lambda against a commonly used EC2 setup, c5.xlarge + eia2.medium (the same setup as in this study), which costs $208.80 per month.

From the figure above, we observe that the costs of AWS Lambda at 2048 MB and the c5 setup intersect at around 15 million image requests per month, showing that Lambda is more cost efficient when the monthly request count stays below that. The break-even point between Lambda at 1024 MB and the c5 setup is around 22 million requests, indicating even greater potential for AWS Lambda’s cost efficiency.
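The break-even point can be estimated in the same way. A self-contained sketch, again assuming roughly 400 ms warm latency at 2048 MB and the prices quoted above (both assumptions, not measured quotes):

# Monthly request volume at which Lambda (2048 MB, ~400 ms warm) matches the EC2 cost.
EC2_MONTHLY_COST = 208.80                                       # c5.xlarge + eia2.medium
PER_REQUEST = 0.20 / 1e6 + (2048 / 1024.0) * 0.4 * 0.0000166667  # request + GB-second charges

print(EC2_MONTHLY_COST / PER_REQUEST)                           # on the order of 15 million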

Further Reading

In this section, we would like to share some observations we made while building the pipeline.

Cold Start vs Warm Start

From the previous analysis, we saw the cost advantages of the Lambda pipeline over the EC2 instance. However, those calculations assume “warm start” latency for most of the requests, so it is crucial to minimize “cold starts” of the AWS Lambda function and keep instances “warm” as much as possible. When the first request arrives, a fresh AWS Lambda instance is launched: it first sets up the environment and then processes the actual request. This is the so-called “cold start,” which takes longer due to the initialization steps. Once the cold start has completed, the AWS Lambda instance keeps all of that setup for its life cycle and becomes “warm”, which is why “warm start” latency is so much better. Unfortunately, a warm instance only waits for about 5 to 10 minutes for the next request. If the interval between two requests is longer than that, the instance is recycled and Lambda has to launch another instance, starting again from a “cold start”. As a result, a traffic pattern with request intervals of less than about 5 minutes keeps cold starts infrequent. Otherwise, a simple pre-warming technique, sketched below, can keep AWS Lambda instances alive and mitigate the cold start issue.
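One simple pre-warming approach (our own suggestion, not part of the pipeline above) is to add an early-exit path to the handler and invoke the function on a schedule, for example with a CloudWatch Events/EventBridge rule that fires every few minutes with a custom payload:

def lambda_handler(event, context):
    # A scheduled rule can invoke the function with a payload such as
    # {"warmup": true}; returning early keeps the instance warm without
    # running (or paying for) a full inference.
    if isinstance(event, dict) and event.get("warmup"):
        return "warmed"
    # ...the normal S3-triggered inference path continues here.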

Deploying other models

It is relatively easy to deploy other deep learning models with this pipeline. We only need to upload the new model file and the corresponding synset.txt file to the resource bucket, and then modify lambda_function.py accordingly. The only restriction when deploying a model on AWS Lambda is the size limit of the packages: 250 MB for Lambda itself and 512 MB for “/tmp”. In the above implementation, pkg_tmp is 85 MB, which leaves more than 400 MB for the model files. This should be sufficient for most MXNet computer vision models, e.g. ResNet152 (240 MB), Faster-RCNN (170 MB), Mask-RCNN (180 MB). However, enabling MKL-DNN squeezes the space available for the model file down to around 200 MB. For large models that don’t fit within these limits, consider other techniques to further reduce the package size, such as Deep Java Library and Deep Learning Runtime. You can also combine Lambda with Amazon EFS to load large files.

Changing storage component

You may notice that the pipeline is built from various AWS services, which makes it flexible and easy to swap some of the components for other AWS services. Amazon DynamoDB is a fully managed database that performs data queries very efficiently and lets us store the inference result along with other customer-defined information. Here, we will show you how to switch the output storage from an Amazon S3 bucket to an Amazon DynamoDB table.

First, define a simple table that stores three attributes for each input image. You can edit these or add more based on your needs.

  • UUID: a UUID4 code as the unique identifier of the image.
  • ImageName: the file name of the image.
  • ObjectClass: the predicted object class for the image.

Use either the Amazon DynamoDB management console or the following CLI command to create the DynamoDB table:

$ aws dynamodb create-table \
--table-name your-output-table-name \
--attribute-definitions AttributeName=UUID,AttributeType=S \
--key-schema AttributeName=UUID,KeyType=HASH \
--provisioned-throughput \
ReadCapacityUnits=5,WriteCapacityUnits=5

Add the following IAM policy to the IAM role we created previously:

  • AmazonDynamoDBFullAccess

Next, modify lambda_function.py to send the result to the table (a sketch follows), prepare the deployment packages, and upload the two packages into the resource_bucket. Our pkg_lambda_ddb.zip and pkg_tmp_ddb.tar.gz can be found here.
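The change to lambda_function.py essentially replaces the put_object call for the output bucket with a DynamoDB write. A minimal sketch using boto3, assuming the table name created above:

import uuid

import boto3

table = boto3.resource("dynamodb").Table("your-output-table-name")

def save_result(image_name, object_class):
    # Store one item per processed image, keyed by a random UUID4.
    table.put_item(Item={
        "UUID": str(uuid.uuid4()),
        "ImageName": image_name,
        "ObjectClass": object_class,
    })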

Finally, you can either use the AWS Lambda management console or run the following command to update the pipeline and connect the DynamoDB table to the pipeline:

$ aws lambda update-function-code \
--function-name your-function-name \
--s3-bucket your-resource-bucket-name \
--s3-key pkg_lambda_ddb.zip

Conclusion

In this article, we demonstrated step by step how to build a streaming inference pipeline using MXNet and AWS Lambda. We addressed the AWS Lambda upload limit by splitting the deployment package and exploiting the “/tmp” directory, giving us more than 750 MB of usable storage in total. We then benchmarked the inference performance of MXNet with and without MKL-DNN; our results indicated that enabling MKL-DNN can speed up inference by about 15%. Finally, we showed that, in our case, AWS Lambda inference is more cost efficient than a c5.xlarge + eia2.medium setup when monthly requests stay below roughly 22 million.
