Serverless Inference for Keras with AWS Lambda

Sunil Mallya
Nov 17, 2017


More than a year ago I published a reference architecture for running Serverless inference on AWS Lambda with Apache MXNet and demonstrated its viability. It showed a lot of promise for near real-time applications, with reasonable worldwide latencies for state-of-the-art ImageNet model (ResNet) inference.

Recently there has been an uptick in adoption of the Serverless inference architecture, with BBC iPlayer using TensorFlow and benchmarks extending my previous work with Apache MXNet on Lambda.

I was surprised to find that another popular deep learning framework, Keras, didn't have much content on this. Keras/Theano-focused blogs have existed for a few months: Creating Keras Theano Package and GCC optimized Theano environment. But unfortunately, on Sept 28 the Theano maintainers decided to shut down support for the framework, which prompted users thinking longer term to move to other supported backends like TensorFlow, MXNet and CNTK.

As a start, I was looking for a TensorFlow-backend project to benchmark performance and found this fantastic package. Others will follow soon.

P.S. He's got several other Lambda packs. Thanks ryfeus, you rock!

Unfortunately the focus there seems to be on training, which isn't the ideal setting for AWS Lambda or other Serverless platforms, since they are stateless and time constrained. The timeout on AWS Lambda is currently 5 minutes, which isn't enough to train any large deep learning network. Training on Lambda sounds awesome in theory, and we'd all like to get there, but the most immediate application is to train smaller networks on AWS Lambda and leverage them for hyperparameter optimization (HPO). Matt Wood and I wrote a blog diving deeper into this topic of Serverless AI and HPO a year ago.

Now that we have the background, let's dive into the fun part. AWS Lambda currently limits the compressed code package to 50MB. This means we can't package all the goodies needed for our task, so it's time to get creative. Below are the enhancements/issues that I fixed:

  • The above package doesn't include h5py, which is how Keras saves its models by default.
  • We can't use keras.preprocessing as it requires SciPy (love SciPy, but it's 175MB uncompressed!), so let's use the lightweight PIL library for image processing instead; see the sketch after the commands below.
  • A few code alterations are needed for compatibility between the current versions of TF and Keras: replace "require_flatten" with "include_top" in the function _obtain_input_shape in keras/applications/imagenet_utils.py.
  • The most important step is to use the 'strip' command to reduce the binary file size of the shared objects.
pip install h5py
pip install pillow
# replace the function param for TF/Keras compatibility
sed -i -e 's/require_flatten/include_top/g' keras/applications/imagenet_utils.py
# Let's make those binaries lean
find ./ -name "*.so" | xargs strip
# Also remove those .pyc files
find . -name \*.pyc -delete
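
With keras.preprocessing out, here is a minimal sketch of what the PIL-based preprocessing might look like. The 227x227 input size matches SqueezeNet, and the BGR mean subtraction mimics Keras' 'caffe'-style preprocess_input; be sure to match whatever normalization your model's weights were actually trained with.

import numpy as np
from PIL import Image

# ImageNet channel means used by Keras' 'caffe'-style preprocessing
MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype='float32')

def preprocess_image(path, size=(227, 227)):
    # Resize to the network's expected input size (227x227 for SqueezeNet)
    img = Image.open(path).convert('RGB').resize(size)
    x = np.asarray(img, dtype='float32')   # (227, 227, 3), RGB order
    x = x[:, :, ::-1] - MEAN_BGR           # RGB -> BGR, subtract means
    return np.expand_dims(x, axis=0)       # add batch dim: (1, 227, 227, 3)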

Don't worry about copy/paste, the code is here: clone, contribute, extend and be Serverless!

All set, let's see how fast we can run this with SqueezeNet, a network ideally suited for memory-constrained environments. While you may be able to fit the SqueezeNet model inside the Lambda package, this wouldn't be possible for most models given the current package size limitations. Hence, the recommended architecture is to download the model outside the handler function and keep it in memory for subsequent calls.
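
Here is a minimal sketch of that pattern, assuming the model lives in S3 (the 'my-model-bucket' name and the event shape are placeholders) and reusing the preprocess_image helper sketched above. The key point is that the download and model load happen at module scope, so a warm container reuses the in-memory model instead of paying the S3 round trip on every call.

import os
import boto3
from keras.models import load_model

s3 = boto3.client('s3')
MODEL_PATH = '/tmp/squeezenet.h5'   # /tmp is the only writable path on Lambda

# Runs once per container, not per invocation: download and load the model
# at module scope so warm invocations skip the S3 round trip entirely.
if not os.path.exists(MODEL_PATH):
    s3.download_file('my-model-bucket', 'squeezenet.h5', MODEL_PATH)
model = load_model(MODEL_PATH)   # needs h5py, installed above

def lambda_handler(event, context):
    # Fetch the requested image, preprocess it with the PIL helper from
    # the earlier sketch, and run a single forward pass.
    img_path = '/tmp/input.jpg'
    s3.download_file(event['bucket'], event['key'], img_path)
    x = preprocess_image(img_path)
    preds = model.predict(x)
    return {'class': int(preds.argmax())}

Keep in mind that /tmp gives you only 512MB of scratch space, so the uncompressed model has to fit there as well as within the function's memory allocation.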

A quick sanity benchmark using wrk from San Francisco to our endpoint in us-west-2:

sunil$ wrk -c 40 -t 8 URL

Avg Latency: 377.93ms
Std Deviation: 94.76ms
Max: 959.72ms
1028 requests in 10.10s, 773.25KB read
Requests/sec: 101.76

Not bad at all: ~378ms average latency for a warmed-up Lambda function with 1536MB of memory. I will post benchmarks on deeper nets soon.

* If you benchmark with other frameworks, please make sure to compare against the same network architecture in its native framework implementation.

Getting Started:

As you can see this is a very viable option, so please give it a spin. Use this Jupyter notebook to code, extend and deploy a Serverless application with ease using the Serverless Application Model (SAM).

