Serverless deep/machine learning in production — the pythonic 🐍 way ☯

Michael Dietz
Dec 12, 2017

In this post we will serve a pyt🔥rch deep learning model with AWS Lambda. The simplicity and effectiveness of this approach are pretty amazing. For many use-cases it will greatly simplify our production pipeline. On top of that we will likely see improvements in various metrics across the board — we will touch on some of these over the course of this post. Code will be provided as we go to get you up and running quickly 🏃.

The major 🔑 of this post is overcoming AWS Lambda’s tight deployment package size limitation (250 MB uncompressed) by building large dependencies (e.g. numpy and pytorch) from source. No hacks or complicated workarounds, no deep sh*t 💩.

  • If you are using a different deep learning library/framework (e.g. keras/tensorflow) you should be able to follow along with this post and make substitutions where appropriate. The same goes for a different cloud service.
  • If you serve production models natively on mobile (e.g. Core ML on iOS or tensorflow on Android) you may want to consider this approach, as our lambda function can be called directly from any web or mobile app.

Business Metrics (cost/latency can be slightly tuned)

  • Cost 💰 — you pay only for the compute time you consume — there is no charge when your code is not running. In our example (ResNet-18 pre-trained on ImageNet) classifying 100,000 images/month will cost ~$13.22/month (less on the free tier)
  • Latency 🕐 — our lambda function executes and returns its results in ~2.7 seconds. For many use-cases this is acceptable; for some it’s not 🚗
  • Zero administration — lambda takes care of everything required to run and scale 📈 your code with high availability
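The cost figure above can be sanity-checked against Lambda’s pricing at the time of writing (~$0.00001667 per GB-second of compute, $0.20 per million requests). Assuming the function is configured with the maximum 3008 MB of memory (my assumption — the memory setting isn’t stated above), the arithmetic lands right on the quoted number:

```python
# Rough AWS Lambda cost estimate (pricing as of late 2017 -- check current rates).
GB_SECOND_PRICE = 0.00001667      # USD per GB-second of compute
REQUEST_PRICE = 0.20 / 1_000_000  # USD per invocation

def monthly_cost(invocations, duration_s, memory_mb):
    """Estimate the monthly bill for a Lambda function."""
    gb_seconds = (memory_mb / 1024) * duration_s * invocations
    return gb_seconds * GB_SECOND_PRICE + invocations * REQUEST_PRICE

# 100,000 classifications/month, ~2.7 s each, assumed 3008 MB memory
print(round(monthly_cost(100_000, 2.7, 3008), 2))  # → 13.24
```

The compute portion alone comes to ~$13.22; the per-request charge adds only two cents at this volume.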

Why (from our software engineering perspective)

  • Reduced duplication — we may be serving our model via a variety of front-ends (e.g. web, mobile, chat-bot, etc.). Having to export our model to different libraries/frameworks, rewrite inference code, and test for each supported platform sounds painful 😖
  • Flexibility — in part two of this series I’ll present an example of a complete production pipeline (the front-end uploads an image to s3 and related meta-data (e.g. patient demographics, symptoms, etc.) to a database, which triggers our lambda function, which executes and posts results back to the database, which our front-end then pulls) 🏃
  • Improved modularity 🗂 — updating our model becomes trivial. Simply update the s3 object where it’s stored. No need to touch lambda or any other part of your pipeline. This will enable continuous builds and deploys of models as R&D progresses, more data is collected, and precision 🎯 improves. No headaches 🤕
  • Environmentally friendly 🌏 — we have a solid grasp on the tools we’ve been working with during R&D and well-tested, high-churn code. No need to switch things up in a different production environment
  • Improved security — our model is a major asset 👜. Our model never needs to be stored on-disk in lambda (if you’re paranoid) and should probably be encrypted. Same goes for user-data 🔐
  • Much less code ✊


AWS provides great documentation to quickly get started with Lambda, which I won’t repeat here. I’ll help you through the tricky, deep-learning-specific parts. After reading this post, follow the tutorial below and refer to the code and tips I provide as needed.

The major 🔑

Lambda imposes some limitations that can be tight for deep learning use-cases.

Luckily as software engineers we’re good in tight spots 🤔 (unless you’re a Node.js developer 🙌)

The major limit we will run into is the 250 MB cap on the uncompressed deployment package size (the 50 MB limit on the compressed deployment package is not enforced if we pull the package from s3 — major 🔑 #2). The key to reducing the size of our deployment package is building large dependencies (in this example numpy and pytorch) from source (building from source also builds character 💪). No hacks or complicated workarounds necessary 🙏.

pytorch alone is larger than 1 GB when installed from pre-built binaries. By building pytorch from source we can reduce its size to ~124 MB. We save so much space because we specify that we aren’t using CUDA (AWS Lambda doesn’t offer GPUs anyway).

Lines 35–70 are the major 🔑, everything else is from the AWS Lambda tutorial linked above

Our lambda function (triggered by an image being uploaded to Amazon s3) 😎

Using a decorator to set up the model is a little extra, but it’s an elegant way to enable the model (pulled from s3 when `main` is first loaded) to persist in memory across requests — and that means saved 💸
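A minimal sketch of that decorator pattern (the names here are my own illustration, not the post’s actual code — in the real handler the loader would pull the weights from s3 with boto3 and deserialize them with pytorch):

```python
import functools

def setup_model(loader):
    """Decorator: run `loader` once, then pass its cached result to the handler.

    Lambda reuses the Python process across warm invocations, so anything
    cached like this survives between requests -- we only pay the
    model-download/deserialize cost on cold starts.
    """
    def decorate(handler):
        cache = {}

        @functools.wraps(handler)
        def wrapper(event, context):
            if "model" not in cache:
                cache["model"] = loader()  # cold start: fetch + load the model
            return handler(event, context, cache["model"])
        return wrapper
    return decorate

calls = []

def load_model():
    # Stand-in for: download weights from s3, torch.load(...) them
    calls.append(1)
    return "resnet-18"

@setup_model(load_model)
def handler(event, context, model):
    return f"classified {event['key']} with {model}"

print(handler({"key": "cat.jpg"}, None))  # loader runs (cold start)
print(handler({"key": "dog.jpg"}, None))  # cached -- loader does not run again
print(len(calls))  # → 1
```

The same effect can be had by loading the model at module scope; the decorator just makes the dependency explicit and keeps the handler signature clean.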

That’s it… This is just the tip of the iceberg. Python dependencies can be stripped down much more if needed (some build flags/optimizations and a simple script can cut total size in half by avoiding data duplication). If you hit scale, you can think about exporting your trained model to onnx and then importing it into a more production-oriented deep learning library such as caffe2.

Stay tuned for an example of a complete production pipeline following this approach!

If your company needs an experienced software engineer specialized in data engineering/science and deep/machine learning, contact me for consulting services.

Written by Michael Dietz
Lead developer ⟠ blockimmo
