ML model deployment using AWS Lambda: Tips and tricks

Pascal Niville
Feb 14 · 7 min read

The use of modern cloud services offers interesting opportunities in terms of cost, administrative ease and scalability. All the big cloud companies provide some form of serverless functionality, in which your code can be deployed without you having to worry about server management. You only pay for the number of seconds your code is actually running, which in general makes it much cheaper than hosting your own server 24/7.

In order to capture most of what the serverless environment has to offer, all of the computing has to be done within a very resource-limited environment, which is a challenging constraint for most machine learning (ML) applications.

At Ixor we wanted to test the feasibility of transforming our ML products into cloud-native applications. Therefore we selected two of our products to perform the test with: the first is a named entity recognition (NER) model for invoice recognition and the second is an image classifier for medical images.

Serverless with AWS

We use AWS as our cloud provider. Its serverless functionality is constructed out of three main components: Lambda, Fargate and Stepfunctions. Here is a quick overview of these components:

AWS Lambda

A Lambda function is a small virtual environment that runs your code. Extra dependencies required to run your code, e.g. Python packages, can be added through customisable Lambda layers.

When a Lambda function is launched, these layers have to be downloaded and installed into the environment (cold start). When execution completes, the Lambda function stays active for about 5 more minutes. If a second call comes in within this timespan, it can be executed straight away, without having to repeat the initialisation (warm start).

AWS Lambda cold and warm start [1]
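You can exploit warm starts by doing the heavy initialisation at module scope, so it runs once per cold start while the handler itself stays cheap. A minimal sketch (the _load_model function is a hypothetical stand-in for loading your actual ML model):

```python
import json
import time

def _load_model():
    time.sleep(0.1)  # stand-in for an expensive load (weights, graph, ...)
    return {"name": "demo-model"}

MODEL = _load_model()  # runs on cold start only; warm invocations reuse it

def handler(event, context=None):
    # the per-request work stays small; the model is already in memory
    return {"statusCode": 200, "body": json.dumps({"model": MODEL["name"]})}
```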

The main advantages of Lambda functions are that they are fast (even with a cold start your code is running within a couple of seconds) and that they are highly parallelisable. The main disadvantage is their resource limitations, of which the following are the most important:

  • Max storage capacity: 512 MB
  • Max size of all the Lambda layers: 250 MB
  • Max timeout: 15 min
  • Max size of the output JSON: 6 MB

A full list of all the limitations can be found in the AWS Lambda documentation.

AWS Fargate

AWS Fargate is a container-based service with fewer restrictions, but a higher operational burden. Its start-up time is slower than that of an AWS Lambda function, but it is a nice option for processes that need more computational resources or take longer than 15 minutes, e.g. model training.

The financial model is also different: whereas AWS Lambda charges you per invocation and per duration of each invocation, AWS Fargate charges you per second for the vCPU and memory resources that your containerised applications use. [2]

AWS Stepfunctions

AWS Stepfunctions is a useful tool for connecting your Lambda and/or Fargate functions into a pipeline. Here you select which function or which group of functions should run in parallel. The example below shows the flow of our pathology project, which is split into four Lambda functions. The “PatchGenerator” and “Inference” Lambda functions are configured to run in parallel.
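As an illustration, a pipeline with a parallel stage can be defined in Amazon States Language roughly as follows. This is a hedged sketch, not our actual definition: the function ARNs and the surrounding state names are placeholders.

```python
import json

state_machine = {
    "StartAt": "Preprocess",
    "States": {
        "Preprocess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:Preprocess",
            "Next": "ParallelStage",
        },
        "ParallelStage": {
            # a Parallel state runs all its branches concurrently
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "PatchGenerator",
                 "States": {"PatchGenerator": {
                     "Type": "Task",
                     "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:PatchGenerator",
                     "End": True}}},
                {"StartAt": "Inference",
                 "States": {"Inference": {
                     "Type": "Task",
                     "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:Inference",
                     "End": True}}},
            ],
            "Next": "Aggregate",
        },
        "Aggregate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:Aggregate",
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```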

Squeezing it all in

Because of the speed advantage, we wanted to construct our pipeline purely out of Lambda functions. To fit everything into these limited environments, we had to apply some tricks to get the job done.


First of all, because the Lambda layers (the building blocks that hold the configuration of the Lambda function) have a 250 MB constraint, we had to divide our big chunk of code over several Lambda functions. Where to cut the code is mainly determined by which Python packages the different parts of the code use.

In practice, this implies that the Python packages cannot exceed 250 MB in total, a threshold that is reached rapidly when you know that the numpy package alone already exceeds 80 MB.
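A quick way to check how close you are to that budget is to measure the installed package folders. A minimal sketch; point it at the folder where pip install -t placed your dependencies (the path in the usage example is illustrative):

```python
import os

def dir_size_mb(path):
    """Total size of a directory tree in MB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total / (1024 * 1024)
```

For example, dir_size_mb("build/python") tells you how much of the 250 MB layer limit your dependencies would consume.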

Double zip trick

A Lambda layer is actually a zip file that contains the required packages for your piece of code. The content of the zip file should look like the following:
- python
-- numpy
-- numpy.dist-info
-- ...

When you are building the lambda layer, AWS checks if the unzipped folder is smaller than 250 MB, if not, the layer is rejected.

If you want to put large packages like Pytorch (> 400 MB!) in a Lambda layer, you can zip its folder, reducing its size to 200 MB, and zip it again. This way the layer size remains below the 250 MB limit when the outer zip is unzipped.
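Building such a layer can be scripted. A sketch, assuming your dependencies were installed into a local folder (e.g. with pip install -t); the function and file names are illustrative:

```python
import os
import shutil
import zipfile

def build_double_zipped_layer(site_packages, out_zip="layer.zip"):
    # inner zip: compress the heavy packages
    base = os.path.join(os.path.dirname(os.path.abspath(out_zip)), "packages")
    inner = shutil.make_archive(base, "zip", site_packages)
    # outer zip: the inner zip next to a python/ folder, which in practice
    # holds the extraction script described below
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(inner, arcname=os.path.basename(inner))
        zf.writestr("python/", "")
    os.remove(inner)
    return out_zip
```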

However, the package files still have to be unzipped for Python to access them. This can only be done in the “/tmp” folder, the only writable folder in the Lambda function. To do this you can add a Python script to the Lambda layer, which is executed by importing it in your Lambda function. The code of the script is pasted below.

The double zipped lambda layer should have the following structure:
- packages.zip (the inner zip; any name ending in “.zip” works)
-- torch
-- numpy
-- ...
- python
-- pkgunzip.py (the extraction script)

The script to extract the package files into the “/tmp” folder looks as follows:

import os
import sys
import zipfile

pkgdir = '/tmp/packages'
sys.path.insert(0, pkgdir)  # prepend to sys.path so the extracted packages are importable
default_layer_root = '/opt'  # Lambda layers are mounted under /opt
lambda_root = os.getcwd() if os.environ.get('IS_LOCAL') == 'true' else default_layer_root
for zip_requirements in os.listdir(lambda_root):
    if zip_requirements.endswith(".zip"):
        zipfile.ZipFile(os.path.join(lambda_root, zip_requirements), 'r').extractall(pkgdir)

To trigger the script, add “import pkgunzip” as the first import in your lambda_function. Note that the zip files need read access for everyone, otherwise the extraction fails.

This approach will grant access to the packages you need, although it further tightens your margins, as the extracted files can consume a large part of your 512 MB of storage. In the case of Pytorch, more than 400 MB is consumed straight away.

To free this space up again you can remove the “/tmp” packages folder after everything is imported.
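The clean-up described above can be sketched as follows. Once the heavy modules have been imported, their code lives in memory (sys.modules), so the extracted files can be deleted; modules that lazily import submodules later will break, so only do this when everything you need is loaded.

```python
import shutil

def free_package_storage(pkgdir="/tmp/packages"):
    # reclaim /tmp storage after all heavy imports have completed
    shutil.rmtree(pkgdir, ignore_errors=True)
```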

Prune packages

Even though the double zip trick lets you squeeze all your packages into the Lambda layer, you will still encounter cases in which you need to reduce the size of the package itself. How to prune a package is unfortunately package-specific. Some general tricks are:

  • Delete the content of the folders and subfolders of package dependencies that you don’t really need and replace them with an empty folder.
    E.g. mlflow requires pandas, but the pyfunc module we are using doesn’t. Removing the pandas folder gave an import error, but replacing its content with an empty folder had no consequences for our piece of code and reduced the package size by 80 MB.
  • In Pytorch you can free up 80 MB by removing the tests folder.
  • Some people have already started open-sourcing stripped-down versions of packages. However, stripping can impose restrictions on use, e.g. the Pytorch one only works for inference, not for training.

File streaming

As your project will probably be a chain of many Lambda functions, you will often need to write results to and load results from S3 buckets. Since downloading them to your “/tmp” folder could overflow the 512 MB limit if the function stays warm for a longer period, it is better to stream the inputs directly to memory. In Python this can be done with the following code.

s3 = boto3.resource('s3')  # import boto3 first
s3.Object('mybucket', 'hello.txt').put(Body=b'bytestring of your obj')  # write from memory
data = s3.Object('mybucket', 'hello.txt').get()['Body'].read()  # read into memory


A practical tool for building your own packages is tiivik’s LambdaZipper. It launches a Docker container based on the AWS image, installs the packages you want and saves a zip file into your working directory. At the time of writing, you still need to extract the zip file and put all the folders in a python folder before zipping it again and uploading it to your Lambda layer.


Thanks to the high degree of parallelisation, we have been able to reduce the processing time of the image classification model by 80%. For NER on invoices, processing one document is slower (5 s) than with our original web service (2 s). But… this is only for one document. Since AWS Lambdas scale very effectively, it would also take only 5 seconds to process 500 documents, a degree of scaling our original web service couldn’t attain.

It is too early for us to go into a financial analysis, although this blog already gives a nice analysis of the pricing of the different AWS services.

We can conclude that the AWS serverless environment provides some nice opportunities for ML project deployment, even though you will most likely find yourself frequently running up against its resource limits.

At IxorThink, the machine learning practice of Ixor, we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable products from proof-of-concept to deployment. Feel free to contact us for more information.






IxorThink is the AI and Machine Learning practice of Ixor, a Belgian software vendor
