Primal Data Advent Calendar #4: Deploying BERT in AWS Lambda

Andrej Švec
Slido developers blog
7 min read · Dec 7, 2020

This piece is a part of the Primal Data Advent Calendar, a series of miscellaneous articles written by the Slido Data Team. We release one on each Advent day that happens to be a prime (like 7 for the 7th of December). Enjoy!

BERT models are becoming more and more popular; many state-of-the-art models feature BERT or one of its derivatives. It comes as no surprise that there is a strong motivation to put these models into production.

However, BERT models are quite large. The basic BERT Base model is around 420 MB in size, and larger models easily reach a gigabyte, e.g. RoBERTa Large (1.5 GB). This makes the deployment of such models more difficult and costly.

Fortunately, there are also smaller BERT models that can compete with the larger ones. TinyBERT reports 96 % of the performance of BERT Base on the GLUE benchmark and Google’s MobileBERT reaches 99 %. Furthermore, a recently published (but not yet peer-reviewed) paper from Amazon Alexa introduces the Bort model, which even surpasses standard BERT.

In this post, we explain how TinyBERT or MobileBERT models can be deployed in a serverless environment (AWS Lambda) with total costs on the order of a few dollars per million predictions and with performance not far from that of the BERT Base model.

AWS Lambda

AWS Lambda is a serverless environment within Amazon Web Services that provides a very cost-effective deployment option. The total cost of 1 million requests that each require 1 GB of RAM and take 100 ms comes to $0.20 (for the number of requests) + $1.67 (for the duration of the requests) = $1.87.
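
As a sanity check, the estimate can be reproduced from the per-request and per-GB-second prices (the GB-second price below is the one implied by the numbers above and matches the AWS price published at the time of writing):

requests = 1_000_000
memory_gb = 1.0
duration_s = 0.1

# $0.20 per million requests, $0.0000166667 per GB-second.
request_cost = requests / 1_000_000 * 0.20
duration_cost = requests * memory_gb * duration_s * 0.0000166667

print(f"requests: ${request_cost:.2f}")                   # $0.20
print(f"duration: ${duration_cost:.2f}")                  # $1.67
print(f"total:    ${request_cost + duration_cost:.2f}")   # $1.87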

AWS Lambda serverless applications are also very easy to use. The environment takes care of starting a container with the user code, running the code and killing the container once it is no longer needed.

The user is allowed a free 10-second initialization on container startup, which should be enough to load a model. If the Lambda fails to initialize within 10 seconds, the initialization is restarted and the computing time is billed to the user. Once initialized, the Lambda is kept alive for an unspecified period of time (some unconfirmed sources say 5–15 minutes) even if there are no requests, so the initialization does not happen on every request.

Limitations

There are obviously limitations to what AWS Lambda can do. Here are the most important ones:

  • 250 MB (262,144,000 B) deployment package. The code, the dependencies and the model need to fit within this limit, which explains why standard BERT models cannot be deployed in such an environment. The limit can be hacked around by making use of the 512 MB of /tmp storage and downloading the model or the dependencies on initialization, but this solution is obviously slower (see the sketch after this list).
  • 900-second timeout. If you need to run longer jobs, AWS Lambda is probably not the right tool for the job.
  • 3 GB of RAM. This should be sufficient for executing a normal BERT model, since the inputs are usually at most 512 tokens long, but it can be limiting for some applications.
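
For completeness, here is a minimal sketch of the /tmp workaround mentioned in the first point. It assumes the model lives in an S3 bucket (the bucket and key names are placeholders) and downloads it during the initialization phase, outside the handler:

import boto3

MODEL_PATH = "/tmp/model.bin"

# Runs once per container, during the (free) initialization phase.
# The bucket and key names are hypothetical placeholders.
s3 = boto3.client("s3")
s3.download_file("my-model-bucket", "models/model.bin", MODEL_PATH)


def handler(event, context):
    # Load and use the model from MODEL_PATH here; the file downloaded
    # during initialization is reused across warm invocations.
    ...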

Even though AWS Lambda imposes limits on various system resources, these can still be worked around with the steps described below.

Conversion of the model to ONNX

Having a model that is small enough to fit in the 250 MB deployment package is not sufficient; we also need a library to run it. Standard neural network libraries, unpacked in Lambda, are too big: PyTorch is around 400 MB in size and TensorFlow around 850 MB, so these libraries will not be of much help.

Luckily, there are TensorFlow Lite (6 MB) and ONNX runtime (14 MB). We only address the latter in this post, as we did not manage to run a BERT model using the TensorFlow Lite interpreter.

In order to run the BERT model using ONNX runtime, we need to convert it to the ONNX format. This is very straightforward with the transformers library, which comes with a conversion script and a conversion tutorial. You can simply run:

python convert_graph_to_onnx.py --framework <pt|tf> --model <pretrained_model_dir> <model>.onnx

where --framework is pt for PyTorch or tf for TensorFlow, and pretrained_model_dir is a directory containing a model saved using model.save_pretrained() and a vocab.txt file for the tokenizer.
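
If you are starting from a model on the Hugging Face hub, a directory in this format can be produced with save_pretrained(). A minimal sketch (the checkpoint name is only an example; use your own fine-tuned model):

from transformers import AutoModel, AutoTokenizer

# Example checkpoint; replace with your own (fine-tuned) model.
checkpoint = "google/mobilebert-uncased"

model = AutoModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Writes the model weights, config.json and vocab.txt into the directory
# expected by the conversion script.
model.save_pretrained("pretrained_model_dir")
tokenizer.save_pretrained("pretrained_model_dir")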

Optimizations and quantization

The conversion script even allows us to optimize and quantize the model. Both optimizations and quantization are enabled by using the --quantize option. The procedure creates two extra ONNX models:

  • Optimized model. The first model is just optimized for execution speed.
  • Quantized model. The second model is both optimized and quantized. This cuts the size by a factor of 4, but the model’s predictive performance decreases as well. A quantized BERT Base model is 110 MB in size.
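
As an alternative to the --quantize flag, ONNX runtime itself ships a dynamic quantization helper that can be applied to an already exported model; a minimal sketch, with file names as placeholders:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the exported model's weights to 8-bit integers.
# The input and output file names are placeholders.
quantize_dynamic(
    "model.onnx",
    "model-quantized.onnx",
    weight_type=QuantType.QInt8,
)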

Test the equality of the converted model

You may run into some warnings during the conversion process, e.g.
"Converting a tensor to a Python boolean might cause the trace to be incorrect."
It therefore makes sense to test whether the converted model behaves identically to the original model.

To do this, we load both models and compare their outputs. For MobileBERT, the maximum difference across all output values is on the order of 10^(-8), which comes from floating-point imprecision.
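
A minimal version of this check might look as follows; it assumes the original model in pretrained_model_dir and the exported model.onnx, and feeds the ONNX session only the inputs its graph declares:

import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pretrained_model_dir")
model = AutoModel.from_pretrained("pretrained_model_dir")
model.eval()

session = ort.InferenceSession("model.onnx")

inputs = tokenizer("Hey, how are you?", return_tensors="pt")

# Output of the original model (last hidden state).
with torch.no_grad():
    torch_output = model(**inputs)[0].numpy()

# Output of the converted model, fed with the same inputs.
onnx_feed = {
    name: inputs[name].numpy()
    for name in (i.name for i in session.get_inputs())
    if name in inputs
}
onnx_output = session.run(None, onnx_feed)[0]

print("max difference:", np.abs(torch_output - onnx_output).max())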

Prepare Lambda code

Once we have the model ready, the next step is to create a Lambda repository with the following structure:

  • model directory containing the model model.onnx and tokenizer vocabulary vocab.txt
  • main.py file generating predictions using the model
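
A minimal sketch of what main.py might look like is shown below; the tokenizer class and the raw-output response format are assumptions, so adapt them to your task:

import onnxruntime as ort
from transformers import BertTokenizer

# Both the ONNX model and the tokenizer vocabulary ship in the model/
# directory of the deployment package. Loading them here, outside the
# handler, means it happens only once per container.
tokenizer = BertTokenizer("model/vocab.txt")
session = ort.InferenceSession("model/model.onnx")


def handler(event, context):
    # The event is expected to carry a "text" field, as in the test
    # input shown later in this post.
    encoded = tokenizer(event["text"], return_tensors="np")

    # Feed only the inputs that the exported graph actually expects.
    feed = {
        name: encoded[name]
        for name in (i.name for i in session.get_inputs())
        if name in encoded
    }
    outputs = session.run(None, feed)

    # Return the raw model outputs; post-processing depends on the task.
    return {"outputs": [o.tolist() for o in outputs]}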

Runtime dependencies

The code above depends on the following packages, which we can put in a requirements.txt file:

onnxruntime
transformers # needed for BERT tokenizer

Creating the deployment package

The Python Lambda deployment package is a ZIP file with the code and dependencies that will be unpacked in the Lambda container (an environment similar to the lambci/lambda Docker image). The deployment package can be created in the following steps:

  1. Create a build folder.
  2. Install dependencies in the build folder.
    pip install -r requirements.txt --target build/
  3. Copy source and resource files to the build folder. It is necessary to copy the model along with the tokenizer vocabulary, i.e. the whole model directory.
    cp -r main.py model build/
  4. Create the ZIP file.
    cd build && zip -r9 ../lambda.zip .

Deploying the package

A Python deployment package can be uploaded to AWS Lambda in two ways:

  • Direct ZIP upload. The Lambda documentation states that you can only use this method if the package is smaller than 50 MB. Otherwise, you might get weird errors: “Connection was closed before we received a valid response from endpoint URL.”
    aws lambda update-function-code --function-name my-function --zip-file fileb://lambda.zip
  • Upload to S3. If your file is larger than 50 MB, you need to upload the ZIP file to S3 first and then update the Lambda code.
    aws lambda update-function-code --function-name my-function --s3-bucket my-bucket --s3-key lambda.zip

Pruning the deployment package

If your deployment fails due to the ZIP file being slightly larger than 250 MB, you might still be able to deploy the Lambda. It is a bit ‘hacky’, but it might be sufficient to remove some unused parts of your libraries. For example, if you’re not using numpy.random, you can manually remove the numpy/random directory from the build folder and try uploading the code again.

No, we don’t like it either, but if manually removing a package makes the difference between the Lambda being an option and going back to the drawing board, it may sometimes make sense.

A screenshot of the disk usage analysis of a deployment package used to find large libraries to be pruned.
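
In place of a graphical disk usage tool, a few lines of Python (a hypothetical helper, not from the original post) can produce a similar per-package breakdown of the build folder:

from pathlib import Path

def dir_size_mb(path: Path) -> float:
    # Sum the sizes of all files below `path`, in megabytes.
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e6

# List the directories in the build folder, biggest first.
build = Path("build")
sizes = sorted(
    ((dir_size_mb(p), p.name) for p in build.iterdir() if p.is_dir()),
    reverse=True,
)
for size, name in sizes:
    print(f"{size:8.1f} MB  {name}")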

Testing the Lambda

Once the Lambda has been deployed, you can test it with an example input and observe the initialization and prediction times.

{
  "text": "Hey, how are you?"
}
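
One way to send such a test event is the Lambda console; the same can also be done from a script with boto3 (the function name is the placeholder used earlier):

import json

import boto3

client = boto3.client("lambda")

response = client.invoke(
    FunctionName="my-function",
    Payload=json.dumps({"text": "Hey, how are you?"}).encode(),
)
print(json.loads(response["Payload"].read()))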

Provisioned Concurrency

If the initialization is taking too long, you can keep the Lambda pre-initialized by setting up Provisioned Concurrency. The pricing of the feature is affordable: one can keep a 1 GB RAM Lambda initialized for a mere $12 per month.
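
Provisioned Concurrency is configured on a published version or an alias of the function; a minimal sketch with boto3 (the function name and alias are placeholders):

import boto3

client = boto3.client("lambda")

# Keep one pre-initialized instance of the "prod" alias warm.
client.put_provisioned_concurrency_config(
    FunctionName="my-function",
    Qualifier="prod",
    ProvisionedConcurrentExecutions=1,
)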

More RAM, more speed

It turns out that you can reduce the model prediction time by allocating more RAM to the AWS Lambda, because CPU is allocated proportionally to RAM. The figure below shows that the execution time decreases as more RAM is allocated.

The speed-up is most significant in cases where the AWS Lambda environment does not provide enough RAM for all the data structures needed during the prediction, so extra garbage-collection cycles are triggered during execution.

Increasing the RAM beyond the amount actually needed for the prediction (~550 MB in the case of MobileBERT) continues to shorten the execution time, but not as significantly. With the maximum possible amount of RAM (3008 MB), MobileBERT reaches a response time of 20 ms.

MobileBERT Lambda execution time as a function of allocated RAM.

Conclusion

Despite what it might look like, BERT models can actually be used in AWS Lambda. By using smaller derivatives of BERT, such as TinyBERT or MobileBERT, in combination with the ONNX neural network runtime, we can produce small deployment packages that can easily be deployed in the AWS Lambda environment with minimal maintenance and execution costs. Our experience shows that although building the package is not entirely straightforward, in the end it is one of the easiest and cheapest ways of getting BERT-style models to production.
