Deploying Hugging Face Transformers model on AWS Lambda with Docker containers

Sudhanva MG
13 min read · Aug 25, 2023


Leveraging Open-Source Power: Hugging Face, Big Tech, and the Revolution of Deployment

In recent years, open-source platforms have surged in popularity and utility, becoming central players in the world of artificial intelligence (AI) and machine learning (ML). Hugging Face, a trailblazer in this domain, has simplified the process of implementing state-of-the-art NLP models. But it is not alone in this open-source push. Tech juggernauts like Google, Facebook, and Microsoft have increasingly contributed by releasing open-source models, democratizing the field of AI and fostering a collaborative environment.

While having access to these models is revolutionary, the real challenge often lies in deployment. Traditionally, many have leaned on cloud provider endpoints such as Azure ML or AWS SageMaker for this purpose. While convenient, these services charge based on either the hardware used to host the model or the tokens per API call. With the former, costs can be predictable, but scaling often results in ballooning expenses. Cloud providers have tried to mitigate these expenses by offering solutions like real-time inference or serverless inference, but they don’t always provide the granular control and customization that many developers crave.

Enter AWS Lambda functions: a solution that offers both affordability and scalability. Historically, deploying substantial models like transformers was challenging due to Lambda’s size constraints. But with AWS’s announcement that Lambda now supports custom Docker container images up to 10GB, this hurdle has been removed. Developers can now harness the might of Hugging Face’s transformer models, deploying them efficiently using Lambda in tandem with Docker and AWS CDK.

Cost and Scalability: Delving into Lambda’s Financial Dynamics

One of the most crucial considerations when deploying transformer models (or any model, for that matter) is the associated cost and scalability. With Lambda, the equation is primarily centered on two factors: storage and invocation charges.

1. Storage Costs: Storage is rarely a significant part of the bill. The container image holding your model is stored in Amazon ECR, which charges a small per-GB-month fee, so even multi-gigabyte models contribute little to the overall price.

2. Invocation and Memory Costs: The real weight of the cost comes from the number of Lambda invocations and the memory consumed during them. A key point to remember is that allocating more memory also allocates proportionally more CPU, which can lead to faster execution times. You can determine a sensible memory setting by assessing your Lambda’s performance during local testing (a process we’ll delve into later in the blog).

To gauge the size requirements for a particular model, inspect the “.bin” file within the Hugging Face model repository. This will provide a solid estimate of the model’s footprint. Given that AWS Lambda now supports Docker containers up to 10GB, most standard models — be it for sentiment analysis, question answering, or other foundational tasks — should comfortably fit within this limit.
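
If you prefer to check programmatically rather than eyeballing the repository page, the Hugging Face Hub client can report per-file sizes. A rough sketch (this assumes the optional huggingface_hub package is installed; it is not needed anywhere else in this guide):

from huggingface_hub import HfApi

MODEL_NAME = "cambridgeltl/sst_mobilebert-uncased"
# Ask the Hub for per-file metadata so each file carries its size in bytes
info = HfApi().model_info(MODEL_NAME, files_metadata=True)

weight_bytes = sum(
    f.size or 0
    for f in info.siblings
    if f.rfilename.endswith((".bin", ".safetensors"))
)
print(f"Model weights: {weight_bytes / 1e9:.2f} GB")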

Let’s Crunch Some Numbers: Consider deploying a transformer model of size 4GB with a memory allocation of 6GB. The AWS Lambda free tier provides 1 million invocations and 400,000 GB-seconds of compute time monthly. Once you exceed the free tier, the next 100,000 invocations cost approximately $0.02 in request charges. For compute, assuming an average execution duration of 5 seconds, those invocations consume 100,000 × 5 s × 6 GB = 3,000,000 GB-seconds, which at roughly $0.0000166667 per GB-second comes to about $50. The total cost for an additional 100,000 invocations is therefore approximately $50.02.
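
To make the arithmetic explicit, here is the same estimate as a small Python sketch. The per-request and per-GB-second rates are the publicly listed Lambda prices at the time of writing and may have changed, so treat the output as an approximation:

# Back-of-the-envelope cost estimate for 100,000 extra Lambda invocations
REQUEST_PRICE = 0.20 / 1_000_000   # USD per request
GB_SECOND_PRICE = 0.0000166667     # USD per GB-second of compute

invocations = 100_000
memory_gb = 6        # memory allocated to the function
duration_s = 5       # average execution time per invocation

request_cost = invocations * REQUEST_PRICE                              # ~$0.02
compute_cost = invocations * memory_gb * duration_s * GB_SECOND_PRICE   # ~$50.00

print(f"Requests: ${request_cost:.2f}, Compute: ${compute_cost:.2f}, "
      f"Total: ${request_cost + compute_cost:.2f}")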

Note: It’s essential to bear in mind that this calculation is a basic estimate. The actual costs could be different based on factors like execution duration, data transfer, and additional AWS services utilized. Always refer to AWS’s official documentation and pricing pages for up-to-date and detailed information.

Setting Up Python Environment

Before you start, ensure you have Python 3.11 installed.

# Check Python version
python --version

If not, download Python 3.11 and install it.

Install the Required Libraries

Create a virtual environment to isolate the dependencies. Navigate to your project directory and run:

python -m venv venv

Activate the virtual environment:

  • Windows:
venv\Scripts\activate
  • macOS/Linux:
source venv/bin/activate

Now, install the necessary libraries:

pip install transformers torch

Setting up the Project Structure

Create the necessary folder structure:

mkdir lambda-service
cd lambda-service
mkdir lambda

The lambda directory is where you'll place the Python scripts that make up the Lambda function, along with the bundled model and the Dockerfile.

Downloading the Model and Tokenizer

Navigate to the lambda-service folder and create a script, say download_model.py.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "cambridgeltl/sst_mobilebert-uncased"
# Download and save the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.save_pretrained("./lambda/model")
# Download and save the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained("./lambda/model")
print("Model and tokenizer downloaded successfully!")

Execute this script:

python download_model.py

After executing this script, the model and tokenizer will be saved in the ./lambda/model directory. These will be bundled with your Docker image later on. This way, during Lambda invocations, the model and tokenizer are directly loaded from this directory, which is faster than fetching from the Hugging Face model hub.

Note: When using this approach, ensure you have enough space in your Docker image, considering that AWS Lambda has a 10GB limit for container images.

Also, it’s essential to be aware that the initialization time for your Lambda (i.e., the “cold start” time) might increase because loading large models into memory can take a few seconds. However, this trade-off is beneficial as it prevents the model from being downloaded repeatedly and ensures that the Lambda function can work entirely offline (once the model is loaded).
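
Before moving on to Docker, it can be worth a quick sanity check that the saved files load and predict correctly. A minimal sketch, run from the lambda-service folder so the relative path resolves:

from transformers import pipeline

# Load the model and tokenizer from the local folder we just created,
# exactly as the Lambda handler will later do inside the container
classifier = pipeline("text-classification", model="./lambda/model", tokenizer="./lambda/model")

print(classifier("I love AI"))
# Expect something like [{'label': ..., 'score': ...}]; the label names depend on the model config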

Why Download the Model Directly into Docker?

AWS Lambda is incredibly versatile, but it comes with certain constraints, especially concerning disk storage and writable directories. Typically, when you use AutoModel.from_pretrained() with Hugging Face, the library downloads the model and then caches it. This default behavior assumes a persistent, writable cache directory, so the model only needs to be downloaded once.

However, Lambda’s environment presents two challenges:

  1. Read-Only Directories: Most directories in a Lambda function are read-only. You can point the TRANSFORMERS_CACHE environment variable at the writable /tmp directory (see the sketch after this list), but that leads us to the second challenge.
  2. Limited /tmp Storage: By default, Lambda provides only 512MB of ephemeral /tmp storage (more can be configured, at additional cost). Many transformer models, especially those fine-tuned for specific tasks, easily exceed this limit.
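
For reference, the cache-redirect workaround mentioned above would look roughly like the sketch below. We deliberately avoid it: the model would be re-downloaded on every cold start, and the download can overflow the default /tmp allocation.

import os

# Point the Hugging Face cache at the only writable path in Lambda,
# *before* importing transformers, so the library picks it up
os.environ["TRANSFORMERS_CACHE"] = "/tmp/hf-cache"

from transformers import AutoModelForSequenceClassification

# Downloads the weights into /tmp on every cold start (slow), and the
# weights can easily exceed the default 512MB of ephemeral storage
model = AutoModelForSequenceClassification.from_pretrained(
    "cambridgeltl/sst_mobilebert-uncased"
)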

Therefore, to ensure our model operates efficiently within a Lambda function, we adopt an alternate strategy: bundle the model (and its tokenizer) directly into the Docker container we’ll deploy to Lambda. By doing so, we achieve the following:

  • Scalability: This approach is storage-efficient. Lambda recently started supporting custom Docker runtimes up to 10GB, which is more than enough for most transformer models.
  • Reliability: By embedding the model into our container, we reduce external dependencies, ensuring our Lambda function works seamlessly without relying on external downloads.

Add a requirements.txt file for the Docker image inside the lambda directory:

torch
transformers

Lambda Handler for Sentiment Analysis

The code below implements a streamlined AWS Lambda handler for sentiment analysis with the Hugging Face transformer model. The model and its tokenizer are loaded once at import time, which keeps cold starts as short as possible. On each invocation, the handler extracts the input text, tokenizes it, runs the model, and returns the predicted sentiment label in the response. Save it as handler.py inside the lambda directory (the Dockerfile below expects this name), and adapt it as needed for the specific transformer model you use.

import json

import torch
from transformers import AutoConfig, AutoTokenizer, MobileBertForSequenceClassification

# Load the model and tokenizer once, at import time (cold start optimization)
MODEL_PATH = "./model"  # Path where the model is bundled inside the Docker image
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
config = AutoConfig.from_pretrained(MODEL_PATH)
model = MobileBertForSequenceClassification.from_pretrained(MODEL_PATH, config=config)
model.eval()  # Set the model to evaluation mode


def lambda_handler(event, context):
    """AWS Lambda function handler for predicting sentiment."""
    # API Gateway delivers the request body as a JSON string; direct or local
    # invocations may pass the payload as the event itself
    body = event.get("body", event)
    if isinstance(body, str):
        body = json.loads(body)
    text = body["text"]

    # Tokenize the text and convert to tensors
    inputs = tokenizer([text], max_length=256, truncation=True, padding=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)

    prediction = torch.argmax(outputs.logits, dim=-1)[0].item()
    # Map the class index to its label name (falls back to LABEL_0/LABEL_1
    # if the model config defines no human-readable labels)
    sentiment = model.config.id2label[prediction]

    # Return an API Gateway-compatible response
    return {
        "statusCode": 200,
        "body": json.dumps({
            "text": text,
            "sentiment": sentiment,
        }),
    }
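
Before containerizing, you can exercise the handler directly from a Python shell. A quick smoke test, run from inside the lambda directory so the relative ./model path resolves:

# Quick local smoke test for handler.py
from handler import lambda_handler

# Mirrors the payload used for the Docker-based test later in this guide
event = {"text": "I love AI"}
print(lambda_handler(event, None))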

Dockerfile for the Lambda

Create a Dockerfile inside the lambda directory (alongside handler.py, requirements.txt, and the model folder):

FROM public.ecr.aws/lambda/python:3.11

COPY requirements.txt ./
RUN python3 -m pip install -r requirements.txt --target ${LAMBDA_TASK_ROOT}

COPY ./ ${LAMBDA_TASK_ROOT}/

CMD [ "handler.handler" ]

Here’s a breakdown of what the Dockerfile does:

  1. FROM public.ecr.aws/lambda/python:3.11: This specifies the AWS Lambda Python 3.11 base image. AWS provides official images for every runtime that Lambda supports.
  2. COPY requirements.txt ./: This copies your requirements.txt into the Docker container's working directory.
  3. RUN python3 -m pip install -r requirements.txt --target ${LAMBDA_TASK_ROOT}: Here you install the required packages inside the container using pip, placing them in the LAMBDA_TASK_ROOT directory. The --target option lets you specify where the packages should be installed, which is critical for AWS Lambda.
  4. COPY ./ ${LAMBDA_TASK_ROOT}/: This copies all the files from your current directory on the host machine into the ${LAMBDA_TASK_ROOT} directory of the Docker container. This step is crucial, since it moves your handler code, the downloaded model, and other essential files into the container.
  5. CMD ["handler.handler"]: This defines the default command executed when the container starts. handler.handler means there is a file named handler.py containing a function named handler. Lambda executes this function whenever an event triggers it.

Build the Docker Image

First, from inside the lambda directory (the build context containing the Dockerfile, handler, and model), build the Docker image using the docker build command:

docker build -t sentiment-analyzer:latest .

Here:

  • -t sentiment-analyzer:latest gives the Docker image a tag/name, which in this case is sentiment-analyzer with the tag latest.
  • . indicates the Dockerfile's location, which is the current directory.

Run the Docker Image

Once the image is built, you can run it using the docker run command. The AWS Lambda base image ships with the Lambda Runtime Interface Emulator, which listens on port 8080 inside the container, so you map a local port to it and then simulate an invocation:

docker run -p 9000:8080 sentiment-analyzer:latest

This command maps port 9000 on your local machine to port 8080 inside the container, where the runtime emulator is listening.

Invoke the Lambda Function Locally

With the Docker container running, in a separate terminal window, simulate a lambda invocation using curl:

curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{"text": "I love AI"}'

You should see the output of your lambda function in response. If you encounter errors, the logs in the terminal where you ran the Docker container can provide valuable debugging information.
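
If curl is not handy (for example, on Windows), the same invocation can be made from Python. A small sketch, assuming the requests library is installed:

import requests

# The Lambda Runtime Interface Emulator inside the AWS base image exposes this
# invocation path on whichever local port you mapped (9000 in the command above)
url = "http://localhost:9000/2015-03-31/functions/function/invocations"

response = requests.post(url, json={"text": "I love AI"})
print(response.json())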

Setting up the CDK

AWS Cloud Development Kit (CDK) empowers developers to define cloud infrastructure in code and provision it using AWS CloudFormation. By employing CDK, we’ll be able to seamlessly manage and scale our AWS resources without the hassle of dealing with large, often confusing CloudFormation templates.

Creating the CDK Folder Structure

Directory Structure: Start by setting up a dedicated directory for the CDK code within your main project. Navigate to the lambda-service folder and create a new directory named cdk:

cd lambda-service 
mkdir cdk
cd cdk

CDK Toolkit: Before we move further, ensure that the AWS CDK toolkit is installed. If you haven’t done this already, you can find detailed installation instructions in the AWS CDK documentation.

Initializing the CDK App: Once you have the CDK toolkit in place, initialize a new CDK app using the TypeScript language. This will generate a pre-defined directory structure tailored for a TypeScript-based CDK app:

cdk init app --language typescript

With these steps, you’re now set up with a standard CDK directory. This structure will make it straightforward to add resources, manage dependencies, and deploy your stack to AWS. In the upcoming sections, we will define our resources and deploy our model onto Lambda using this CDK setup.

Installing Required Libraries

Here are the required CDK libraries (the exact versions are pinned in the install command below):

  • API Gateway (V2)
  • Lambda (Go and Python)
  • Standard CDK library

To ensure smooth integration and avoid breaking changes between these pre-release (alpha) modules, pin the versions exactly as shown.

Execute the following command within the cdk folder to install these libraries.

npm install @aws-cdk/aws-apigatewayv2-alpha@2.76.0-alpha.0 \
@aws-cdk/aws-apigatewayv2-authorizers-alpha@2.76.0-alpha.0 \
@aws-cdk/aws-apigatewayv2-integrations-alpha@2.76.0-alpha.0 \
@aws-cdk/aws-lambda-go-alpha@2.76.0-alpha.0 \
@aws-cdk/aws-lambda-python-alpha@2.76.0-alpha.0 \
aws-cdk-lib@2.92.0

Defining the CDK Stack

Create a cdk-stack.ts file at the root of the cdk folder (or move the generated one there; the cdk.json change later in this guide points directly at this location) and give it the following contents:

import {
  App,
  CfnOutput,
  Duration,
  Stack,
} from "aws-cdk-lib";
import * as path from "path";
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { DockerImageCode } from "aws-cdk-lib/aws-lambda";
import { HttpApi, HttpMethod } from "@aws-cdk/aws-apigatewayv2-alpha";
import { HttpLambdaIntegration } from "@aws-cdk/aws-apigatewayv2-integrations-alpha";

const rootPath = path.join(__dirname, "..");

const app = new App();

const stack = new Stack(app, "LambdaMLStack", {
  env: {
    region: "eu-west-1",
    account: process.env.CDK_DEFAULT_ACCOUNT,
  },
});

// Docker build context: the lambda folder with the Dockerfile, handler and bundled model
const entry = path.join(rootPath, "lambda");

const lambdaFn = new lambda.DockerImageFunction(
  stack,
  "SentimentLambda",
  {
    code: DockerImageCode.fromImageAsset(entry),
    memorySize: 5120,
    timeout: Duration.minutes(15),
  }
);

const sentimentIntegration = new HttpLambdaIntegration('sentimentIntegration', lambdaFn);

const api = new HttpApi(stack, "SentimentApi");

api.addRoutes({
  path: '/qa',
  methods: [ HttpMethod.POST ],
  integration: sentimentIntegration,
});

new CfnOutput(stack, "SentimentEndpoint", {
  value: api.url || "",
});

Setting the Stage

const rootPath = path.join(__dirname, "..");

This line creates a rootPath variable that holds the path to the parent directory of the current script, which in our layout is the lambda-service folder.

const app = new App();

Here, an instance of the CDK App is initialized. The App is the root construct of a CDK application; synthesizing it produces the cloud assembly that gets deployed.

Creating a Stack

const stack = new Stack(app, "LambdaMLStack", {
  env: {
    region: "eu-west-1",
    account: process.env.CDK_DEFAULT_ACCOUNT,
  },
});

A new cloud stack called LambdaMLStack is being defined here. The stack is deployed to the eu-west-1 region and linked to the AWS account specified in the CDK_DEFAULT_ACCOUNT environment variable.

Setting up the Lambda Function

const entry = path.join(rootPath, "lambda");

This resolves the path to the lambda directory inside lambda-service, which contains the Dockerfile, the handler, and the bundled model. CDK uses this directory as the Docker build context.

const lambdaFn = new lambda.DockerImageFunction(
  stack,
  "SentimentLambda",
  {
    code: DockerImageCode.fromImageAsset(entry),
    memorySize: 5120,
    timeout: Duration.minutes(15),
  }
);

A Lambda function is defined here from a Docker image asset: CDK builds the image from the entry directory and deploys it as a container-based function, a newer Lambda feature that supports images of up to 10 GB. The function is given 5120MB of memory and a 15-minute timeout, in line with the sizing discussion earlier.

Integrating Lambda with an API

const sentimentIntegration = new HttpLambdaIntegration('sentimentIntegration', lambdaFn);

This creates an integration between an HTTP API Gateway and the Lambda function. Essentially, it’s saying “when the API is hit, run this Lambda function.”

const api = new HttpApi(stack, "SentimentApi");

An HTTP API is defined with the name SentimentApi.

api.addRoutes({
  path: '/qa',
  methods: [ HttpMethod.POST ],
  integration: sentimentIntegration,
});

A new route (endpoint) /qa that accepts POST requests is being added to the API. When this route is accessed, it will trigger the sentimentIntegration, and thus the Lambda function.

Output

new CfnOutput(stack, "SentimentEndpoint", {
  value: api.url || "",
});

This is an AWS CloudFormation output. When the stack is deployed, the URL of the SentimentApi will be printed as an output, so you can easily access and test the API. If for some reason api.url doesn't exist, it defaults to an empty string.

Updating the cdk.json file

Navigate to the root of your CDK project and open the cdk.json file. You should see a JSON structure similar to:

{
"app": "node ts-node --prefer-ts-exts bin/cdk-app.js"
}

Replace it with:

{
"app": "npx ts-node --prefer-ts-exts cdk-stack.ts"
}

This change instructs the CDK Toolkit to use ts-node (a TypeScript execution engine for Node.js) to run the cdk-stack.ts file directly. The --prefer-ts-exts flag ensures TypeScript files take priority over compiled JavaScript files with the same name.

Now, whenever you run CDK commands like cdk deploy or cdk synth, the toolkit will use the cdk-stack.ts file as the entry point for your application.

Setting up AWS Credentials

Before you can deploy resources to AWS using CDK, AWS CLI, or any AWS SDK, you must first set up your credentials.

Install AWS CLI: If you haven’t already, install the AWS Command Line Interface. This provides you with the tools to manage your AWS resources from the command line and to set up credentials.

pip install awscli

Configure AWS CLI: After installing, run the aws configure command. This command will prompt you for your AWS Access Key, Secret Key, default region, and default output format.

aws configure

Make sure to enter the appropriate values. These credentials are typically provided to you when you create an IAM (Identity and Access Management) user within your AWS account. Ensure that this IAM user has the necessary permissions to deploy the resources defined in your CDK app.
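
A quick way to confirm that the credentials are picked up correctly is to ask AWS STS which identity you are acting as. A small sketch using boto3 (assuming it is installed, for example via pip install boto3):

import boto3

# Prints the account ID and the IAM identity that CDK will deploy with
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])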

Deploying with CDK

Bootstrap the CDK Environment (if you haven’t done so): This step is necessary the first time you deploy a CDK app to a new environment. It sets up resources that the CDK CLI needs to deploy your CDK app.

cdk bootstrap

Deploy the Stack: Now you’re ready to deploy your stack. Navigate to your CDK project directory and run:

cdk deploy

The CDK CLI will display information about the resources that will be created or modified as part of this deployment. If you’re okay with these changes, confirm the deployment.

Monitor Deployment: The CDK CLI will provide updates as resources are created and configured. Once everything is done, it should print the CloudFormation outputs you’ve defined, like the SentimentEndpoint from your previous code.
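
Once the SentimentEndpoint output appears, you can exercise the deployed route end to end. A minimal sketch with the requests library, where the placeholder URL must be replaced with your own stack output:

import requests

# Replace with the SentimentEndpoint value printed by `cdk deploy`
API_URL = "https://<your-api-id>.execute-api.eu-west-1.amazonaws.com"

response = requests.post(f"{API_URL}/qa", json={"text": "I love AI"})
print(response.status_code, response.json())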

Wrapping Up

Deploying machine learning models, especially the ones from the Hugging Face library, in a serverless environment poses unique challenges due to memory and storage constraints. Yet, the solution we’ve walked through manages to bypass some of these challenges, making use of AWS Lambda’s Docker support and AWS CDK’s infrastructure as code approach. While this method is particularly suited for medium-sized models, it offers a cost-effective and scalable way to test and iterate on models before pushing them for larger-scale usage.

For those who wish to delve deeper or adapt the solution to their own use cases, the entire codebase for this project is available on GitHub: https://github.com/mgsudhanva/sentiment-service. A special nod of acknowledgement goes to the insights derived from this instructive article: https://towardsdatascience.com/serverless-bert-with-huggingface-aws-lambda-and-docker-4a0214c77a6f, which laid foundational knowledge for parts of this guide.

Thank you for journeying through this tutorial. I trust it provided valuable insights and I’d be elated if it paves the way for your next successful deployment. Stay curious and keep coding!

Looking Ahead: What’s Next in Part 2?

Our exploration into deploying Hugging Face transformer models on AWS Lambda doesn’t end here. While we’ve covered the foundational strategies and techniques in this part, the next segment delves deeper into optimizing this setup and introducing innovative methodologies to enhance scalability and efficiency.

In Part 2, we’ll focus on:

  • Transitioning from a basic lambda-handler approach to a server-as-a-lambda setup using FastAPI.
  • The intricacies of route management with a proxy route strategy.
  • Unveiling the strengths and constraints of this approach in real-world applications.

Interested? Dive deeper and discover how to harness the full potential of serverless deployments for machine learning models.
