Deploying AWS Lambda Functions with a Docker Container Using a Custom Base Image

Atahan Bulus
8 min read · Sep 15, 2023


This story is about deploying an AWS Lambda function as a Docker image built from a custom base image and hosted in Amazon ECR.

Problem definition: what do you do when the Python library you need cannot be used with the AWS Lambda base image (public.ecr.aws/lambda/python:ver.x)?

Introduction: We use AWS Lambda layers for Python libraries smaller than 250 MB (the unzipped deployment-package limit). Once we need libraries larger than that, we have to deploy them with a Docker container image instead. In the following example I run the fastdup library in a Lambda function.

If you are new to AWS Lambda, I suggest first watching “AWS Lambda function creation and ECR tutorial.”

What does our function do: The function runs whenever it is triggered by the S3 bucket. It looks at {S3_bucket/image_folder}, reads the images, and runs fastdup to create similarity.csv. Once similarity.csv exists, we read it with pandas so that we can remove duplicate images.
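
As a minimal sketch of that last post-processing step, here is how the fastdup output can be inspected with pandas. It uses the same column names (from, to, distance) and the 0.975 threshold that appear in the handler later in this post; the file path is a placeholder.

import pandas as pd

# similarity.csv is written by fastdup; each row is a pair of images and their similarity score.
df = pd.read_csv("similarity.csv")

# Keep only pairs more similar than the threshold; the "to" column holds the
# candidates we will treat as duplicates, exactly as the handler below does.
near_duplicates = df[df["distance"] > 0.975]
print(near_duplicates[["from", "to", "distance"]].head())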

1. Create your Dockerfile

At this point we can refer to official guide: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html#python-image-clients

  • Here is the Dockerfile:
# Define custom function directory
ARG FUNCTION_DIR="/function"

# Stage 1: Build the function code and dependencies
FROM python:3.9 as build-image

# Include global arg in this stage of the build
ARG FUNCTION_DIR

# Copy function code
RUN mkdir -p ${FUNCTION_DIR}
COPY . ${FUNCTION_DIR}

# Install the function's dependencies
RUN pip install \
    --target ${FUNCTION_DIR} \
    awslambdaric

# Stage 2: Create the final runtime image
FROM python:3.9-slim

# Include global arg in this stage of the build
ARG FUNCTION_DIR

# Set working directory to function root directory
WORKDIR ${FUNCTION_DIR}

# Copy in the built dependencies from the previous stage
COPY --from=build-image ${FUNCTION_DIR} ${FUNCTION_DIR}

# Install the function's Python dependencies from requirements.txt in the runtime stage
RUN pip install -r ${FUNCTION_DIR}/requirements.txt

# Set runtime interface client as default command for the container runtime
ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ]

# Pass the name of the function handler as an argument to the runtime
CMD [ "lambda_function.handler" ]

2. Create your lambda_function.py (the module name must match the "lambda_function.handler" value in the Dockerfile's CMD)

The following script is the Lambda handler: once triggered, it creates an output folder in the triggering bucket, reads the image folder from the source S3 bucket, runs fastdup, and then removes the duplicate images from that folder.

import fastdup
import boto3
import os
import pandas as pd


os.environ['AWS_EXECUTABLE_PATH'] = '/function/aws'
SOURCE_BUCKET = 'bucket-2'  # Where images are stored
SIMILARITY_FILE = 'similarity.csv'
threshold = 0.975


def file_upload(s3, file, bucket, key):
    """Uploads a file to a bucket with the desired key."""
    print(f'Uploading {file} to {bucket}/{key}')
    try:
        s3.meta.client.upload_file(file, bucket, key)
    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 503,
            'body': f'Error: {str(e)}'
        }


def create_folder(s3, bucket, folder_name):
    """Creates a folder in a bucket with the desired name."""
    try:
        # Create a folder in the destination S3 bucket
        s3.Object(bucket, folder_name).put()
        print(f"{bucket}/{folder_name} created successfully")
        return folder_name
    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 503,
            'body': f'Error: {str(e)}'
        }


def run_fastdup(input_dir, output_dir):
    """Runs fastdup on the images in the input directory and writes the results to the output directory."""
    try:
        print("ver:", fastdup.__version__)
        fastdup.run(input_dir=input_dir, work_dir=output_dir)
    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 503,
            'body': f'Error: {str(e)}'
        }


def read_csv(file):
    """Reads a CSV file into a Pandas DataFrame."""
    try:
        df = pd.read_csv(file)
        return df
    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 503,
            'body': f'Error: {str(e)}'
        }


def get_duplicate_images(df):
    """Returns lists of clean and duplicate images."""
    try:
        duplicates_df = df

        clean_images = []
        duplicate_images = []

        for index, row in duplicates_df.iterrows():
            if row['from'] not in clean_images:
                if row['from'] not in duplicate_images:
                    clean_images.append(row['from'])

            if row['to'] not in clean_images:
                if row['to'] not in duplicate_images:
                    duplicate_images.append(row['to'])

        return clean_images, duplicate_images
    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 503,
            'body': f'Error: {str(e)}'
        }


def get_path(path):
    """Returns the relative path of a file."""
    try:
        base_dir = '/tmp/tmp/'
        relative_path = os.path.relpath(path, base_dir)
        return relative_path
    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 503,
            'body': f'Error: {str(e)}'
        }


def handler(event, context):
    """
    AWS Lambda function handler.
    This function is triggered when a <text>.duplicate file is uploaded to the source bucket.
    Runs fastdup on the images in the input directory and writes the results to the output directory, which is /tmp.
    Uploads the similarity.csv file to the destination bucket.
    Deletes the duplicate images from the source bucket.
    """
    s3 = boto3.resource('s3')
    records = event['Records']
    bucket = records[0]['s3']['bucket']['name']
    key = records[0]['s3']['object']['key']

    source_bucket = SOURCE_BUCKET
    target_name, _ = os.path.splitext(os.path.basename(key))

    print(f'Triggered bucket name:{bucket}, Source bucket name:{source_bucket}')
    print("Key:", key)

    folder_name = f'fastdup-output/{target_name}/'
    folder_key = create_folder(s3, bucket=bucket, folder_name=folder_name)

    print(f"running fastdup on {target_name}")
    run_fastdup(input_dir=f's3://{source_bucket}/{target_name}', output_dir='/tmp')  # Use /tmp directly

    print(f"uploading results to {bucket}")
    for file in os.listdir('/tmp'):
        if file == SIMILARITY_FILE:
            csv_file_path = os.path.join('/tmp', file)
            df = read_csv(csv_file_path)
            similarity_df = df[df['distance'] > threshold]
            file_upload(s3, file=csv_file_path, bucket=bucket, key=f'{folder_key}{file}')

            # Moving operations for csv
            _, duplicate = get_duplicate_images(similarity_df)

            for line in duplicate:
                source_key = get_path(line)
                s3.Bucket(source_bucket).Object(source_key).delete()
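
Before wiring up the S3 trigger, you can smoke-test the handler locally with a hand-built event that contains only the fields the handler reads. This is just a sketch: it assumes AWS credentials are configured locally, that the buckets above exist, and that fastdup is installed on your machine; local_test.py is a hypothetical helper, not part of the deployment.

# local_test.py - hypothetical local smoke test for the handler
from lambda_function import handler

# Minimal S3 "ObjectCreated" event: the handler only reads the bucket name and object key.
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "bucket-1"},
                "object": {"key": "trigger-files/techtalk.duplicate"},
            }
        }
    ]
}

# The handler never touches the context argument, so None is fine here.
handler(event, None)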

3. Create requirements.txt

boto3
fastdup
awscli
pandas
  • Now the required files are ready, and you can build the Docker container image.

To create docker container image follow these steps:

  1. I assume you are familiar with the AWS console and are already logged in. From the console, go to Amazon Elastic Container Registry (https://us-east-2.console.aws.amazon.com/ecr/).
  • Create a repository. I named mine fastdup-test. Make sure your role has the permissions to create one.
  • After creating the repository, open the newly created repo and click View push commands.
  • Paste the push commands into your terminal one by one. Once the push to ECR succeeds, you will be able to see the image in the repository.
  • After confirming your repository and image, create a new Lambda function, choose Container image, click Browse images, and select your newly created image (a boto3 sketch of these steps follows this list).
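
If you prefer doing the same thing from code instead of the console, here is a rough boto3 sketch. The role ARN, account ID, region, and image URI are placeholders for this demo, and it assumes the image has already been built and pushed with the "View push commands" steps.

import boto3

ecr = boto3.client("ecr")
lambda_client = boto3.client("lambda")

# Same as clicking "Create repository" in the ECR console.
ecr.create_repository(repositoryName="fastdup-test")

# Create the Lambda function from the pushed container image.
lambda_client.create_function(
    FunctionName="fastdup-demo",
    Role="arn:aws:iam::123456789012:role/fastdup-lambda-role",  # placeholder role ARN
    PackageType="Image",
    Code={"ImageUri": "123456789012.dkr.ecr.us-east-2.amazonaws.com/fastdup-test:latest"},  # placeholder image URI
)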

Now you have successfully created your lambda function.

Now let's add a trigger to start our demo.

  1. Go to your fastdup-demo Lambda function and click “+ Add trigger”.

2. Here you can select whatever trigger you like. I simply picked an S3 bucket as the trigger and chose the “PUT” event. I also added “<a_file_name>.<suffix>”; the suffix is our trigger condition. That means once we upload <file_name>.duplicate, our Lambda function is triggered and starts working. <file_name> is used to find the image source folder for fastdup, as written in lambda_function.py (a boto3 sketch of the same trigger setup follows).
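
For reference, the same trigger can be configured with boto3. This is only a sketch with placeholder ARNs and the bucket/function names used in this demo; the console performs both of these steps for you when you add the trigger.

import boto3

lambda_client = boto3.client("lambda")
s3_client = boto3.client("s3")

# Allow the trigger bucket to invoke the function.
lambda_client.add_permission(
    FunctionName="fastdup-demo",
    StatementId="s3-invoke-fastdup-demo",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::bucket-1",
)

# Fire the function on PUTs of objects whose key ends with ".duplicate".
s3_client.put_bucket_notification_configuration(
    Bucket="bucket-1",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-2:123456789012:function:fastdup-demo",  # placeholder ARN
                "Events": ["s3:ObjectCreated:Put"],
                "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".duplicate"}]}},
            }
        ]
    },
)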

3. Now it’s time to edit our function’s configuration settings:

Timeout, Memory, and Ephemeral storage are set to their maximum values here so fastdup has room to work. Ephemeral storage is simply the /tmp/ folder, the only writable location on AWS Lambda where we can put the fastdup outputs.

You can adjust those settings as you wish.
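
If you would rather script these settings, here is a boto3 sketch. The function name is this demo's placeholder; the memory size mirrors the REPORT line in the log below, with the timeout and /tmp size turned up.

import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="fastdup-demo",
    Timeout=900,                       # Lambda's maximum timeout, in seconds
    MemorySize=3008,                   # MB, matches the REPORT line in the log below
    EphemeralStorage={"Size": 10240},  # /tmp size in MB (512-10240)
)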

Now let's execute our Lambda function.

Once you upload the triggering file (mine is “project_name.duplicate”) to the S3 bucket we configured as the trigger in the previous step, the code starts.
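
Uploading that file with boto3 looks like this. The bucket and key prefix mirror the "trigger-files/techtalk.duplicate" key visible in the log below; adjust them to your own setup.

import boto3

s3 = boto3.client("s3")

# The upload matches the PUT event and the ".duplicate" suffix filter, so it invokes the function.
s3.upload_file("project_name.duplicate", "bucket-1", "trigger-files/project_name.duplicate")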

You can watch the logs in CloudWatch:


2023-09-15T15:50:11.289+03:00 /usr/bin/dpkg

2023-09-15T15:50:11.606+03:00 START RequestId: e471c690-89fd-40b5-ab02-37b05b3c765f Version: $LATEST

2023-09-15T15:50:11.827+03:00 Triggered bucket name:bucket-1, Source bucket name:bucket-2, Destination bucket name:bucket-2

2023-09-15T15:50:11.827+03:00 Key: trigger-files/techtalk.duplicate

2023-09-15T15:50:11.905+03:00 bucket-1/fastdup-output/techtalk/ created successfully

2023-09-15T15:50:11.905+03:00 running fastdup on techtalk

2023-09-15T15:50:11.905+03:00 ver: 1.38

2023-09-15T15:50:11.905+03:00 FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.

2023-09-15T15:50:12.180+03:00 2023-09-15 12:50:12 [INFO] Going to loop over dir s3://bucket-2/techtalk

2023-09-15T15:50:12.844+03:00 2023-09-15 12:50:12 [INFO] Found total 6 images to run on, 6 train, 0 test, name list 6, counter 6

2023-09-15T15:50:15.676+03:00 [■■■■■■■■■ ] 17% Estimated: 0 Minutes [■■■■■■■■■■■■■■■■■ ] 34% Estimated: 0 Minutes [■■■■■■■■■■■■■■■■■■■■■■■■■■ ] 50% Estimated: 0 Minutes [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ ] 67% Estimated: 0 Minutes [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ ] 84% Estimated: 0 Minutes [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 100% Estimated: 0 Minutes 2023-09-15 12:50:15 [INFO] Found total 6 images to run on

2023-09-15T15:50:15.680+03:00 Finished histogram 0.012

2023-09-15T15:50:15.680+03:00 Finished bucket sort 0.016

2023-09-15T15:50:15.680+03:00 2023-09-15 12:50:15 [INFO] 3) Finished write_index() NN model

2023-09-15T15:50:15.680+03:00 2023-09-15 12:50:15 [INFO] Stored nn model index file /tmp/nnf.index

2023-09-15T15:50:15.680+03:00 2023-09-15 12:50:15 [INFO] Total time took 2836 ms

2023-09-15T15:50:15.680+03:00 2023-09-15 12:50:15 [INFO] Found a total of 6 fully identical images (d>0.990), which are 50.00 % of total graph edges

2023-09-15T15:50:15.680+03:00 2023-09-15 12:50:15 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 % of total graph edges

2023-09-15T15:50:15.680+03:00 2023-09-15 12:50:15 [INFO] Found a total of 6 above threshold images (d>0.900), which are 50.00 % of total graph edges

2023-09-15T15:50:15.680+03:00 2023-09-15 12:50:15 [INFO] Found a total of 1 outlier images (d<0.050), which are 8.33 % of total graph edges

2023-09-15T15:50:15.680+03:00 2023-09-15 12:50:15 [INFO] Min distance found 0.726 max distance 1.000

2023-09-15T15:50:15.680+03:00 2023-09-15 12:50:15 [INFO] Running connected components for ccthreshold 0.960000

2023-09-15T15:50:15.959+03:00 .0uploading results to bucket-1

2023-09-15T15:50:15.966+03:00 Uploading /tmp/similarity.csv to bucket-1/fastdup-output/techtalk/similarity.csv

2023-09-15T15:50:16.169+03:00 END RequestId: e471c690-89fd-40b5-ab02-37b05b3c765f

2023-09-15T15:50:16.169+03:00 REPORT RequestId: e471c690-89fd-40b5-ab02-37b05b3c765f Duration: 4561.94 ms Billed Duration: 11936 ms Memory Size: 3008 MB Max Memory Used: 345 MB Init Duration: 7373.

From the log we can see the details of the run. I hope this writeup is helpful to anyone who needs to use libraries that cannot fit into the AWS Lambda base image.
