How to dockerize Python Pytesseract for AWS Lambda using Elastic Container Registry
In this article, I am going to explain the whole process of building a Docker image, deploying it to ECR, and finally running it in a serverless environment with AWS Lambda.
Table of Contents:
∘ The Data Engineer’s Problem Statement
∘ Why Docker?
∘ Dockerfile Contents
∘ Dockerfile does the following
∘ app.py Contents
∘ app.py does the following
∘ requirements.txt Contents
∘ requirements.txt file does the following
∘ Steps to create a docker image
∘ Build
∘ Run
∘ Test
∘ What is ECR
∘ Steps to deploy to ECR
∘ Steps to deploy to AWS Lambda
∘ Test using lambda
∘ About the Author
∘ Github
∘ LinkedIn
The Data Engineer’s Problem Statement:
Convert image or text files to PDF. Sounds quite simple? Let's dig deeper. The input files reside in S3, and the outputs must be written back to the same S3 folder where the inputs are. Sounds okay... hmm, let's see a few more details. Tesseract binaries are needed for the solution to run in serverless Lambda, and the package has to run on Amazon Linux, which means the Tesseract binaries must be compatible with Amazon Linux. To add to all these, Lambda has restrictions on deployment package size. And if you're on Windows or Mac, this is not going to be over soon.
Then I thought of one thing which, if it works locally, should work seamlessly in serverless too: Docker!
Why Docker?
Simply because Docker can build and share containerized apps, from desktop to the cloud. Period. This gave me enough confidence to go ahead, write some code, and structure it for Docker.
So, I created a structure like this below:
│ Dockerfile
│ requirements.txt
├───app
│ app.py
Dockerfile Contents:
FROM public.ecr.aws/lambda/python:3.8
COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum -y update
RUN yum -y install tesseract
COPY ./app/app.py ./
RUN aws configure set aws_access_key_id ""
RUN aws configure set aws_secret_access_key ""
RUN aws configure set region ""
RUN aws configure set output ""
CMD ["app.lambda_handler"]
Dockerfile does the following:
At build time, Docker executes each statement in the Dockerfile one by one.
- Pulls a public ECR base image of Amazon Linux from AWS Lambda that supports Python 3.8
- Copies the requirements file and installs the packages the Python program app.py needs to run
- Next, it installs Tesseract in the image
- Copies the app folder contents
- Sets up the AWS CLI environment so the code can work with S3 (the credential values are left blank here; fill in your own)
- The last line tells the image that app.lambda_handler is my entry point to the application whenever there is a request
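To make that last point concrete: "app.lambda_handler" means "module app, function lambda_handler". Here is a simplified sketch of how a runtime could resolve such a handler string; this is my own illustration, not the actual AWS Lambda runtime code.

```python
import importlib

def resolve_handler(spec):
    # "app.lambda_handler" -> import module "app", then fetch the
    # attribute "lambda_handler" from it (a simplified sketch of what
    # the Lambda Python runtime does with the handler string).
    module_name, func_name = spec.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)

# The same mechanism works for any importable module:
dumps = resolve_handler("json.dumps")
print(dumps({"status": "ok"}))  # prints {"status": "ok"}
```

So if the module or function name in CMD does not match your file and function, the container starts but every invocation fails with an import error.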
app.py Contents:
from fpdf import FPDF
import os
import pytesseract
import boto3
from PIL import Image


def download_dir(prefix, local, bucket, client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket': bucket,
        'Prefix': prefix,
    }
    # list_objects_v2 returns at most 1000 keys per call, so keep paging
    # until there is no NextContinuationToken
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents')
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)


def lambda_handler(event, context):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font('Arial', size=20)
    session = boto3.Session()
    s3_client = session.client('s3')
    bucket_name = 'filestorageexchange'
    s3_folder = 'main_folder/sub_folder'
    # /tmp is the only writable path inside a Lambda container
    lambda_write_path = '/tmp/'
    download_dir(prefix=s3_folder, local=lambda_write_path, bucket=bucket_name, client=s3_client)
    for item in os.listdir(main_path := os.path.abspath(os.path.join(lambda_write_path, s3_folder))):
        for folder in os.listdir(sub_path := os.path.join(main_path, item)):
            for file in os.listdir(sub_folder_path := os.path.join(sub_path, folder)):
                Converted = False
                file_path = os.path.join(sub_folder_path, file)
                print(f'\nProcessing file...{file_path}')
                pdf_file_name = file_path.replace(file_path.split('.')[1], 'pdf')
                s3_folder = 'main_folder' + '/' + 'sub_folder' + '/' + item + '/' + folder
                s3_object = pdf_file_name.split(os.sep)[-1]
                try:
                    if file_path.endswith('txt'):
                        with open(file_path) as txt:
                            pdf.cell(200, 10, txt=txt.read())
                        pdf.output(os.path.join(lambda_write_path, pdf_file_name))
                        Converted = True
                    if file_path.lower().endswith(('png', 'jpg', 'gif', 'tif')):
                        # tesseract can emit a searchable pdf directly
                        pdf_png = pytesseract.image_to_pdf_or_hocr(file_path, extension='pdf')
                        with open(os.path.join(lambda_write_path, pdf_file_name), 'w+b') as f:
                            f.write(pdf_png)
                        Converted = True
                    if file_path.endswith('pcd'):
                        # tesseract cannot read pcd, so convert to png with Pillow first
                        Image.open(file_path).save(temp_file := file_path.replace(file_path.split('.')[1], 'png'))
                        pdf_png = pytesseract.image_to_pdf_or_hocr(temp_file, extension='pdf')
                        with open(os.path.join(lambda_write_path, pdf_file_name), 'w+b') as f:
                            f.write(pdf_png)
                        os.remove(temp_file)
                        Converted = True
                except Exception as e:
                    print(e)
                if Converted:
                    print(f"Created - {os.path.join(lambda_write_path, pdf_file_name)}")
                    with open(os.path.join(lambda_write_path, pdf_file_name), 'rb') as data:
                        s3_client.upload_fileobj(data, bucket_name, s3_folder + '/' + s3_object)
                    print(f"Uploaded to - {s3_folder + '/' + s3_object}")
                else:
                    print(f"Not Created - {os.path.join(lambda_write_path, pdf_file_name)}")


if __name__ == "__main__":
    if os.name == 'nt':
        pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    else:
        pytesseract.pytesseract.tesseract_cmd = r'tesseract/4.1.1/bin/tesseract'
    lambda_handler(None, None)
Note: app.py is written and tested in Python 3.8
app.py does the following:
- Logs in to AWS using the access key id and secret access key
- Downloads the files from the S3 bucket
- Converts them using pytesseract, fpdf, and Pillow
- Writes the PDFs back to S3
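One caveat about the name handling in app.py: file_path.replace(file_path.split('.')[1], 'pdf') misbehaves when the extension string also appears elsewhere in the path (for example a folder named png_scans). A safer sketch using os.path.splitext; the helper name to_pdf_name is mine, not from the original code.

```python
import os

def to_pdf_name(file_path):
    # Swap the file's extension for .pdf. os.path.splitext only touches
    # the final extension, so a path like /tmp/png_scans/cat.png is
    # handled correctly, whereas str.replace would also hit "png_scans".
    root, _ext = os.path.splitext(file_path)
    return root + '.pdf'

print(to_pdf_name('/tmp/png_scans/cat.png'))  # prints /tmp/png_scans/cat.pdf
```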
requirements.txt Contents:
fpdf
pillow
pytesseract
awscli
boto3
requirements.txt file does the following:
- Lists all the above Python packages in one place
- Lets pip install all of them inside the image with a single command
Steps to create a docker image:
Prerequisite:
- Install docker and make sure it is running in the background
- Install docker extension in vs code (optional). This will let you easily browse the image and see logs and many more things.
Build:
docker build -t <image-name> .
or right-click on the Dockerfile in VS Code and click Build Image from Dockerfile
Once you build it, you'll see something like this in the VS Code extension. idealoctopotato is the name I gave it when building it 16 hours ago.
Or you can execute this in terminal
docker images
This will show you all the docker images in the system.
Run:
To run the image, enter the following command in the terminal and bind a port
docker run -p 9000:8080 <image-name>
Once you do that the terminal will show you this
docker run -p 9000:8080 idealoctopotato:latest
time="2021-07-18T15:54:14.975" level=info msg="exec '/var/runtime/bootstrap' (cwd=/var/task, handler=)"
So the application is listening on port 8080/tcp inside the container, and port 9000 on localhost is mapped to it.
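Before firing test requests, it can help to confirm that the published port is actually accepting connections. A small stdlib sketch; the helper name port_open is mine.

```python
import socket

def port_open(host, port, timeout=1.0):
    # Try a TCP connection to host:port; True means something is
    # listening there (e.g. the container published on port 9000).
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the container running: port_open("localhost", 9000) should be True.
```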
List the container images:
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9c19c488f2dc idealoctopotato:latest "/lambda-entrypoint.…" 16 hours ago Up 16 hours 0.0.0.0:9000->8080/tcp, :::9000->8080/tcp dreamy_chebyshev
The VS Code Docker extension helps you navigate into the container's file system and verify the files and contents.
Test:
Open another terminal and type
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d "{}"
This will start app.py processing inside the container.
You can switch back to the terminal where you started running the docker earlier and see the logs from the application while it runs.
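The same invocation can be scripted from Python with the standard library, which is handy for repeated tests. The endpoint path is the Lambda Runtime Interface Emulator URL from the curl command above; the helper name invoke_local is mine.

```python
import json
import urllib.request

RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"

def invoke_local(payload, url=RIE_URL):
    # POST a JSON payload to the Lambda Runtime Interface Emulator,
    # mirroring the curl command above, and return the raw response bytes.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# With the container running: invoke_local({}) triggers app.lambda_handler.
```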
What is ECR
From AWS docs —
Amazon Elastic Container Registry (ECR) is a fully managed container registry that makes it easy to store, manage, share, and deploy your container images
So the idea is to use ECR and host the docker image in the cloud which we just created locally.
Steps to deploy to ECR
1. Go to Amazon Elastic Container Registry
2. Create a private repository, as Lambda doesn't support public ones
3. Now use this command to log in and obtain privileges
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com
If everything is ok, it should give you this —
Login Succeeded
4. Next use this command to add a tag to the image you have created
docker tag <docker_image_id> ecr_uri
docker_image_id: run docker images in the terminal and copy the id of the image you built.
ecr_uri: copy the URI of the repository from the AWS console.
5. Run this to push the image to the repository in AWS.
docker push ecr_uri
Once you do that, you should see the layers being pushed in the terminal. When everything is pushed, it will show you:
latest: digest: sha256:2a0c7a019ddabaf92babxxxxxxxxxxxa1879e size: 3673
Now, you can refresh the AWS console and see the latest tag there with your image.
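The ecr_uri used in the tag and push commands above always follows the same pattern, so a tiny helper can assemble it. The names account_id, region, and repo are placeholders for your own values; this helper is my own illustration, not part of any AWS SDK.

```python
def ecr_image_uri(account_id, region, repo, tag="latest"):
    # Assemble an ECR image URI of the form
    # <account_id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"

print(ecr_image_uri("123456789012", "us-east-1", "python-aws-docker"))
# prints 123456789012.dkr.ecr.us-east-1.amazonaws.com/python-aws-docker:latest
```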
Steps to deploy to AWS Lambda
Now that your image is in ECR, open the Lambda console
- Select the Container image option on the AWS Lambda function creation page
- Put in the image URI and create the function
The URI should look like this:
aws_account_id.dkr.ecr.us-east-1.amazonaws.com/python-aws-docker:latest
Once done, the console will confirm the function was created from your container image.
Test using lambda
- Hit Test in the AWS Lambda console editor and you can see the same application logs that you just saw in Docker locally.
- Visit S3 and verify the pdfs by downloading them.
- Check the CloudWatch logs.
The final result —
That's it! You've just built a serverless AWS Lambda function with pytesseract, deployed as a Docker container image from Amazon Elastic Container Registry, that converts various image formats residing in S3 to PDF.
About the Author:
I am Kuharan Bhowmik and I hope you have enjoyed this article. The above work is still in progress and is currently hosted in a private repository on Github. I welcome any suggestions on it.
Let's make the internet a better place by contributing our bit of knowledge.