How to dockerize Python Pytesseract for AWS Lambda using Elastic Container Registry

Kuharan Bhowmik
Published in Analytics Vidhya
8 min read · Jul 26, 2021


In this article, I am going to explain the whole process of building a Docker image, deploying it to Amazon ECR, and finally running it with AWS Lambda in a serverless environment.

Table of Contents:

The Data Engineer’s Problem Statement
Why Docker?
Dockerfile Contents
Dockerfile does the following
app.py Contents
app.py does the following
requirements.txt Contents
requirements.txt does the following
Steps to create a docker image
Build
Run
Test
What is ECR?
Steps to deploy to ECR
Steps to deploy to AWS Lambda
Test using lambda
About the Author
Github
LinkedIn

The Data Engineer’s Problem Statement:

Convert image or text files to PDF. Sounds quite simple? Let's dig deeper. The input files are going to reside in S3, and the outputs are written back to the same S3 folder where the inputs are present. Sounds okay… hmm, let's see a few more details. Tesseract binaries are needed for the solution to run on serverless Lambda, and the package must run on Amazon Linux, which means the Tesseract binaries have to be compatible with Amazon Linux. On top of all that, Lambda restricts the deployment package size. And if you're developing on Windows or Mac, this is not going to be over soon.

Then I thought of one thing that, if it works locally, should work seamlessly in a serverless environment: Docker!

Why Docker?

Simply because Docker can build and share containerized apps, from the desktop to the cloud. Period. This gave me enough confidence to go ahead, write some code, and structure it for Docker.

So, I created a structure like this below:

│   Dockerfile
│   requirements.txt
│
├───app
│       app.py

Dockerfile Contents:

FROM public.ecr.aws/lambda/python:3.8
COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum -y update
RUN yum -y install tesseract
COPY ./app/app.py ./
RUN aws configure set aws_access_key_id ""
RUN aws configure set aws_secret_access_key ""
RUN aws configure set region ""
RUN aws configure set output ""
CMD ["app.lambda_handler"]

Dockerfile does the following:

At the time of building the image, Docker runs each statement in the Dockerfile one by one:

  1. Pulls the public ECR base image for AWS Lambda, which is built on Amazon Linux and supports Python 3.8
  2. Copies the requirements file and installs the packages the Python program app.py needs to run
  3. Installs Tesseract in the image via the EPEL repository (see the sanity-check sketch after this list)
  4. Copies the contents of the app folder
  5. Sets up the AWS CLI configuration so the code can work with S3
  6. The last line tells the image that app.lambda_handler is the entry point to the application whenever there is a request
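
Since the whole point of the image is to make the Tesseract binary available to pytesseract, it is worth sanity-checking that before wiring up the full application. A minimal sketch of my own (not part of the original app.py) that you could run inside the container:

import pytesseract

# If yum installed tesseract correctly, it is on PATH and pytesseract can find it.
try:
    print("Tesseract version:", pytesseract.get_tesseract_version())
except pytesseract.TesseractNotFoundError:
    print("tesseract is not on PATH inside this image")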

app.py Contents:

from fpdf import FPDF
import os
import pytesseract
import boto3
from PIL import Image


def download_dir(prefix, local, bucket, client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket': bucket,
        'Prefix': prefix,
    }
    # Page through the bucket listing and collect object keys and "directory" keys.
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents')
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    # Recreate the folder structure locally, then download every object.
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)


def lambda_handler(event, context):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font('Arial', size=20)
    session = boto3.Session()
    s3_client = session.client('s3')
    bucket_name = 'filestorageexchange'
    s3_folder = 'main_folder/sub_folder'
    lambda_write_path = '/tmp/'  # the only writable location in Lambda
    download_dir(prefix=s3_folder, local=lambda_write_path, bucket=bucket_name, client=s3_client)
    for item in os.listdir(main_path := os.path.abspath(os.path.join(lambda_write_path, s3_folder))):
        for folder in os.listdir(sub_path := os.path.join(main_path, item)):
            for file in os.listdir(sub_folder_path := os.path.join(sub_path, folder)):
                Converted = False
                file_path = os.path.join(sub_folder_path, file)
                print(f'\nProcessing file...{file_path}')
                pdf_file_name = file_path.replace(file_path.split('.')[1], 'pdf')
                s3_folder = 'main_folder' + '/' + 'sub_folder' + '/' + item + '/' + folder
                s3_object = pdf_file_name.split(os.sep)[-1]
                try:
                    if file_path.endswith('txt'):
                        # Text files are rendered onto a PDF page with fpdf.
                        pdf.cell(200, 10, txt="".join(open(file_path)))
                        pdf.output(os.path.join(lambda_write_path, pdf_file_name))
                        Converted = True
                    if file_path.lower().endswith(('png', 'jpg', 'gif', 'tif')):
                        # Images go through tesseract, which returns the PDF as bytes.
                        pdf_png = pytesseract.image_to_pdf_or_hocr(file_path, extension='pdf')
                        with open(os.path.join(lambda_write_path, pdf_file_name), 'w+b') as f:
                            f.write(pdf_png)
                        Converted = True
                    if file_path.endswith('pcd'):
                        # PCD images are first converted to PNG with Pillow, then OCR'd.
                        Image.open(file_path).save(temp_file := file_path.replace(file_path.split('.')[1], 'png'))
                        pdf_png = pytesseract.image_to_pdf_or_hocr(temp_file, extension='pdf')
                        with open(os.path.join(lambda_write_path, pdf_file_name), 'w+b') as f:
                            f.write(pdf_png)
                        os.remove(temp_file)
                        Converted = True
                except Exception as e:
                    print(e)
                if Converted:
                    print(f"Created - {os.path.join(lambda_write_path, pdf_file_name)}")
                    with open(os.path.join(lambda_write_path, pdf_file_name), 'rb') as data:
                        s3_client.upload_fileobj(data, bucket_name, s3_folder + '/' + s3_object)
                    print(f"Uploaded to - {s3_folder + '/' + s3_object}")
                else:
                    print(f"Not Created - {os.path.join(lambda_write_path, pdf_file_name)}")


if __name__ == "__main__":
    # Local runs: point pytesseract at the locally installed tesseract binary.
    if os.name == 'nt':
        pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    else:
        pytesseract.pytesseract.tesseract_cmd = r'tesseract/4.1.1/bin/tesseract'

    lambda_handler(None, None)

Note: app.py is written and tested with Python 3.8.

app.py does the following:

  1. Logs in to AWS using the access key ID and secret access key configured in the image
  2. Downloads the files from the S3 bucket
  3. Converts them using pytesseract, fpdf, and Pillow (a minimal, standalone conversion sketch follows this list)
  4. Writes the resulting PDFs back to S3
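
If you want to exercise the core conversion in isolation before involving S3, here is a minimal sketch; sample.png and out.pdf are hypothetical local files, not paths from the project:

import pytesseract

# Run OCR on a local image and get the searchable PDF back as bytes.
pdf_bytes = pytesseract.image_to_pdf_or_hocr('sample.png', extension='pdf')
with open('out.pdf', 'w+b') as f:
    f.write(pdf_bytes)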

requirements.txt Contents:

fpdf
pillow
pytesseract
awscli
boto3

requirements.txt does the following:

  1. Lists all the Python packages the application needs.
  2. Lets pip install all of them during the image build (via the RUN pip install line in the Dockerfile).

Steps to create a docker image:

Prerequisite:

  1. Install Docker and make sure it is running in the background
  2. Install the Docker extension in VS Code (optional). This will let you easily browse the image, see logs, and much more.

Build:

docker build -t <image-name> .

or right-click on the Dockerfile in VS Code and choose to build an image from the Dockerfile

Once you build it, you'll see something like this in the VS Code extension. idealoctopotato is the name I gave the image when I built it 16 hours ago.

Or you can execute this in the terminal

docker images

This will show you all the docker images in the system.
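
If you prefer to check this from Python instead of the terminal, the Docker SDK for Python (pip install docker) can list images too; this is just an optional sketch and assumes the Docker daemon is running:

import docker  # pip install docker

client = docker.from_env()
for image in client.images.list():
    print(image.short_id, image.tags)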

Run:

To run the image, enter the following command in the terminal and map a port

docker run -p 9000:8080 <image-name>

Once you do that, the terminal will show you this

docker run -p 9000:8080 idealoctopotato:latest                            
time="2021-07-18T15:54:14.975" level=info msg="exec '/var/runtime/bootstrap' (cwd=/var/task, handler=)"

So the Lambda runtime inside the container is listening on port 8080/tcp, and it is exposed on port 9000 of localhost.

List the running containers:

docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9c19c488f2dc idealoctopotato:latest "/lambda-entrypoint.…" 16 hours ago Up 16 hours 0.0.0.0:9000->8080/tcp, :::9000->8080/tcp dreamy_chebyshev

The VS Code Docker extension helps you navigate into the container's filesystem and verify the files and contents.

Test:

Open another terminal and type

curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d "{}"

This will start app.py processing inside the container.
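
The same invocation can also be done from Python with the requests library (assuming you have it installed locally), which is handy if you later want to pass a structured test event instead of an empty one:

import requests  # pip install requests

# Post an empty test event to the local Lambda runtime, same as the curl call above.
url = "http://localhost:9000/2015-03-31/functions/function/invocations"
response = requests.post(url, json={})
print(response.status_code, response.text)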

You can switch back to the terminal where you started the container earlier and watch the application logs while it runs.

What is ECR?

From AWS docs —

Amazon Elastic Container Registry (ECR) is a fully managed container registry that makes it easy to store, manage, share, and deploy your container images

So the idea is to take the Docker image we just built locally and host it in the cloud with ECR.

Steps to deploy to ECR

  1. Go to Amazon Elastic Container Registry
  2. Create a private repository, as Lambda doesn't support public ones:

3. Now use this command to log in and obtain privileges

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com

If everything is ok, it should give you this —

Login Succeeded
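
If you would rather not pipe the password through the shell, the same login can be scripted with boto3 and the Docker SDK; a hedged sketch, assuming both libraries are installed and your AWS credentials are configured locally:

import base64
import boto3
import docker

ecr = boto3.client('ecr', region_name='us-east-1')
auth = ecr.get_authorization_token()['authorizationData'][0]

# The token is base64-encoded "AWS:<password>".
username, password = base64.b64decode(auth['authorizationToken']).decode().split(':')

client = docker.from_env()
print(client.login(username=username, password=password, registry=auth['proxyEndpoint']))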

4. Next use this command to add a tag to the image you have created

docker tag <docker_image_id> <ecr_uri>

<docker_image_id> — obtained by running docker images in the terminal and copying the ID of the image.

<ecr_uri> — copy the URI of the repository from the ECR console.

5. Run this to push the image to the repository in AWS.

docker push <ecr_uri>

Once you do that, you should see something like this

When everything is pushed, it will show you —

latest: digest: sha256:2a0c7a019ddabaf92babxxxxxxxxxxxa1879e size: 3673

Now, you can refresh the AWS console and see the latest tag there with your image.
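
You can also confirm the push from Python with boto3; python-aws-docker is the repository name that appears in the image URI later in this article, so adjust it to whatever you named yours:

import boto3

ecr = boto3.client('ecr', region_name='us-east-1')
for detail in ecr.describe_images(repositoryName='python-aws-docker')['imageDetails']:
    print(detail.get('imageTags'), detail['imageDigest'])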

Steps to deploy to AWS Lambda

Now that your image is in ECR, open the Lambda console

  1. Select the Container image option on the AWS Lambda create-function page.
  2. Enter the image URI and create the function.

This should look like

aws_account_id.dkr.ecr.us-east-1.amazonaws.com/python-aws-docker:latest

Once done it will show you this —
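
The console is the simplest route, but for completeness, the same step can be scripted with boto3. This is only a sketch of mine: the function name is a placeholder, and you must supply an execution role ARN with access to your S3 bucket:

import boto3

lambda_client = boto3.client('lambda', region_name='us-east-1')

response = lambda_client.create_function(
    FunctionName='pytesseract-pdf-converter',  # placeholder name, pick your own
    PackageType='Image',
    Code={'ImageUri': '<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/python-aws-docker:latest'},
    Role='arn:aws:iam::<aws_account_id>:role/<lambda-execution-role>',  # your execution role
    Timeout=300,       # give the OCR work time to finish
    MemorySize=1024,
)
print(response['FunctionArn'])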

Test using lambda

  1. Hit Test in the AWS Lambda console and you can see the same application logs that you just saw locally in Docker (a scripted alternative is sketched after this list).
  2. Visit S3 and verify the PDFs by downloading them.
  3. Check the CloudWatch logs.
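
If you would rather drive the test from a script than the console, here is a hedged boto3 sketch, reusing the placeholder function name from the previous snippet and the bucket and prefix from app.py:

import boto3

# Invoke the function with an empty event, then list the PDFs it wrote back to S3.
lambda_client = boto3.client('lambda', region_name='us-east-1')
result = lambda_client.invoke(FunctionName='pytesseract-pdf-converter', Payload=b'{}')
print("Invoke status:", result['StatusCode'])

s3 = boto3.client('s3')
listing = s3.list_objects_v2(Bucket='filestorageexchange', Prefix='main_folder/sub_folder')
for obj in listing.get('Contents', []):
    if obj['Key'].endswith('.pdf'):
        print(obj['Key'])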

The final result —

That's it! You’ve just built a serverless Amazon Lambda Function with pytesseract using Docker container image from Amazon Elastic Container Registry which converts various image formats to pdf that reside in S3.

About the Author:

I am Kuharan Bhowmik, and I hope you have enjoyed this article. The work above is still in progress and currently lives in a private GitHub repository. I welcome any suggestions.

Let's make the internet a better place by contributing our bit of knowledge.

Github —

LinkedIn —
