Build a serverless Tesseract OCR API using AWS Lambda and Docker
Motivation
This tutorial shows how to create a highly scalable, low-cost Tesseract 4 API service using Docker and Python libraries. It is a great starting point for anyone who wants to set up an affordable demo service (a REST API used inside a web app) or move directly to a highly scalable production service.
I’ll go through how to create a REST API that takes an encoded image (and some Tesseract parameters) as input and returns the extracted text as the response.
Updates :
- Improved python layers installer
- New simple webapp demo hosted in github.io (link below)
Note 1 : for cost reasons, the demo API is limited in the number of requests it accepts
Note 2 : we use Python 3.7 in this tutorial. If you use a different version, change it in the Dockerfiles.
1. Basic requirements
We need an AWS account to access the console services (we will use AWS Lambda & API Gateway).
Required tools :
- Docker (for building your lambda layers)
- Git
- Postman for testing your API (optional)
Note : tested on a Linux PC. For Windows or macOS, a few changes may be needed in the bash scripts.
2. Setup
2.1. Create lambda layers
Before starting, a small point to know about lambdas : for a Python lambda, it is important to install only what is needed; this improves the performance of your service and decreases the cost.
Check lambda limitations here.
Let’s start, Ready…Go !
Clone the repo
If you want to build all the layers at once, run the command below, then go to 2.2. for the next steps.
cd lambda-tesseract-api/; sudo bash build_all.sh
Otherwise check the details below.
2.1.1. Build Tesseract lambda Layer
Great work has been done by bweigel on building the lambda layer for Tesseract 4.0.0. I added the tessconfigs folder and .traineddata files to the Dockerfile. These are important if you want to use some advanced Tesseract functions (e.g. detecting text boxes).
In the Dockerfile-tess4 :
- Edit line 15 if you want to change the second language corpus (fra by default).
- Edit line 16 to specify which data file to use (standard by default). The better the corpus, the better (but slower) the OCR results will be.
To build the layer type this command (it will take some time):
sudo bash buildtesseract/build_tesseract4.sh
2.1.2. Build Python libraries lambda Layers
We will use Python 3.7 and install these libraries inside a lambda layer : Pytesseract / Pillow (included) / OpenCV (optional).
Pytesseract is a Python wrapper used to call the Tesseract engine. Pillow & OpenCV can be used for image loading, processing and saving.
Run this command to build the Python layer :
sudo bash buildpy/build_py37_pkgs.sh
Some *.zip files will now appear in your folders. Your lambda layers are ready to be uploaded to the AWS console.
2.2. Upload layers in AWS lambda console
Easy step : just go to the Lambda > Layers tab and upload the zip packages one by one. An example of uploading Pytesseract is shown in the picture below.
Note : you need to choose Python 3.7 under Compatible Runtimes for all the layers.
2.3. Create lambda function
Go to the Functions tab in Lambda and create a new function; check the image below for details.
2.4. Setup the lambda function
In the Designer panel, click Layers > Add layers, then add all the layers that have been uploaded and click Save (check the image below).
Now, in the Designer panel again, click on the lambda name tesseract-demo. Just under it, you can access the Function code; copy/paste the code below inside the lambda_handler function.
The function is straightforward : it extracts the fields from the JSON body, decodes and saves the image, applies the OCR and returns the result.
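The actual handler is embedded as a gist, but as a rough sketch it could look like the code below. Note that the field names image64, lang and config are my own assumptions for illustration, not necessarily the keys used in the repo; adapt them to your JSON schema.

```python
import base64
import json


def lambda_handler(event, context):
    # API Gateway (without proxy integration) passes the JSON body directly
    body = event if isinstance(event, dict) else json.loads(event)

    # Hypothetical field names -- adapt them to your own JSON schema
    image_b64 = body["image64"]
    lang = body.get("lang", "eng")
    config = body.get("config", "")

    # Decode the image into /tmp, the only writable path inside Lambda
    image_path = "/tmp/image.png"
    with open(image_path, "wb") as f:
        f.write(base64.b64decode(image_b64))

    # Imported lazily so a missing layer fails with a clear error at runtime
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open(image_path),
                                       lang=lang, config=config)
    return {"statusCode": 200, "body": json.dumps({"text": text})}
```

The Pillow and Pytesseract imports live inside the handler so that they resolve against the layers you attached in the previous step.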
In Basic settings, increase the Memory and Timeout. For Tesseract, the more memory, the faster the API response will be (this needs to be tested). To be safe, you can set 1 min (for images with a lot of text) and 500+ MB (see image below for the setup).
Click Save. Time to test our lambda and check that everything works fine !
2.5. Test the lambda function
The image I’ll be testing (decoded version below) is put inside a .json file along with some Tesseract parameters.
The JSON body contains the encoded image and the Tesseract parameters. Check this link if you want to know more about Tesseract params.
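For illustration, a body of that shape can be built in Python like this (the key names image64, lang and config are my own placeholders; the actual demo may use different ones):

```python
import base64
import json


def build_body(image_path, lang="eng", config=""):
    """Encode an image file into a JSON body for the OCR lambda."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    return json.dumps({"image64": image_b64, "lang": lang, "config": config})
```

Paste the resulting string into the lambda test event (or a .json file) as-is.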
Now create a test event, copy/paste the JSON body above and save; you get something like this :
After saving, test your lambda; the response should be successful.
2.6. Create API
Create a Rest API as shown below
Actions > create Method
Specify the corresponding lambda function. There is no need to toggle proxy integration.
After creating the POST method, enable CORS by clicking : Actions > Enable CORS
2.7. Test API
For console testing : click on the POST method then click the TEST button (shown below)
Add the json (same used for lambda testing) inside the Request Body and click Test. You should get a 200 Status response.
If you get a similar response as above, it is time to deploy your API :
Click Actions > Deploy API, then copy the Invoke URL. You can also get the API URL from your lambda function console (in the API block).
Open Postman, create a new POST request, paste the URL, copy/paste the JSON in the Request Body and click Send. You should get a 200 status response.
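If you prefer scripting the call instead of Postman, the same POST can be made from Python with the standard library. The URL below is a placeholder for your own Invoke URL, and the image64/lang keys are the same illustrative field names as before:

```python
import base64
import json
from urllib import request


def build_ocr_request(api_url, image_path, lang="eng"):
    """Build a POST request carrying the base64-encoded image."""
    with open(image_path, "rb") as f:
        payload = {"image64": base64.b64encode(f.read()).decode(), "lang": lang}
    return request.Request(api_url,
                           data=json.dumps(payload).encode(),
                           headers={"Content-Type": "application/json"},
                           method="POST")


# Example (replace with your own Invoke URL):
# req = build_ocr_request("https://xxxx.execute-api.eu-west-1.amazonaws.com/prod",
#                         "sample.png")
# print(request.urlopen(req).read().decode())
```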
2.8. Bonus : Use serverless Api within a Web app
A simple serverless web app demo here shows how everything works (code in the repo).
Done!
I’m not sure this is the cleanest way to set everything up, but it does the job ! You should look at other frameworks like Serverless, which may be a great option if you want to automate the installation and make it more modular. You are also free to tweak the scripts and Dockerfiles to your needs. If we can do it with Tesseract, you should be able to do it with almost any type of task that transforms images (e.g. computer vision and/or deep learning). Best to come!
If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment below.
References :
[OCR Layer] https://github.com/bweigel/aws-lambda-tesseract-layer