Build a serverless Tesseract OCR API using AWS Lambda and Docker
Motivation
This tutorial shows how to create a highly scalable, low-cost Tesseract 4 API service using Docker and Python libraries. It is a great starting point for anyone who wants to set up an affordable demo service (a REST API used inside a web app) or move directly to a highly scalable production service.
I’ll go through how to create a REST API that takes an encoded image (and some Tesseract parameters) as input and returns the extracted text as the response.
Updates :
- Improved python layers installer
- New simple webapp demo hosted in github.io (link below)
Note 1 : for cost reasons, the demo API is limited in the number of requests it accepts
Note 2 : we use Python 3.7 in this tutorial. If you use a different version, change it in the Dockerfiles.
1. Basic requirements
We need an AWS account to access the console services (we will use AWS Lambda & API Gateway).
Required tools :
- Docker (for building your lambda layers)
- Git
- Postman for testing your API (optional)
Note : tested on a Linux PC. For Windows or macOS, a few changes may be needed in the bash scripts.
2. Setup
2.1. Create lambda layers
Before starting, a small point to know about lambdas : for a Python lambda, it is important to install only what is needed; this improves the performance of your service and decreases the cost.
Check lambda limitations here.
Let’s start, Ready…Go !
Clone the repo
If you want to build all the layers at once, run the command below, then go to 2.2. for the next steps.
cd lambda-tesseract-api/; sudo bash build_all.sh
Otherwise check the details below.
2.1.1. Build Tesseract lambda Layer
Great work has been done by bweigel on building the lambda layer for Tesseract 4.0.0. I added the tessconfigs folder and .traineddata files to the Dockerfile. These are important if you want to use some advanced Tesseract functions (e.g. detecting text boxes).
In the Dockerfile-tess4 :
- Edit line 15 if you want to change the second language corpus (fra by default).
- Edit line 16 to specify which data file to use (standard by default). The better the corpus, the better (but slower) the OCR results will be.
To build the layer type this command (it will take some time):
sudo bash buildtesseract/build_tesseract4.sh
2.1.2. Build Python libraries lambda Layers
We will use Python 3.7 and install these libraries inside a lambda layer : Pytesseract / Pillow (included) / OpenCV (optional).
Pytesseract is a Python wrapper used to call the Tesseract engine. Pillow & OpenCV can be used for image loading, processing and saving.
Run this command to build the Python layer :
sudo bash buildpy/build_py37_pkgs.sh
Some *.zip files will now appear in your folders. Your lambda layers are ready to be uploaded to the AWS console.
2.2. Upload layers in AWS lambda console
Easy step : just go to the Lambda > Layers tab and upload the zip packages one by one. An example of uploading Pytesseract is shown in the picture below.
Note : you need to choose Python 3.7 under Compatible Runtimes for all the layers.
2.3. Create lambda function
Go to the Functions tab in Lambda and create a new function; check the image below for details.
2.4. Setup the lambda function
In the Designer panel, click Layers > Add layers, then add all the layers that have been uploaded and click Save (check the image below).
Now, in the Designer panel again, click on the lambda name tesseract-demo. Just under it, you can access the Function code; copy/paste the code below inside the lambda_handler function.
The function is straightforward : it extracts the fields from the JSON body, decodes and saves the image, applies the OCR and returns the result.
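The actual handler is embedded as a gist, but as a rough sketch it could look like the code below. Note that the field names image64, lang and config are my own assumptions for illustration, not necessarily the keys used in the repo; adapt them to your JSON schema.

```python
import base64
import json


def lambda_handler(event, context):
    # API Gateway (without proxy integration) passes the JSON body directly
    body = event if isinstance(event, dict) else json.loads(event)

    # Hypothetical field names -- adapt them to your own JSON schema
    image_b64 = body["image64"]
    lang = body.get("lang", "eng")
    config = body.get("config", "")

    # Decode the image into /tmp, the only writable path inside Lambda
    image_path = "/tmp/image.png"
    with open(image_path, "wb") as f:
        f.write(base64.b64decode(image_b64))

    # Imported lazily so a missing layer fails with a clear error at runtime
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open(image_path),
                                       lang=lang, config=config)
    return {"statusCode": 200, "body": json.dumps({"text": text})}
```

The Pillow and Pytesseract imports live inside the handler so that they resolve against the layers you attached in the previous step.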
In Basic settings, increase the Memory and Timeout. For Tesseract, the more memory, the faster the API response will be (this needs to be tested). To be safe, you can set 1 min (for images with a lot of text) and 500+ MB (see image below for the setup).
Click Save. Time to test our lambda and check that everything works fine !
2.5. Test the lambda function
The image I’ll be testing (decoded version below) is put inside a .json file along with some Tesseract parameters.
The JSON body contains the encoded image and the Tesseract parameters. Check this link if you want to know more about Tesseract params.
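For illustration, a body of that shape can be built in Python like this (the key names image64, lang and config are my own placeholders; the actual demo may use different ones):

```python
import base64
import json


def build_body(image_path, lang="eng", config=""):
    """Encode an image file into a JSON body for the OCR lambda."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    return json.dumps({"image64": image_b64, "lang": lang, "config": config})
```

Paste the resulting string into the lambda test event (or a .json file) as-is.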
Now create a test event, copy/paste the JSON body above and save; you get something like this :
After saving, test your lambda; the response should be successful.
2.6. Create API
Create a Rest API as shown below
Actions > create Method
Specify the corresponding lambda function. There is no need to toggle proxy integration.
After creating the POST method, enable CORS by clicking : Actions > Enable CORS
2.7. Test API
For console testing : click on the POST method then click the TEST button (shown below)
Add the json (same used for lambda testing) inside the Request Body and click Test. You should get a 200 Status response.
If you get a similar response as above, it is time to deploy your API :
Click Actions > Deploy API, then copy the Invoke URL. You can also get the API URL from your lambda function console (in the API block).
Open Postman, create a new POST request, paste the URL, copy/paste the JSON in the Request Body and click Send. You should get a 200 status response.
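If you prefer scripting the call instead of Postman, the same POST can be made from Python with the standard library. The URL below is a placeholder for your own Invoke URL, and the image64/lang keys are the same illustrative field names as before:

```python
import base64
import json
from urllib import request


def build_ocr_request(api_url, image_path, lang="eng"):
    """Build a POST request carrying the base64-encoded image."""
    with open(image_path, "rb") as f:
        payload = {"image64": base64.b64encode(f.read()).decode(), "lang": lang}
    return request.Request(api_url,
                           data=json.dumps(payload).encode(),
                           headers={"Content-Type": "application/json"},
                           method="POST")


# Example (replace with your own Invoke URL):
# req = build_ocr_request("https://xxxx.execute-api.eu-west-1.amazonaws.com/prod",
#                         "sample.png")
# print(request.urlopen(req).read().decode())
```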
2.8. Bonus : Use serverless Api within a Web app
A simple serverless web app demo here shows how everything works (code in the repo).
Done!
I’m not sure this is the cleanest way to set everything up, but it does the job ! You should look at other frameworks like Serverless, which may be a great option if you want to automate the installation and make it more modular. You are also free to tweak the scripts and Dockerfiles to your needs. If we can do it with Tesseract, you should be able to do it with almost any type of task that transforms images (e.g. computer vision and/or deep learning). Best to come!
If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment below.
References :
[OCR Layer] https://github.com/bweigel/aws-lambda-tesseract-layer