AWS Knowledge Series — OCR with AWS Lambda & Tesseract

Published in Nerd For Tech · 6 min read · Jun 19, 2021

In today’s world, it is important that your platform, service, or application has genuine registered users. People trust your offering more when they know that the other users on the platform are as genuine as they are. There are many ways to weed out bots and impersonators; one popular approach is to have users take a picture of themselves holding a placard that shows a random text. This text is shown to the user during onboarding or profile verification. It is important to let users take the picture only with the camera of their current device, and not pick one from device storage or a photo album, as this reduces the chances of forgery considerably. If you have ever created an investor account with Zerodha, you would have uploaded a profile picture with a placard showing random numbers during registration.

To validate the profile, the backend service needs to check that the uploaded photo contains the random text that was shown to the user during the onboarding or profile verification workflow. This requires the backend to perform optical character recognition (OCR) on the uploaded picture. AWS offers easy-to-use OCR APIs as part of the Amazon Rekognition service. However, Rekognition is expensive: processing a million images will cost you USD 1000! In this article, I will show you how to use the open source OCR engine Tesseract within AWS Lambda to perform OCR.

What is Tesseract?

Tesseract (https://opensource.google/projects/tesseract) is an optical character recognition engine that runs on various operating systems. It was developed by HP and open sourced in 2005, and Google has sponsored its development since then. It recognises more than 100 languages out of the box. It ships with two sets of trained models for the supported languages: FAST and BEST. Tesseract is an executable that reads various image file types (JPEG, PNG, TIFF, etc.) and extracts the text present in them.

Profile Verification Workflow

We will build the following workflow. The objective of this exercise is to show how to use Tesseract within AWS Lambda, not to build a feature-complete profile verification workflow!

Getting Tesseract for AWS Lambda

AWS Lambda VMs run on Amazon Linux, so we need a Tesseract executable and libraries built for Amazon Linux. Fortunately, this hard work has already been done by Benjamin Genz (GitHub: https://github.com/bweigel, Twitter: https://twitter.com/dreigelb) and is available for download on GitHub. He has also documented how to use Docker to build Tesseract for Amazon Linux. Download the pre-built Tesseract binary and libraries from here: https://github.com/bweigel/aws-lambda-tesseract-layer/tree/master/ready-to-use/amazonlinux-2

Tesseract, NodeJS and AWS Lambda VM

Our Lambda will be built for the NodeJS 12.x runtime. The node-tesseract-ocr NodeJS module wraps the Tesseract executable using the child_process NodeJS module and exposes easy-to-use Node APIs that we can call from our Lambda function. However, node-tesseract-ocr expects tesseract to be installed and available in the environment where the NodeJS code runs, so we need to be able to run the tesseract executable in the VM that executes our Lambda code. How does one run an executable from Lambda? This is described here: https://aws.amazon.com/blogs/compute/running-executables-in-aws-lambda/. We have to bundle the tesseract executable and its libraries along with our Lambda function code. By default, the Lambda VM executables are located in /opt/bin, but we are packaging tesseract with our Lambda code, so how do we make the executable and libraries available in PATH so that they can be run? The answer is simple: modify the environment variables! This article describes the environment variables of the Lambda VM: https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html. Two environment variables are of interest:

PATH: We have to modify the PATH environment variable so that the tesseract executable can be found in the VM path.

LD_LIBRARY_PATH: This is the list of directories searched for libraries that are loaded dynamically. We have to modify it so that all shared libraries used by the tesseract executable can be found.

Lambda Functions

We will need to write two Lambda functions. The first generates the random 6-digit number that we show to the user; the second reports the profile verification status. In this sample all APIs are unauthenticated, but in production you would only allow registered users to call these APIs.

Lambda Function — Generating Profile ID and Random Numbers

This function performs the following tasks:

Generate a UUID to be used as the profile ID. In a real application this will be the user’s identity ID.
Generate a random 6-digit number.
Save the profile ID and random number in an S3 bucket as a JSON object for future use. In a real application you may store this in S3, DynamoDB, or an RDS database depending on your architecture. The idea is that we should be able to retrieve this information when the user sends us the picture, so that we can do the comparison.
Send the profile ID and random number to the client.
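
As a rough illustration of the first two steps, here is a minimal sketch that uses only the built-in crypto module. The function names and the ".json" object key are hypothetical and not taken from the sample repository, and crypto.randomUUID assumes a NodeJS runtime newer than 12.x (14.17 or later). The S3 write is left out so that the sketch runs anywhere.

```javascript
const crypto = require("crypto");

// Hypothetical helper: builds the record that the first Lambda would
// persist (for example as a JSON object in S3) and return to the client.
function createVerificationChallenge() {
  // Stand-in profile ID; a real application would use the user's identity ID.
  const profileId = crypto.randomUUID();
  // The upper bound of randomInt is exclusive, so this yields 100000..999999.
  const code = String(crypto.randomInt(100000, 1000000));
  return { profileId, code };
}

// Hypothetical naming scheme for the stored JSON object.
function challengeKey(profileId) {
  return `${profileId}.json`;
}

const challenge = createVerificationChallenge();
console.log(challengeKey(challenge.profileId), JSON.stringify(challenge));
```

Starting the range at 100000 means the code never begins with a zero, which avoids ambiguity if OCR drops a leading digit; this is a design choice of the sketch, not something the sample requires.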

Lambda Function — Get Profile Verification Status

This function makes some assumptions:

The user must have called the generate profile ID and random number API
The user must have uploaded the profile picture with random number to S3 bucket
The name of the profile picture file must be the profile ID so that we can locate and load it

This function needs access to the Tesseract executable and library modules that are compiled for Amazon Linux. Hence the folder structure for the Lambda has to be configured as follows:

You can see above that the tesseract binaries are located in a sub folder “tesseract”.

bin: This folder contains the tesseract executable.

lib: This folder contains the dynamic libraries used by tesseract.

tesseract/share/tessdata: This folder contains the trained model data used by tesseract. We are going to use the FAST version of the trained data model, as it is smaller and accurate enough for our use case.

The following code ensures that the VM environment variables are modified as discussed in the “Tesseract, NodeJS and AWS Lambda VM” section above.

// Inspired by https://aws.amazon.com/blogs/compute/running-executables-in-aws-lambda/
// to make sure that tesseract executable and libraries are available
// in path
const lambdaPath = process.env["LAMBDA_TASK_ROOT"];
const libPath = process.env["LAMBDA_TASK_ROOT"] + "/tesseract/lib";
const binPath = process.env["LAMBDA_TASK_ROOT"] + "/tesseract/bin";
const dataPath = process.env["LAMBDA_TASK_ROOT"] + "/tesseract/tesseract/share/tessdata";

// Add the tesseract in path
process.env["PATH"] = process.env["PATH"] + ":" + lambdaPath + ":" + binPath + ":" + libPath;

// Add path to libraries required by tesseract
process.env["LD_LIBRARY_PATH"] = process.env["LD_LIBRARY_PATH"] + ":" + lambdaPath + ":" + binPath + ":" + libPath;

// Path where the training data is located - "fast" training data for English is used in this sample.
process.env["TESSDATA_PREFIX"] = dataPath;
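
One thing to watch with plain string concatenation: if a variable such as LD_LIBRARY_PATH is not set (for example when running the code locally), the result starts with the literal text "undefined". A slightly more defensive variant could look like the sketch below. The appendPaths helper and the "/var/task" fallback for local runs are assumptions of this sketch, not part of the sample.

```javascript
// Hypothetical helper: joins an existing PATH-style value with extra
// directories, skipping empty or undefined entries.
function appendPaths(current, ...dirs) {
  return [current, ...dirs].filter(Boolean).join(":");
}

// Assumed fallback so the sketch also runs outside Lambda.
const taskRoot = process.env.LAMBDA_TASK_ROOT || "/var/task";

const tesseractEnv = {
  PATH: appendPaths(process.env.PATH, `${taskRoot}/tesseract/bin`),
  LD_LIBRARY_PATH: appendPaths(process.env.LD_LIBRARY_PATH, `${taskRoot}/tesseract/lib`),
  TESSDATA_PREFIX: `${taskRoot}/tesseract/tesseract/share/tessdata`,
};

// Calling Object.assign(process.env, tesseractEnv) would apply the values.
console.log(tesseractEnv);
```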

The Lambda function loads two objects from S3: the image file uploaded by the user and the JSON file containing the random code that was sent to the client. Using the node-tesseract-ocr module, we invoke Tesseract to extract the text from the image file and compare it with what was sent to the client to decide the verification status of the user profile.
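
The comparison step could be sketched as follows. This is an illustration under assumptions, not the code from the sample repository: the function names are hypothetical, and the recognizer is passed in as a parameter so that the sketch runs without Tesseract installed. In the Lambda, that parameter could be the recognize function of node-tesseract-ocr, which returns a promise for the recognised text.

```javascript
// Returns true when the expected code appears as a standalone run of
// digits in the OCR output; any other text around it is ignored.
function containsCode(ocrText, expectedCode) {
  const digitRuns = ocrText.match(/\d+/g) || [];
  return digitRuns.includes(expectedCode);
}

// `recognize` is injected: any async function that maps an image path
// to the recognised text will do.
async function verifyProfile(recognize, imagePath, expectedCode) {
  const text = await recognize(imagePath);
  return { verified: containsCode(text, expectedCode) };
}

// Usage with a stand-in recognizer that returns canned OCR output.
const fakeRecognize = async () => "Hello\n482913\n";
verifyProfile(fakeRecognize, "/tmp/profile.jpg", "482913").then((result) =>
  console.log(result)
);
```

Matching whole digit runs rather than using a substring search means that a longer number which merely contains the code does not count as a match.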

Improving Speed / Accuracy of Text Recognition

There are various things you can do to improve the accuracy of the text recognition:

Pre-process the image as suggested here: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#image-processing. You can use the sharp library for some of the image processing functions.
Since we know that we sent a random 6-digit number and that is what we expect to find in the image file, we can provide a pattern file that tells Tesseract to look for 6 digits. See the pattern.txt file in the Lambda code base and this article: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#dictionaries-word-lists-and-patterns
Use the BEST trained models instead of the FAST trained models. However, this increases the size of the Lambda code package, and recognition will be slower than with FAST.
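
To make the pattern file idea more concrete, here is a sketch that builds the argument list for the tesseract command-line tool, including its --user-patterns option. The helper name, the default page segmentation mode (6, a single uniform block of text), and the file names are assumptions of this sketch. The pattern file is assumed to contain a line such as \d\d\d\d\d\d, following the Tesseract pattern syntax in which \d stands for a digit; the actual pattern.txt in the sample may differ.

```javascript
// Hypothetical helper: builds the arguments for a call such as
//   tesseract <image> stdout -l eng --psm 6 --user-patterns pattern.txt
function buildTesseractArgs(imagePath, options = {}) {
  const { lang = "eng", psm = 6, patternsFile } = options;
  // "stdout" as the output base makes tesseract print the text.
  const args = [imagePath, "stdout", "-l", lang, "--psm", String(psm)];
  if (patternsFile) {
    args.push("--user-patterns", patternsFile);
  }
  return args;
}

console.log(
  buildTesseractArgs("/tmp/profile.jpg", { patternsFile: "pattern.txt" }).join(" ")
);
```

The resulting array can be passed to child_process.execFile together with the executable name "tesseract", once PATH has been set up as described earlier.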

Implementation & Testing

The entire sample is available on GitHub here — https://github.com/santhedan/ocrsample

The entire workflow can be tested using the testprofile.js NodeJS script. It uses the text-to-image and sharp NodeJS modules to composite an image containing the random 6 digits.