Replacing Google-OCR With Tesseract and Saving Thousands in Monthly Billings

Nitin Agarwal
Published in The Startup
Feb 19, 2021 · 9 min read

We had developed a feature for a game where we needed to create a room and constantly check the number of users who have joined the room.

This started off as an experiment, but as our users really liked the feature, very soon we had this running on 30 machines.

To figure out the number of users who had joined the room, we counted the occurrences of the “INVITE” string in the picture above. We used the Cloud Vision Text Detection API as it was very easy to use, and the results were accurate.

The only problem with this API was the high cost. We were paying approx 3,000 USD a month for about 2 million Cloud Vision API requests. In the coming months we had to increase the number of machines to around 150, which would mean paying ~15,000 USD a month.

A few months ago I had come across this article about using AI to improve your tennis serve. One of its points was the high cost of using Google APIs, and how training a custom model and running it on your own servers can reduce that cost. This gave us the idea to build a custom solution to fix our high-cost problem.

Tesseract

While looking for cheaper alternatives for OCR, we came across Tesseract. Tesseract is an OCR engine with support for Unicode and the ability to recognize more than 100 languages out of the box. It can be easily trained to recognize other languages and unique fonts, which was required in our use case.

We could also have used other means to figure out the bot count, such as using OpenCV to find the white pixels of the “INVITE” string. But we went ahead with Tesseract, as we also had a requirement to build a custom OCR for another project.

When using Tesseract, you don't really need to understand what is going on behind the scenes; you can use it like any other simple tool, as you will see in this article below.

But as an engineer, it is always good practice to understand the basics, as that will really help if you get stuck or are not getting the right results. Tesseract 4.0x+ uses LSTM neural networks. While this topic is complex in itself, I highly recommend watching the video series below to get an overview of what neural networks are. As an added benefit, the videos use image recognition itself as the example to explain the concepts. :)

Tesseract is easy to install: on macOS you can use brew install tesseract, and on Windows prebuilt Tesseract executables can be downloaded. The Tesseract version we used was 4.1.

Another option is to get the source code from https://github.com/tesseract-ocr/tesseract and build it locally using the simple commands described at https://tesseract-ocr.github.io/tessdoc/Compiling.html

After installation, you can try out this tool on the command line for simple English text like in the screenshot below

Before using tesseract on an image, it is recommended to clean the image of all the noise, as this can greatly improve text detection.

Text detection before and after image clean-up

As you can see from the above example, after merely cleaning the image, Tesseract is able to return the correct result.
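The clean-up step boils down to binarization: pushing every pixel to pure black or pure white so the text stands out from the noise. Here is a minimal pure-Python sketch of the idea with a fixed threshold; a real pipeline would use OpenCV or Pillow (and typically a smarter method like Otsu's threshold), so treat this only as an illustration of the concept.

```python
# Minimal sketch of image clean-up: binarize a grayscale image with a
# fixed threshold so text becomes pure black on a pure white background.
# (Illustrative only; our actual pipeline used image libraries, not lists.)

def binarize(pixels, threshold=128):
    """pixels: 2-D list of grayscale values 0-255. Returns a 0/255 image."""
    return [[255 if p >= threshold else 0 for p in row] for row in pixels]

# A tiny 2x3 "image": light background with darker text pixels.
noisy = [[240, 30, 200],
         [50, 220, 10]]
clean = binarize(noisy)
print(clean)  # [[255, 0, 255], [0, 255, 0]]
```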

But if you try an image with less clean text and an uncommon font, as with the image below, Tesseract is not able to detect the text correctly.

But this problem can be easily fixed by a simple 10-minute training process.

Training Process

Let's walk through an example of the training process. If you are training Tesseract to identify a specific string in an image, even 3–5 training images should be enough; in my case we had to recognise the “INVITE” string.

Steps Involved

  • Install Tesseract with training tools. You can find the steps in this article.

Issues with installing Tesseract training Tools

Those who do not run into issues with installing the training tools can skip this section.

I use macOS Catalina, and due to a dependency issue I was not able to install the training tools.

As a workaround, I used an Ubuntu-based Docker image to download and build the Tesseract training tools, and used commands like docker cp and docker exec -it <container> /bin/bash to train for the new font.

Dockerfile: Tesseract 4.1 installation

FROM ubuntu:18.04
RUN apt update
# Install everything we'd need in a build environment within Docker
RUN apt install \
wget \
git \
build-essential \
curl \
autoconf \
automake \
libtool \
pkg-config \
apt-transport-https \
ppa-purge \
zsh \
screen \
byobu \
parallel \
iperf3 \
iotop \
atop \
nethogs \
htop \
software-properties-common \
-y
################################################################################
###### Everything above this line can be part of a ubuntu-build base image #####
################################################################################
# Install Dependencies for the software we wish to compile
RUN apt install \
libicu-dev \
libpango1.0-dev \
libcairo2-dev \
libleptonica-dev \
-y
# Download software and extract to a working directory
ADD https://github.com/tesseract-ocr/tesseract/archive/4.1.0.tar.gz /
RUN tar -xf 4.1.0.tar.gz
# Switch to this working directory
WORKDIR /tesseract-4.1.0
# Run build commands
RUN ./autogen.sh \
&& ./configure --disable-shared \
&& make \
&& make install \
&& ldconfig \
&& make training \
&& make training-install
RUN apt-get install ffmpeg libsm6 libxext6 -y
RUN mkdir train

Building the above Docker image will take some time.

docker run -it --entrypoint=/bin/bash nitin-tesseract:1

The training process is explained in great detail in this article; I will summarise the steps below.

  • Install jTessBoxEditor; this tool will be used to create the boxes used for training.
  • Go to Tools, click on Merge TIFF, choose PNG as the file format, select the image you want to train on, and save it as invite.font.exp0

Here invite is the name I gave to the new font.

  • Open the terminal and run the command below to create the boxes
tesseract --psm 6 --oem 3 --dpi 96 invite.font.exp0.tif invite.font.exp0 makebox

psm — Page Segmentation Mode; this affects how Tesseract splits the image into lines of text and words

oem — OCR Engine Mode; you can choose between various engines, like the legacy engine or the LSTM engine introduced in version 4.

https://ai-facets.org/tesseract-ocr-best-practices/
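If you end up invoking Tesseract with the same flags over many images, it helps to build the command once in code. Below is a small hypothetical helper (tesseract_cmd is our name, not a library function) that assembles the CLI call shown above; the flags mirror Tesseract 4.1's --psm/--oem/--dpi options.

```python
# Hypothetical helper that assembles the Tesseract CLI call used above,
# so --psm/--oem/--dpi don't have to be repeated by hand for every image.

def tesseract_cmd(image, output_base, psm=6, oem=3, dpi=96, extra=()):
    cmd = ["tesseract",
           "--psm", str(psm),   # page segmentation mode
           "--oem", str(oem),   # OCR engine mode
           "--dpi", str(dpi),
           image, output_base]
    cmd.extend(extra)           # e.g. ("makebox",) to emit a .box file
    return cmd

cmd = tesseract_cmd("invite.font.exp0.tif", "invite.font.exp0",
                    extra=("makebox",))
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```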

  • Now open the Box Editor tab in jTessBoxEditor, click Open, and select the invite.font.exp0.tif file. You will see an image like the one below with bounding boxes and predictions. You now need to manually fix the boxes and the predictions.
Manually fixing the bounding boxes in jTessBoxEditor

Run the commands below in the same directory containing the .tif and .box files to create the traineddata file.

echo font 0 0 0 0 0 > font_properties
tesseract invite.font.exp0.tif invite.font.exp0 nobatch box.train
unicharset_extractor invite.font.exp0.box
shapeclustering -F font_properties -U unicharset -O invite.unicharset invite.font.exp0.tr
mftraining -F font_properties -U unicharset -O invite.unicharset invite.font.exp0.tr
cntraining invite.font.exp0.tr
mv normproto invite.normproto
mv pffmtable invite.pffmtable
mv shapetable invite.shapetable
mv inttemp invite.inttemp
combine_tessdata invite.

After finishing the training steps properly, you should get your font_name.traineddata; in my case I got invite.traineddata.

This file can now be used to predict “invite” text from the image.

In our use case, we only trained for the “INVITE” string, and the model predicts it correctly. If other characters are provided, the model does not work correctly, but if we trained for those characters with the above-mentioned steps, Tesseract would easily be able to predict that text too.

Running Tesseract in the Cloud

Training and running this solution locally was just one part of the problem. We also needed to run it cheaply in the cloud, with the ability to scale automatically as traffic increases.

We created a simple Flask application that accepts an image and returns the bot count in it.
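The core of such an endpoint is small: run Tesseract on the uploaded image and count occurrences of the target string in the OCR text. Below is a sketch of that counting step only; the Flask route and pytesseract plumbing are omitted, and count_users, ROOM_SIZE and the marker logic are illustrative assumptions, not our exact code.

```python
# Core of the endpoint: given the raw text Tesseract returned for the
# room screenshot, derive the player count from the "INVITE" buttons.
# Assumption for illustration: each empty slot still shows "INVITE",
# so joined players = room capacity - number of INVITE occurrences.

ROOM_SIZE = 4  # assumed room capacity, for illustration only

def count_users(ocr_text, room_size=ROOM_SIZE, marker="INVITE"):
    invites = ocr_text.upper().count(marker)
    return room_size - invites

print(count_users("Player1\nINVITE\nINVITE\nINVITE"))  # 1 user has joined
```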

Dockerfile: Python and Flask app

RUN apt-get update \
&& apt-get install -y python3-pip python3-dev \
&& cd /usr/local/bin \
&& ln -s /usr/bin/python3 python \
&& pip3 install --upgrade pip \
&& apt-get -y install sudo \
&& apt-get clean \
&& apt-get autoremove

WORKDIR /app

ADD requirements.txt ./

RUN pip3 install -r requirements.txt

ADD app ./
# the tessdata folder contains the trained font
ADD tessdata ./tessdata/

ENV FLASK_APP=app
ENV FLASK_DEBUG=1
EXPOSE 5000
CMD gunicorn -w 2 --threads 2 app:app --bind 0.0.0.0:5000 --backlog 1024 --timeout 120

We chose Google Kubernetes Engine to run our application. Image detection is a CPU-intensive task, and GKE gives us the ability to choose whichever of its many machine types best suits our use case, along with auto-scaling, monitoring, etc. out of the box.

GCP offers different kinds of machine types, like the compute-optimized C2 and general-purpose machines like N2. We did some basic performance testing using Apache JMeter to check throughput. After a few simple tests, we found that N2 and C2 machines with the same CPU count provided similar performance. Since N2 is the cheaper of the two, we went ahead with the n2-highcpu-4 machine type.

+--------------+-------+-------------+
| Machine name | vCPUs | Memory (GB) |
+--------------+-------+-------------+
| n2-highcpu-4 |     4 |           4 |
+--------------+-------+-------------+

After load testing, we found that a single instance was able to provide a throughput of approx 50 req/sec.

In terms of API performance, while the Google API took around 1 second to respond, we got an average latency of around 250 ms.

Cost Savings

The cost of adding a new node of machine type n2-highcpu-4 is around 85 USD a month, and this single instance can handle at least 8x–10x our current traffic. This is a huge saving compared to the almost 3,000–3,500 USD a month we were paying, which was expected to grow many-fold in the coming months.
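A quick back-of-the-envelope calculation with the figures from this article shows why a single node is enough and how large the saving is (the numbers are the rounded ones quoted above, not billing-exact):

```python
# Back-of-the-envelope comparison using the figures from this article.
cloud_vision_monthly = 3000       # USD/month for ~2M Cloud Vision requests
requests_per_month = 2_000_000
node_monthly = 85                 # USD/month for one n2-highcpu-4 node

cost_per_request = cloud_vision_monthly / requests_per_month
print(f"Cloud Vision: ~${cost_per_request:.4f} per request")

# One node sustains ~50 req/sec; average traffic is far below that.
avg_req_per_sec = requests_per_month / (30 * 24 * 3600)
print(f"Average load: ~{avg_req_per_sec:.2f} req/sec vs ~50 req/sec capacity")

print(f"Monthly saving: ~${cloud_vision_monthly - node_monthly}")
```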

Conclusion

Tesseract is simple to use, accurate, and an extremely cheap tool for building a custom OCR solution. It can detect normal fonts out of the box after simple image clean-up, and can detect complex new fonts after a simple 10-minute training process. It provides extremely fast performance without requiring any high-end compute power.

In our use case, we had a very simple requirement of detecting a fixed string, but we also tried detecting player names in the game-room image (which can include non-English characters), and after the simple training process the results were equally encouraging.

Recently, we got an optimization requirement to reduce the resolution to 960×540, as this would benefit the servers running the automation setup. After this change, we saw that Google OCR no longer gave accurate results, while with minimal effort we were able to tune our Tesseract solution to keep giving us 100% accurate results. :)
