Optical Character Recognition (OCR): Image, OpenCV, pytesseract and easyocr

Nanda Coumar
Jul 19, 2020


In this article, I would like to compare the performance of two OCR libraries: pytesseract and easyocr.

So, what does OCR mean?

Optical Character Recognition or OCR, is a technology that enables you to convert different types of documents, such as scanned paper documents, PDF files or images captured by a digital camera into editable and searchable data.

About Tesseract:

Tesseract is an open source text recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract printed text from images. It supports a wide variety of languages. Tesseract doesn't have a built-in GUI, but several are available from the project's 3rdParty page. Tesseract is compatible with many programming languages and frameworks through wrappers, which are also listed in the project documentation. It can be used with the existing layout analysis to recognize text within a large document, or in conjunction with an external text detector to recognize text from an image of a single text line.

How does Tesseract work?

Tesseract's neural network engine grew out of the OCRopus model, which was written in Python and has a C++ fork called CLSTM. CLSTM is an implementation of the LSTM recurrent neural network model in C++.

OCR process

Tesseract 4.0 was an effort to clean up the code and add a new LSTM model. The input image is processed in boxes (rectangles), line by line, and each line is fed into the LSTM model, which produces the output. The image above visualizes how this works.

About EasyOCR:

EasyOCR is a Python-based OCR library that extracts text from images. It's a ready-to-use OCR with 40+ languages supported, including Chinese, Japanese, Korean and Thai. It's an open source project licensed under Apache 2.0.

How does EasyOCR work?

EasyOCR Framework

EasyOCR performs certain pre-processing steps (gray scaling, etc.) within the library before it extracts the text. It applies the CRAFT (Character Region Awareness for Text Detection) algorithm to detect the text. CRAFT is a scene text detection method that detects text areas by exploring each character and the affinity between characters. The recognition model is a CRNN. Sequence labelling is performed by an LSTM and CTC (Connectionist Temporal Classification); CTC is meant for labelling unsegmented sequence data with RNNs.

Installation of both:

In the image above, I am installing pytesseract using pip. For easyocr I wanted the latest version, so I cloned it from git and installed it from source.

pytesseract — API

By default, Tesseract expects two main configs: the page segmentation mode and the OCR engine mode. There are 14 page segmentation modes (psm), numbered 0 to 13.

The other important parameter is the OCR engine mode (oem). Tesseract has two OCR engines: the legacy Tesseract engine and the LSTM engine. There are four modes of operation, chosen using the --oem option.
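As a rough sketch of how these two settings reach Tesseract through pytesseract: both are passed as command-line style flags in the `config` string of `pytesseract.image_to_string`. The helper names `build_config` and `ocr_image` are made up for illustration, and the sketch assumes that the Tesseract binary, pytesseract and Pillow are installed.

```python
def build_config(psm: int = 3, oem: int = 3) -> str:
    """Build a Tesseract config string from a page segmentation and engine mode."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    # oem: 0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM, 3 = default
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def ocr_image(path: str, psm: int = 3, oem: int = 3, lang: str = "eng") -> str:
    """Run Tesseract on an image file and return the recognized text."""
    # Imported lazily so build_config works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(path), lang=lang, config=build_config(psm, oem)
    )

# Example call (the file name is a placeholder, not an image from this article):
# text = ocr_image("quote.png", psm=6, oem=1)
```

Here psm 6 treats the image as a single uniform block of text and oem 1 selects the LSTM engine only; other combinations may suit other layouts better.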

EasyOCR — API

EasyOCR supports multiple hyper-parameters for the readtext method. The hyper-parameters are grouped by the layer of the processing pipeline they affect. For example, the decoding layer supports “greedy”, “beamsearch” and “wordbeamsearch”. Beam search is a way of decoding noisy text into a clean, readable format. EasyOCR also supports hyper-parameters that belong to CRAFT. The full list of hyper-parameters that can be used with easyocr is in its documentation. EasyOCR also supports GPU processing.
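A minimal sketch of passing a few of these hyper-parameters to `readtext` is shown here. The helpers `readtext_options` and `read_image` are hypothetical names. The defaults are the ones I understand EasyOCR to use, so check them against the version you have installed. Note that `easyocr.Reader` downloads its model weights on first use.

```python
VALID_DECODERS = ("greedy", "beamsearch", "wordbeamsearch")


def readtext_options(decoder: str = "greedy", beam_width: int = 5,
                     text_threshold: float = 0.7, low_text: float = 0.4) -> dict:
    """Collect a few readtext keyword arguments, checking the decoder name."""
    if decoder not in VALID_DECODERS:
        raise ValueError(f"decoder must be one of {VALID_DECODERS}")
    return {
        "decoder": decoder,                # decoding layer
        "beamWidth": beam_width,           # only relevant for the beam search decoders
        "text_threshold": text_threshold,  # CRAFT: text confidence threshold
        "low_text": low_text,              # CRAFT: lower bound on the text score
    }


def read_image(path: str, languages=("en",), use_gpu: bool = False, **options):
    """Run EasyOCR on an image file and return its list of detections."""
    # Imported lazily so readtext_options works without easyocr installed.
    import easyocr

    reader = easyocr.Reader(list(languages), gpu=use_gpu)
    return reader.readtext(path, **readtext_options(**options))

# Example call (placeholder file name):
# results = read_image("quote.png", decoder="beamsearch", beam_width=10)
```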

Comparison:

Initially I took a clean text image with a white background as shown below.

Example Image

With this image, Tesseract outputs the complete sentences. EasyOCR, however, outputs an array with the coordinates of the text, the actual text and the confidence value of the text.

output of tesseract

As shown above, Tesseract extracts the complete sentence correctly.

output of EasyOCR

EasyOCR's output is a nested array. The first element of each entry gives the bounding-box coordinates, which can be used to mark the text within the image. Next comes the actual text, and last the confidence value. When we form a sentence from the extracted text, the order in which the pieces are extracted does not always match the reading order.

In the above example, the actual text is “sometimes, you just need a break. in a beautiful place. alone. to figure everything out.”

The output from easyOCR is “sometimes just you need break in a a beautiful place alone. to figure everything out”

EasyOCR output as sentence.
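One possible way to reduce this ordering problem is to group the detected boxes into lines by their vertical position and then sort each line from left to right. This is my own sketch, not something EasyOCR does itself. The function name `to_sentence`, the `line_tolerance` parameter and the sample coordinates are all made up for illustration.

```python
from statistics import median


def to_sentence(results, line_tolerance: float = 0.5) -> str:
    """Join EasyOCR-style (bbox, text, confidence) entries in reading order."""
    items = []
    for bbox, text, _confidence in results:
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        items.append({
            "text": text,
            "left": min(xs),
            "center_y": (min(ys) + max(ys)) / 2,
            "height": max(ys) - min(ys),
        })
    if not items:
        return ""

    # Boxes whose vertical centers are closer than this belong to the same line.
    threshold = line_tolerance * median(item["height"] for item in items)

    items.sort(key=lambda item: item["center_y"])
    lines = [[items[0]]]
    for item in items[1:]:
        if abs(item["center_y"] - lines[-1][-1]["center_y"]) <= threshold:
            lines[-1].append(item)
        else:
            lines.append([item])

    # Within each line, read from left to right.
    ordered = [item for line in lines
               for item in sorted(line, key=lambda item: item["left"])]
    return " ".join(item["text"] for item in ordered)


# Made-up detections, deliberately listed out of reading order.
sample = [
    ([[10, 10], [110, 10], [110, 40], [10, 40]], "sometimes,", 0.98),
    ([[180, 12], [230, 12], [230, 42], [180, 42]], "just", 0.91),
    ([[120, 11], [170, 11], [170, 41], [120, 41]], "you", 0.95),
    ([[10, 60], [200, 60], [200, 90], [10, 90]], "need a break.", 0.88),
]
print(to_sentence(sample))  # prints: sometimes, you just need a break.
```

This heuristic assumes roughly horizontal text lines. Rotated or curved text would need a different grouping rule.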

Next, I took another image, where there is a background along with the text. The image is below.

Image with background

In this case, Tesseract was not able to recognize the text. I tried different pre-processing steps such as gray scaling, noise removal, erosion and so on. The pre-processed images are shown below.

pre-processed images
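The kind of pre-processing described above can be sketched with OpenCV roughly as follows. This is not the exact code used for the images in this article. The function name `preprocess`, the step names, and the kernel sizes are assumptions chosen for illustration.

```python
PREPROCESS_STEPS = ("grayscale", "denoise", "erode")


def preprocess(image, steps=PREPROCESS_STEPS):
    """Apply the named steps, in order, to a BGR image as loaded by cv2.imread."""
    unknown = [step for step in steps if step not in PREPROCESS_STEPS]
    if unknown:
        raise ValueError(f"unknown pre-processing steps: {unknown}")

    # Imported lazily so the step names can be validated without OpenCV installed.
    import cv2
    import numpy as np

    for step in steps:
        if step == "grayscale":
            image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        elif step == "denoise":
            # A median blur with a 5x5 window removes small speckles.
            image = cv2.medianBlur(image, 5)
        elif step == "erode":
            # A 3x3 kernel and a single iteration are arbitrary starting values.
            image = cv2.erode(image, np.ones((3, 3), np.uint8), iterations=1)
    return image

# Example call (placeholder file name):
# import cv2
# cleaned = preprocess(cv2.imread("quote_with_background.png"))
```

The result can then be passed to `pytesseract.image_to_string`, which accepts NumPy arrays as well as Pillow images.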

EasyOCR behaved pretty well on these kinds of images and was able to extract the text in a much more meaningful manner.

An Eagle’s Eye View of OCR Accuracy:

The quality of your source image matters most. If the quality of the original source image is good, i.e. if human eyes can read it clearly, it will be possible to achieve good OCR results. But if the original source itself is not clear, the OCR results will most likely include errors. The better the quality of the original source image, the easier it is to distinguish characters from the rest, and the higher the accuracy of OCR will be.

Common metrics to evaluate an OCR system are Character Error Rate (CER) and Word Error Rate (WER).
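Both metrics are usually defined as the edit (Levenshtein) distance between the OCR output and the reference text, divided by the length of the reference. CER counts characters and WER counts words. The sketch below uses that common definition; the function names are my own, and dedicated libraries may normalize the text differently before comparing.

```python
def edit_distance(reference, hypothesis) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because insertions are counted too, both rates can exceed 1.0 when the OCR output is much longer than the reference.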

Based on my observations I have listed down the advantages of tesseract over easyOCR and vice-versa.

Advantages of Tesseract:

  • Tesseract supports a customized pre-processing layer based on the user's needs.
  • Tesseract works pretty fast with multiple images.
  • Tesseract gives the output as a sentence, which is not the case with easyOCR.
  • Tesseract's performance is directly linked to the quality of the image.
  • Tesseract has a configuration to extract only digits.
  • Tesseract also supports training on customized data; the details are in the Tesseract training documentation.
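The digits-only configuration mentioned in the list can be sketched with the `tessedit_char_whitelist` variable. How strictly the whitelist is honoured depends on the Tesseract version and engine mode, so treat this as a starting point. The helper names `digits_config` and `read_digits` are hypothetical.

```python
def digits_config(psm: int = 6) -> str:
    """Config string that asks Tesseract to recognize only the digits 0-9."""
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def read_digits(path: str) -> str:
    """Run Tesseract on an image file, restricted to digits."""
    # Imported lazily so digits_config works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), config=digits_config())
```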

Advantages of EasyOCR:

  • EasyOCR supports GPU execution and performs well on a GPU.
  • EasyOCR provides the confidence of the extracted text, which can be used for further analysis.
  • EasyOCR works better with noisy images than Tesseract.
  • As EasyOCR uses CTC, It would bring better results when the
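As an illustration of using the confidence value for further analysis, the sketch below separates detections into confident and uncertain ones. The 0.5 threshold and the name `split_by_confidence` are arbitrary choices for this example, not recommendations from EasyOCR.

```python
def split_by_confidence(results, min_confidence: float = 0.5):
    """Split EasyOCR-style (bbox, text, confidence) entries by confidence."""
    confident, uncertain = [], []
    for _bbox, text, confidence in results:
        target = confident if confidence >= min_confidence else uncertain
        target.append((text, confidence))
    return confident, uncertain
```

The uncertain entries could, for example, be flagged for manual review instead of being dropped.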

Limitations of both Tesseract and EasyOCR:

I would like to point out the general limitations of both Tesseract and EasyOCR.

  • In general, poor quality scans may produce poor quality OCR.
  • If a document contains languages outside of those given in the LANG arguments, results may be poor.
  • On handwritten text, both give poor results.
  • Neither does well with images affected by artifacts, including partial occlusion, distorted perspective and complex backgrounds.

Conclusions and Future Work:

Both Tesseract and EasyOCR are good at scanning clean documents and give high accuracy on them. Both support LSTM.

Future work would be to analyse the behaviour of Tesseract and easyOCR on invoice documents.


Thanks for reading..!!
