OCR Engine Comparison — Tesseract vs. EasyOCR

Published in The Startup, Jul 28, 2020

Intro: Optical Character Recognition (OCR) is becoming more popular as document digitization advances. More and more companies are looking to automate document handling, and OCR plays a vital role in processing image-based documents.

Summary: This article discusses the main differences between Tesseract and EasyOCR, two popular free OCR engines, as used through their Python APIs and based on the images I tested.

The main function I used for pytesseract (v0.3.4):

pytesseract.image_to_pdf_or_hocr(file, extension='hocr')

The main function I used for easyocr (v1.1.8):

reader = easyocr.Reader(['en'], gpu=True)
reader.readtext(file)

1. Output format

[Image: input data]

Tesseract: bytes in hOCR format (XML), with a bounding box (x1, y1, x2, y2) giving the coordinates of the text.

EasyOCR: a list in which each bounding box is given as four corner points, [[x1, y1], [x2, y1], [x2, y2], [x1, y2]], giving the coordinates of the text. Note that the output for alphabetic text is lowercased.
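Because the two engines return differently shaped results, it can help to normalize both into one structure before comparing them. The following is a minimal sketch, not the code used for this article. The function names are hypothetical, and SAMPLE_HOCR is a hand-written fragment that mimics the ocrx_word spans found in hOCR output rather than real Tesseract output. It also assumes that each EasyOCR result is a (box, text, confidence) tuple.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; real hOCR output is longer.
SAMPLE_HOCR = b"""<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 200 50">
   <span class="ocrx_word" title="bbox 10 12 80 38; x_wconf 96">carrier</span>
   <span class="ocrx_word" title="bbox 90 12 170 38; x_wconf 93">advances</span>
  </div>
 </body>
</html>"""

# Hand-written stand-in for the list returned by reader.readtext(file).
SAMPLE_EASYOCR = [([[10, 12], [80, 12], [80, 38], [10, 38]], "carrier", 0.98)]


def words_from_hocr(hocr_bytes):
    """Return [(text, (x1, y1, x2, y2)), ...] for each word span in hOCR."""
    root = ET.fromstring(hocr_bytes)
    words = []
    for element in root.iter():
        if element.get("class") != "ocrx_word":
            continue
        text = "".join(element.itertext()).strip()
        # The title attribute holds ';'-separated properties such as 'bbox'.
        for field in element.get("title", "").split(";"):
            parts = field.split()
            if parts and parts[0] == "bbox":
                x1, y1, x2, y2 = (int(value) for value in parts[1:5])
                words.append((text, (x1, y1, x2, y2)))
    return words


def words_from_easyocr(results):
    """Collapse each four-corner box to (x1, y1, x2, y2)."""
    words = []
    for box, text, _confidence in results:
        xs = [int(point[0]) for point in box]
        ys = [int(point[1]) for point in box]
        words.append((text, (min(xs), min(ys), max(xs), max(ys))))
    return words
```

With both outputs in the same (text, box) form, the results for the same region of an image can be compared directly.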

2. Accuracy

I tested accuracy in two scenarios, text and numbers, with 1000 samples each. I generated random words or numbers on a blank image and used Tesseract and EasyOCR to parse the image.

Scenario 1 (alphabets): 1000 sample images, each containing two words generated by random_words.

[Image: input data]

Scenario 2 (numbers): 1000 sample images, each containing a number with five integer digits and two decimal places.

[Image: input data]
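The sample generation for the two scenarios can be sketched roughly as follows. This is an illustration under assumptions, not the original test code: the WORDS list is a hypothetical stand-in for the random_words package, the 200×50 image size is borrowed from the speed test, and the text position and font fallback are my own choices.

```python
import random

# Hypothetical stand-in word list; the article used the random_words package.
WORDS = ["carrier", "advances", "window", "garden", "silver", "planet"]


def random_two_words(rng):
    """Scenario 1 text: two randomly chosen words."""
    return " ".join(rng.choice(WORDS) for _ in range(2))


def random_number_text(rng):
    """Scenario 2 text: five integer digits and two decimals, like '29977.23'."""
    return f"{rng.randint(10000, 99999)}.{rng.randint(0, 99):02d}"


def render_sample(text, path, size=(200, 50)):
    """Draw black text on a white image and save it (requires Pillow)."""
    # Imported here so the text generators above work without Pillow installed.
    from PIL import Image, ImageDraw, ImageFont

    try:
        font = ImageFont.truetype("arial.ttf", 15)  # assumes Arial is installed
    except OSError:
        font = ImageFont.load_default()  # fallback; not the font used in the tests
    image = Image.new("RGB", size, "white")
    ImageDraw.Draw(image).text((10, 15), text, fill="black", font=font)
    image.save(path)
```

A call such as render_sample(random_number_text(random.Random(0)), "number_0.png") would write one sample image. Keeping the generated text alongside each file gives the ground truth needed for scoring later.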

Detailed comparison and the most common errors I noticed: [table image]

It is interesting to see that Tesseract does a better job at alphabet recognition, while EasyOCR does better at number recognition. In addition, the two engines struggle with different characters. For example, Tesseract tends to read something like 29977.23 as 2997.23, or carrier as cartier. EasyOCR, on the other hand, is more likely to read 94268.1 as 94268, or advances as atvances.
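Errors like these can be tallied automatically by diffing the expected string against the OCR output. The sketch below uses Python's standard difflib module. The helper names are hypothetical, and this is one possible way to score results rather than the exact method used for the numbers above.

```python
from difflib import SequenceMatcher


def char_errors(expected, actual):
    """List (operation, expected_part, actual_part) for each mismatch."""
    matcher = SequenceMatcher(None, expected, actual)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def exact_match_accuracy(pairs):
    """Fraction of (expected, actual) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(expected == actual for expected, actual in pairs) / len(pairs)


print(char_errors("carrier", "cartier"))    # one replaced character: r -> t
print(char_errors("29977.23", "2997.23"))   # one dropped digit: 7
```

Counting the reported operations over all samples gives a table of the most frequent confusions per engine.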

3. Speed

I tested the speed on both CPU and GPU machines with 1000 samples. The sample images are 200×50 pixels.

[Image: input data for CPU test]

[Image: input data for GPU test]

Results: [table image]

In terms of speed, Tesseract outperforms EasyOCR on CPU, while EasyOCR performs amazingly on GPU.
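A speed comparison of this kind can be sketched with a small timing helper that accepts any OCR callable. The helper name and the warm-up step are my assumptions. A warm-up call keeps one-time setup cost, such as loading a model, out of the measured time.

```python
import time


def average_seconds_per_image(ocr_fn, images, warmup=1):
    """Run ocr_fn over images and return the mean wall-clock time per image."""
    # Untimed warm-up calls, so one-time setup does not skew the result.
    for image in images[:warmup]:
        ocr_fn(image)
    start = time.perf_counter()
    for image in images:
        ocr_fn(image)
    elapsed = time.perf_counter() - start
    return elapsed / len(images)


# Hypothetical usage, assuming pytesseract and easyocr are installed and
# `files` is a list of image paths:
#   reader = easyocr.Reader(['en'], gpu=True)
#   average_seconds_per_image(reader.readtext, files)
#   average_seconds_per_image(
#       lambda f: pytesseract.image_to_pdf_or_hocr(f, extension='hocr'), files)
```

Creating the EasyOCR reader once, outside the timed loop, keeps its construction time out of the per-image figure.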

The code for the accuracy and speed tests can be found in the chejuiliao/ocr_engines repository listed under References.

Conclusion

In my testing, Tesseract performs better on alphabet recognition, while EasyOCR does a better job on numbers. If your document is alphabet-heavy, you may want to favor Tesseract. In addition, the output from EasyOCR is lowercased, so if capitalization matters for your processing, you should also use Tesseract. On the other hand, if your document contains a lot of numbers, you may favor EasyOCR. If you want the most accurate results, consider a hybrid process that uses both engines.
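One way such a hybrid process might look is sketched below. It is purely illustrative and was not tested in this article. The function name and the rule are my assumptions: when a reading looks numeric, take EasyOCR's text, otherwise take Tesseract's, which also preserves capitalization.

```python
import re

# A reading counts as "numeric" if it is digits with optional '.' or ','.
_NUMERIC = re.compile(r"\d[\d.,]*")


def choose_reading(tesseract_text, easyocr_text):
    """Pick one reading of the same region from the two engines' outputs."""
    tesseract_text = tesseract_text.strip()
    easyocr_text = easyocr_text.strip()
    looks_numeric = bool(
        _NUMERIC.fullmatch(easyocr_text) or _NUMERIC.fullmatch(tesseract_text)
    )
    if easyocr_text and looks_numeric:
        return easyocr_text
    # Fall back to EasyOCR only when Tesseract returned nothing.
    return tesseract_text or easyocr_text
```

A rule like this inherits each engine's remaining errors, so it would need to be checked against labeled samples before being relied on.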

When it comes to speed, Tesseract is more favorable on a CPU machine, but EasyOCR runs extremely fast on a GPU machine.

These results are based on images with black text on a white background, set in Arial at font size 15. I have noticed that when an image contains more content, such as outlines or other text, the result for the same coordinates may change.

These are just my personal opinions based on my tests. Although this article covers only two of many OCR engines, I hope it helps a little when choosing which OCR engine to start with. Feel free to leave comments; I would love to discuss this in more detail.

References

tesseract-ocr/tesseract

This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new…

github.com

JaidedAI/EasyOCR

Ready-to-use OCR with 40+ languages supported including Chinese, Japanese, Korean and Thai. See this Colab Demo. You…

github.com

How to use - random_words documentation

from random_words import LoremIpsum >>> li = LoremIpsum () >>> li . get_sentence () 'Luctus molestie mazim netus…

randomwords.readthedocs.io

Python - difference between two strings

I'd like to store a lot of words in a list. Many of these words are very similar. For example I have word…

stackoverflow.com

Selectors - Scrapy 2.2.1 documentation

Scrapy selectors are instances of class constructed by passing either object or markup as an unicode string (in…

docs.scrapy.org

Create images with Python PIL and Pillow and write text on them

Pillow is a fork of PIL. You should use Pillow these days. Before you can use it you need to install the Pillow…

code-maven.com

chejuiliao/ocr_engines

This project is exploring different ocr engines and comparing them This document is for people who want to observe the…

github.com


Written by Chejui Liao