OCR — Tesseract with Image Pre-processing

Wired Wisdom · Published in The Startup · 4 min read · May 21, 2020

Let’s look at how to work with Tesseract, a very useful engine for OCR (optical character recognition) that rarely gives good results unless the image is pre-processed first. I’ll try to keep it as simple as possible.

Let’s install the required libraries first.

# Installing tesseract
$ sudo apt-get install tesseract-ocr
# Installing other important libraries
$ pip install opencv-python
$ pip install pillow
$ pip install pytesseract
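Once the commands above have run, it can be worth checking that everything is visible from Python. The snippet below is a minimal sketch using only the standard library; `check_ocr_setup` is a hypothetical helper name, and it only checks that the `tesseract` binary is on the PATH and that the three packages can be found, not that they work correctly.

```python
import importlib.util
import shutil


def check_ocr_setup() -> dict:
    """Report which pieces of the OCR toolchain can be found."""
    # The tesseract-ocr package provides a `tesseract` executable.
    status = {"tesseract": shutil.which("tesseract") is not None}
    # Import names differ from pip names: opencv-python -> cv2, pillow -> PIL.
    for module in ("cv2", "PIL", "pytesseract"):
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_ocr_setup().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```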

In my case I have a collection of PDF files. I first tried libraries such as textract to pull the text straight out of the PDFs, but that didn’t help. So I decided to convert each PDF to an image with a library called pdf2image and save the result in a folder for the remaining steps.

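from pdf2image import convert_from_path
import os


class ExtractFeatures:
    def __init__(self, filename):
        # Source PDFs are expected in ./Dataset/samples
        dir_path = os.getcwd() + '/Dataset/samples'
        self.full_path = dir_path + '/' + filename
        self.filename = filename
        # Output image path: ./Dataset/sample_images/<name>.jpg
        self.outputDir = os.getcwd() + '/Dataset/sample_images/' + filename.split('.')[0] + '.jpg'
        # Render the PDF at 500 DPI; this returns one PIL image per page
        pages = convert_from_path(self.full_path, 500)
        # Every page is saved to the same path, so for a multi-page PDF
        # only the last page remains on disk
        for page in pages:
            page.save(self.outputDir, 'JPEG')


if __name__ == "__main__":
    dir_path = os.getcwd() + '/Dataset/samples'
    for…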
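
The class above saves every page to the single path held in self.outputDir, which is fine for one-page PDFs. For multi-page files, one possible variant is to give each page its own file name. This is only a sketch: `page_image_paths` and `pdf_to_images` are hypothetical helpers, the `_page<N>` naming scheme is an assumption, and pdf2image itself has to be installed with pip and relies on the poppler utilities being available on the system.

```python
from pathlib import Path
from typing import List


def page_image_paths(pdf_name: str, page_count: int, out_dir: str) -> List[Path]:
    """Build one output path per page, e.g. report_page1.jpg (assumed scheme)."""
    stem = Path(pdf_name).stem  # file name without its final extension
    return [
        Path(out_dir) / f"{stem}_page{number}.jpg"
        for number in range(1, page_count + 1)
    ]


def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 500) -> List[Path]:
    """Render each page of a PDF to its own JPEG and return the paths."""
    # Imported here so the naming helper can be used without pdf2image.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)
    targets = page_image_paths(Path(pdf_path).name, len(pages), out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for page, target in zip(pages, targets):
        page.save(target, "JPEG")
    return targets
```

Returning the list of written paths makes it easy to feed each page image into the OCR step afterwards.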
