OCR — Tesseract with Image Pre-processing

Published in

The Startup

4 min read · May 21, 2020

Let’s discuss how to work with Tesseract, which is very useful for OCR (optical character recognition) but won’t give good results without some image pre-processing. So let’s see how it can be done. I will try to keep it as simple as possible.

Let’s install the required libraries first.

# Installing tesseract
$ sudo apt-get install tesseract-ocr
# Installing other important libraries
$ pip install opencv-python
$ pip install pillow
$ pip install pytesseract
# Needed for the PDF-to-image step below
$ pip install pdf2image
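Before going further, it can help to confirm that the installs above actually worked. The sketch below is only one way to do that and is not part of the original setup: the function name `check_ocr_setup` is hypothetical, and it assumes the usual import names (`cv2` for opencv-python, `PIL` for pillow) and that the Tesseract executable is called `tesseract` on your PATH.

```python
import importlib.util
import shutil

# Import names differ from the pip package names:
# opencv-python -> cv2, pillow -> PIL, pytesseract -> pytesseract
REQUIRED_MODULES = ["cv2", "PIL", "pytesseract"]


def check_ocr_setup(modules=REQUIRED_MODULES, binary="tesseract"):
    """Return a dict mapping each requirement to True (found) or False (missing).

    Hypothetical helper for illustration; it only checks presence, not versions.
    """
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # pytesseract is only a wrapper, so the tesseract executable must be on PATH
    status[binary] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for name, found in check_ocr_setup().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

If anything is reported as missing, re-run the matching install command before moving on.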

In my case, I have a collection of PDF files. I first tried libraries such as textract to extract the text directly from the PDFs, but that didn’t help. So I decided to convert each PDF to an image with the pdf2image library and save it in a folder for the remaining steps.

from pdf2image import convert_from_path
import os


class ExtractFeatures:
    def __init__(self, filename):
        dir_path = os.getcwd() + '/Dataset/samples'
        self.full_path = dir_path + '/' + filename
        self.filename = filename
        # Output image path: same base name as the PDF, with a .jpg extension
        self.outputDir = os.getcwd() + '/Dataset/sample_images/' + filename.split('.')[0] + '.jpg'
        # Render every page of the PDF as an image at 500 DPI
        pages = convert_from_path(self.full_path, 500)
        for page in pages:
            # Every page is written to the same path, so for a multi-page
            # PDF only the last page remains on disk
            page.save(self.outputDir, 'JPEG')


if __name__ == "__main__":
    dir_path = os.getcwd() + '/Dataset/samples'
    for…
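One thing to watch in the conversion step: the code saves every page to a single output path, which is fine for one-page PDFs but keeps only the last page of longer ones. If your PDFs have several pages, a possible workaround is to build one path per page. This is a minimal sketch, not the method used in this post; the helper name `page_output_paths` and the `_page_N` naming scheme are assumptions made for illustration.

```python
import os


def page_output_paths(pdf_filename, page_count, out_dir):
    """Build one JPEG path per page, numbered from 1.

    Hypothetical helper: the '<name>_page_<n>.jpg' pattern is an assumption.
    """
    # splitext drops only the final extension, so 'scan.v2.pdf' -> 'scan.v2'
    stem = os.path.splitext(os.path.basename(pdf_filename))[0]
    return [
        os.path.join(out_dir, f"{stem}_page_{number}.jpg")
        for number in range(1, page_count + 1)
    ]


# Possible use with the list returned by pdf2image's convert_from_path:
# for page, path in zip(pages, page_output_paths(filename, len(pages), out_dir)):
#     page.save(path, 'JPEG')
```

Note that `os.path.splitext` behaves differently from `filename.split('.')[0]` when a file name contains more than one dot, so pick whichever matches your file naming.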


Written by Wired Wisdom