Pytesseract — an optical character recognition library for Python

Dr. Ananth G S
3 min read · Jan 16, 2024


For many of us, Jan 22nd, 2024 is a very auspicious day. The consecration (ಪ್ರತಿಷ್ಠಾಪನೆ / प्रतिष्ठापन) of Lord Rama will take place at his Janma Bhoomi, Ayodhya, and the ceremony will be inaugurated by the Indian Prime Minister, Sri Modiji.

Coming back home after a tiring day and sitting down in front of the computer again takes some motivation. Thanks to my mother for providing it!

The moment I spoke to her, she told me that she had written a song in Kannada on a sheet of paper and wanted me to convert that handwriting into typed Kannada text, so that she could share it with a WhatsApp group of 700+ people. These people are chanting, speaking and doing everything only about Lord Rama!!

So I asked her for the sheet of paper, and it was neatly written in bold, striking Kannada characters. She quickly scanned it and sent me the image.

What next?? …

I sat in front of my computer and did a quick Google search for FOSS Optical Character Recognition (OCR) tools.

The results came back quickly, along with Google's AI-generated summary, and with some help from ChatGPT I came to know about a free and open source library for Python called pytesseract. Later on I figured out that this library is a Python wrapper for Google’s Tesseract OCR Engine.

For a very brief history about this Engine…

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol UK and at Hewlett-Packard Co, Greeley Colorado USA between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.

The latest version of the engine is 5.3.3, released in October 2023. It supports a wide variety of languages, which are listed in its README.

Now for the most important part: the code.

I work on Ubuntu/Debian-based platforms, so the following commands may differ on non-Debian platforms.

Since it's a Python library, I don't want to risk modifying my base setup, so I start by creating a custom conda environment with Python 3.10:

conda create -n kanimg python=3.10

Next, I need to install the Tesseract OCR engine itself, together with its Kannada language data, with the command:

sudo apt install tesseract-ocr tesseract-ocr-kan

Now activate the “kanimg” environment (there was no need to activate it before this step, since the OCR package is installed globally at the OS level) with the command:

conda activate kanimg

Now install the wrapper library…

pip install pytesseract
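
Before going further, it can help to confirm that the engine actually sees the Kannada data. The sketch below is one way to do that, under a few assumptions: the helper names `parse_langs` and `installed_langs` are made up for illustration, the `tesseract` binary is assumed to be on your PATH, and the parsing assumes the usual `tesseract --list-langs` layout of one header line followed by one language code per line.

```python
import subprocess


def parse_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, which starts with 'List of available languages'.
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs(binary: str = "tesseract") -> list[str]:
    """Run the Tesseract CLI and return the language codes it reports."""
    result = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=True
    )
    # Read both streams, since the list may not always arrive on stdout.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

Once the language package is installed, `"kan" in installed_langs()` should evaluate to `True`. If it does not, the OCR call later on will most likely fail to load the Kannada data.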

Lastly, rename the downloaded Kannada image to a proper filename (in my case “rama.jpeg”) and run the Python code below:

# import libraries..
from PIL import Image
import pytesseract

# Load the file...
img_path = 'rama.jpeg'

# Open the image file with PIL
img = Image.open(img_path)
# convert to grayscale..
gray = img.convert('L')

# A crucial step: tell pytesseract where the tesseract executable lives...
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

text = ''
try:
    # Run OCR on the grayscale image using the Kannada language data
    text = pytesseract.image_to_string(gray, lang='kan')
    print(text)
except pytesseract.TesseractError as e:
    print(f"An error occurred during OCR: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

# If need be, redirect this script's output to a file straight away
# and edit it in a word processor for further use.

The text variable now holds your output: the recognized Kannada text. Mission accomplished!!!
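
If you would rather save the result from Python than redirect the script's output, a small helper like the one below could do it. This is only a sketch: the function name `save_text` and the default filename `rama.txt` are assumptions chosen for illustration.

```python
from pathlib import Path


def save_text(text: str, out_path: str = "rama.txt") -> Path:
    """Write recognized text to a UTF-8 file and return the file's path."""
    path = Path(out_path)
    # Explicit UTF-8 keeps Kannada characters intact regardless of the OS default.
    path.write_text(text, encoding="utf-8")
    return path
```

Calling `save_text(text)` after the OCR step leaves a file that any word processor with Kannada font support should be able to open.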

Note: the code executed properly only after the path to the Tesseract executable (/usr/bin/tesseract) was set.
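
The hard-coded path works on my machine, but the executable may live elsewhere on yours. One possible alternative, sketched below, is to look the binary up on PATH with the standard library's `shutil.which` and fall back to the hard-coded location only if nothing is found. The helper name `resolve_tesseract` is hypothetical.

```python
import shutil


def resolve_tesseract(name: str = "tesseract",
                      fallback: str = "/usr/bin/tesseract") -> str:
    """Return the executable's full path from PATH, or the fallback if not found."""
    # shutil.which returns None when the name is not found on PATH.
    return shutil.which(name) or fallback
```

With this in place, the assignment in the script could become `pytesseract.pytesseract.tesseract_cmd = resolve_tesseract()`.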

There were some errors in the output where the OCR misread characters. You should correct such instances by hand.
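
To find the likely trouble spots faster, pytesseract can also report a confidence value per word: `pytesseract.image_to_data(img, lang='kan', output_type=pytesseract.Output.DICT)` returns a dictionary of parallel lists that includes `text` and `conf`. The sketch below filters that dictionary for low-confidence words. The helper name `low_confidence_words`, the threshold of 60, and the sample data are all assumptions for illustration, not values from my run.

```python
def low_confidence_words(data: dict, threshold: float = 60.0) -> list[tuple[str, float]]:
    """Return (word, confidence) pairs whose confidence is below the threshold."""
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # coerce defensively; values may be int, float, or str
        # Entries with conf == -1 are layout boxes (blocks, lines), not words.
        if word.strip() and 0 <= conf < threshold:
            flagged.append((word, conf))
    return flagged


# Hypothetical data in the shape that image_to_data returns with Output.DICT.
sample = {"text": ["", "ರಾಮ", "ಜಯ"], "conf": [-1, 91.5, 42.0]}
print(low_confidence_words(sample))  # [('ಜಯ', 42.0)]
```

The flagged words are only candidates for review; a high confidence value does not guarantee that a word was read correctly.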

That's it. I hope you enjoyed reading about this cute little library…

Till next time. Adieu!!!


Dr. Ananth G S

I have a doctorate in RecSys. I am a passionate FOSS lover and have used it full time for many projects over the past 20+ years.