Performing Optical Character Recognition with Python and Pytesseract using Anaconda

Pranav Manoj
Published in Analytics Vidhya
5 min read · Aug 31, 2020

What is Pytesseract and how reliable is it?

Pytesseract is a Python wrapper for Tesseract, an open-source OCR engine whose development Google sponsored for years.

That one line should most probably leave you extremely pleased. I mean, come on. Google? And OCR? That’s the point when you know it’s good.

That’s nice and all, but how do I get it up and running?

Ok, time to start downloading stuff.

I’m writing this article assuming you’re using Anaconda on Windows, and trust me, it’s significantly easier to set things up with Anaconda than to do it manually with pip. There’s just so much that can go wrong.

So first things first, let’s get our hands on the OCR engine itself!

Head over to https://github.com/UB-Mannheim/tesseract/wiki and get the 32-bit or 64-bit version depending on your system architecture. If you don’t know which one to get, open your computer settings (Windows key + I on Windows) and type About.

After it’s done downloading, install it like a regular program (by double-clicking and following the on-screen instructions). Now open the folder where Tesseract was installed (the one containing tesseract.exe) and press Control + L to select the folder path. Now press Control + C to copy it. We’re gonna be needing that.

Once that’s done, type system variables in the Windows search box and hit Enter when it says Edit the system environment variables.

The System Properties dialog box should pop open.

Now select Environment Variables.

Please be extremely careful here. You’re going to be editing your System Variables and if you mess up your computer goes FUBAR. I’m not kidding. But no pressure ;)

Edit system variables

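Select Path, then click Edit, not New. New creates a brand-new variable instead of opening the existing Path, and if you overwrite or delete what’s already in Path, other programs may stop being found. I’ve no clue how to undo that damage for you, so be careful.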

I’ve stored mine in the D drive, but your path may be different (probably C)

In the edit window that opens, select New and paste in the path we copied earlier. Hit OK.

Yay, we’re done with the tricky part! Good job!
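If you want to double-check that the Path edit worked, here is a minimal sketch that looks for the tesseract executable on your Path and asks it for its version. The function names find_tesseract and tesseract_version are my own, made up for this illustration. Open a fresh terminal before running it, since programs only read environment variables when they start.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first non-empty line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some builds print the version to stderr instead of stdout.
    output = result.stdout or result.stderr
    for line in output.splitlines():
        if line.strip():
            return line.strip()
    return ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; re-check the folder you added.")
    else:
        print("Found:", path)
        print(tesseract_version(path))
```

If it reports that tesseract was not found, the folder you pasted into Path is probably not the one that contains tesseract.exe.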

Getting the dependencies from Anaconda

Open up Anaconda Navigator and click on the Environments tab. Select Create and name it whatever you want, or just use whatever pre-existing environment you have, that’s up to you. However, if you create another environment remember to activate it using

conda activate your-environment-name 

before you run the code.

Now, in the drop-down menu on the top left click on where it says Installed and change it to All.

Next, in the search box type in pytesseract, and tick the little box on its left. Do the same for tesseract, tesserocr and goslate.

If you look at the bottom of the page you should see a green box saying Apply. Click it and when it asks for confirmation say Yes.
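Once Anaconda finishes installing, a quick sanity check can save you some confusion later. The sketch below only asks Python whether the modules can be found in the active environment; it doesn't run any OCR. The REQUIRED mapping and the missing_packages helper are names I made up, and I'm assuming Pillow (imported as PIL) came along as a dependency of pytesseract.

```python
from importlib.util import find_spec

# Package name as shown in Anaconda -> module name used in `import`.
# Pillow is an assumption: it normally arrives as a pytesseract dependency.
REQUIRED = {
    "pytesseract": "pytesseract",
    "Pillow": "PIL",
    "goslate": "goslate",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose modules cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All set, everything is importable.")
```

Run it from the environment you just set up (after conda activate, if you created a new one).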

Finally, code time!

Whew! A little bit of setting up, eh? All worth it in the end, I’ll tell you.

Open up a new .py file and call it whatever you want. Paste in the following code:

from PIL import Image
import pytesseract

# Put the path to tesseract.exe on your computer here.
pytesseract.pytesseract.tesseract_cmd = r"D:\Tesseract\tesseract.exe"

image = Image.open('new.png')  # take in the image
grabbed = pytesseract.image_to_string(image, lang='eng')
print(grabbed)
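If you’d rather get a clear error when the image file is missing, here is one way the same steps could be wrapped in a function. This is just a sketch: ocr_image and its parameters are my own names, not part of pytesseract, and the D:\Tesseract path in the usage comment is only my install location.

```python
from pathlib import Path


def ocr_image(image_path, tesseract_cmd=None, lang="eng"):
    """Run OCR on one image and return the recognized text."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")

    # Imported here so the file check above works even if the
    # OCR packages are not installed yet.
    from PIL import Image
    import pytesseract

    if tesseract_cmd:
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)


# Example usage (assumes new.png sits next to this script):
# print(ocr_image("new.png", tesseract_cmd=r"D:\Tesseract\tesseract.exe"))
```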

Here’s a sample image for you to try out. Save it as new.png in the same directory as the python file we wrote above.

A sample image for you to test out the engine

Did you feel that was too easy for the engine? Well, that’s exactly what I thought. Here’s a tougher one. Remember to save it as new.png to go along with the code I’ve written, or change the file name in the code if you save it under a different name.

A tougher sample image
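If the tougher image comes back a bit garbled, one common trick is to turn it into pure black and white before handing it to the engine. The sketch below is not something the code above does; threshold_table and binarize are hypothetical helpers, and the cutoff of 150 is an arbitrary starting value you’d likely have to tune per image.

```python
def threshold_table(threshold=150):
    """Build a 256-entry lookup table: values above the threshold become white."""
    return [255 if value > threshold else 0 for value in range(256)]


def binarize(image, threshold=150):
    """Return a black-and-white copy of a PIL image.

    `image` is expected to be a PIL.Image.Image; it is converted to
    grayscale ("L" mode) and each pixel is mapped through the lookup table.
    """
    return image.convert("L").point(threshold_table(threshold))


# Example usage (assumes Pillow and pytesseract are installed):
# from PIL import Image
# import pytesseract
# cleaned = binarize(Image.open("new.png"))
# print(pytesseract.image_to_string(cleaned, lang="eng"))
```

Whether this helps depends a lot on the image, so treat it as something to experiment with rather than a guaranteed fix.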

Cool, no? But what else can you do with it? How about some text translation? :)

Translation

If you paid close attention to the downloads you’ll see we downloaded goslate, but didn’t use it in the code above. This happens to be an unofficial Python library that talks to Google Translate. How convenient... what would we do without Google? Sigh.

Let’s write some code to implement translation.

from PIL import Image
import pytesseract
import goslate

gs = goslate.Goslate()

# Put the path to tesseract.exe on your computer here.
pytesseract.pytesseract.tesseract_cmd = r"D:\Tesseract\tesseract.exe"

image = Image.open('new.png')
# lang='deu' may read German text better, if the German language data is installed.
grabbed = pytesseract.image_to_string(image, lang='eng')
processed = gs.translate(grabbed, 'en')  # translate the recognized text into English
print('\n')
print(processed)

Here’s an image to try the code on.

This is German for ‘You are women and we are Men.’

Mind you, goslate only lets you make around 5 requests, after which any requests made from your IP won’t be answered for a cooldown period of 10 minutes. So, it might be a good idea to use a VPN to make more requests.

To get around this, I’ve thought of using web scraping to scrape the translation directly off Google, but that seemed a little hacky and is a story for another day :)
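In the meantime, a gentler way to stretch those few requests is to avoid asking for the same translation twice. Here’s a minimal sketch of an in-memory cache wrapped around any translate function; make_cached_translator is a made-up helper, and it only assumes the wrapped function takes the text and a target language, like gs.translate does in the code above.

```python
def make_cached_translator(translate):
    """Wrap a translate(text, target_language) function with a simple cache."""
    cache = {}

    def cached(text, target_language):
        key = (text, target_language)
        if key not in cache:
            # Only this branch spends one of the limited requests.
            cache[key] = translate(text, target_language)
        return cache[key]

    return cached


# Example usage (assumes goslate is installed):
# import goslate
# translate = make_cached_translator(goslate.Goslate().translate)
# print(translate("Guten Morgen", "en"))
# print(translate("Guten Morgen", "en"))  # answered from the cache
```

The cache lives only as long as the script runs, so it helps most when you translate several images with repeated text in one go.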

A mod to improve the experience

Here’s another block of code which will write all the output to a text file instead of the terminal or wherever you run the code:

from PIL import Image
import pytesseract

# Put the path to tesseract.exe on your computer here.
pytesseract.pytesseract.tesseract_cmd = r"D:\Tesseract\tesseract.exe"

image = Image.open('try.png')
image_to_text = pytesseract.image_to_string(image, lang='eng')
print(image_to_text)

# Write the extracted text to a file; the with block closes it automatically.
with open("extracted.txt", "w") as doc:
    doc.write(image_to_text)

input("Press Enter to exit")  # keeps the console window open until you press Enter
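The raw text can come out a little messy, with trailing spaces, runs of blank lines and sometimes a stray form feed character at the end. If that bothers you, here’s a small cleanup sketch you could run on the text before writing it to the file. clean_ocr_text is a name I made up; it isn’t part of pytesseract.

```python
def clean_ocr_text(raw):
    """Tidy OCR output: drop form feeds, trailing spaces and repeated blank lines."""
    text = raw.replace("\x0c", "")  # form feed that may mark the end of a page
    lines = [line.rstrip() for line in text.splitlines()]

    cleaned = []
    for line in lines:
        # Keep a blank line only if the previous kept line was not blank.
        if line or (cleaned and cleaned[-1]):
            cleaned.append(line)

    return "\n".join(cleaned).strip()


# Example usage with the variable from the code above:
# doc.write(clean_ocr_text(image_to_text))
```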

Wrap up

Well, that’s it for this time. If you have any doubts, feel free to ask away in the comments section, and I’ll be happy to help out. Peace.
