How does Optical Character Recognition (OCR) work?

What is OCR?

You know those super-duper important documents your boss gave you that are also taking up lots of physical space on your desk? Sure, you could scan them into your computer, but what if those documents had a typo that needed to be fixed? Or a data table you need to use? You could transcribe them yourself, but literally, why would you do that when you have OCR!

Optical character recognition solves the problem of getting computers to read non-digital writing and transcribe it into something you can edit on a computer. There’s tons of different OCR tech online that you can use.

So I know you’re thinking, “Cool, I’ve got this magic OCR thing but I need to know how it works — and also, how are you reading my mind?”

One thing you need to know before I tell you about OCR is that the 3-pound ball of fat sitting in your skull is the smartest and most complex computer in the world. Meanwhile, the device you’re reading this on isn’t!

What I mean to say is that when people build this kind of software, they have to break down the entire process of vision into building blocks that a computer can understand — meanwhile your brain kinda just does that by default.


The first thing OCR does is use a binarization method. (There’s some serious math behind it, if you want to be confused or enlightened.) This method helps OCR software separate an image’s pixels into a bimodal histogram. A what???? A graph with two peaks, one for each of the two colors that dominate the page!

Breakdown

OCR software starts by counting how many pixels of each color occur on the page. In our example, we’ve got some black text on a white piece of paper.

That two-peaked graph is how we humans read what the computer is doing, but what does the graph look like to a computer? An array! So our graph would look something like this to a computer:

[2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,2,2,2,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2]
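If you’re wondering where an array like that comes from, here’s a minimal sketch of the counting step (Python with NumPy, both my own choice, plus a made-up twelve-pixel ‘page’, none of it from the original post):

import numpy as np

# A tiny made-up grayscale "page": 0 is black ink, 255 is white
# paper, and values in between are smudges.
page = np.array([
    [250, 252,  12, 251],
    [249,  10,   8, 253],
    [251, 248, 250, 252],
], dtype=np.uint8)

# Count how many pixels fall into each intensity bucket (0..255).
counts, _ = np.histogram(page, bins=256, range=(0, 256))

# With black text on white paper, the counts pile up at the two
# ends: one peak near 0 (the ink) and one near 255 (the paper).
print(counts.nonzero()[0])  # the intensities that actually occur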

OCR finds the valley of this graph (the lowest point between the two peaks, which in our array is that dip down to the 1s in the middle) and sets it as a threshold that will differentiate between the two modes. So our OCR software ends up with two values that correspond to black and white respectively.
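Here’s a rough sketch of that valley hunt (my own simplification in Python; real OCR engines use fancier math, like Otsu’s method, but the spirit is the same):

def valley_threshold(counts):
    # Crude assumption: the histogram is bimodal and one peak
    # lives in each half, like our black-on-white example.
    mid = len(counts) // 2
    left_peak = max(range(mid), key=lambda i: counts[i])
    right_peak = max(range(mid, len(counts)), key=lambda i: counts[i])
    # The threshold sits at the lowest count between the two peaks.
    return min(range(left_peak, right_peak + 1), key=lambda i: counts[i])

Feed it the array above and it lands right in that run of 1s in the middle: everything darker than the cutoff counts as ink, everything lighter counts as paper.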


What this does is filter out any ‘noise’ surrounding our black text, like shadows or dust, and reassign every pixel in the image to whichever of the two colors it most closely matches.
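In code, that reassignment is basically one line. A sketch, again assuming NumPy and a grayscale image like the toy page from earlier:

import numpy as np

def binarize(image, threshold):
    # Anything darker than the threshold snaps to pure black (0);
    # everything else (paper, light shadows, specks of dust) snaps
    # to pure white (255). No more in-between grays.
    return np.where(image < threshold, 0, 255).astype(np.uint8)

# A dusty pixel at 180 and a shadowed one at 200 both snap back to
# clean white; the dark ink pixels (12 and 8) snap to solid black.
noisy = np.array([[12, 180], [200, 8]], dtype=np.uint8)
print(binarize(noisy, 128))  # [[0, 255], [255, 0]]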

So what does our software do with this now definitively black text and white background? The words on your screen mean something to you and me because we can read them. But to a computer, these are still just pixels.

What happens next is the software analyzes characteristics of the text, sorts the text into character-shape codes, and assigns a confidence number to each letter (how confident the computer is that it has matched that shape to a known letter). The letters with the highest confidence numbers then get compared to a dictionary of words, and the dictionary word that best matches our collection of letters is chosen.
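To make that concrete, here’s a toy sketch with made-up letters, confidence numbers, and a four-word ‘dictionary’ (all hypothetical; this is the flavor of the idea, not any real OCR engine’s code). We score every combination of candidate letters and keep the dictionary word with the highest total confidence:

from itertools import product

# Hypothetical per-position candidates: (letter, confidence 0..1).
# Say the scanner saw something like "cat" but the 'a' was smudged.
candidates = [
    [("c", 0.98), ("e", 0.40)],
    [("a", 0.55), ("o", 0.52), ("q", 0.30)],
    [("t", 0.95), ("f", 0.35)],
]

# A tiny stand-in for the real word dictionary.
dictionary = {"cat", "cot", "eat", "cut"}

def best_word(candidates, dictionary):
    best, best_score = None, -1.0
    # Try every combination of candidate letters...
    for combo in product(*candidates):
        word = "".join(letter for letter, _ in combo)
        # ...but only keep combinations that spell a real word.
        if word in dictionary:
            score = sum(conf for _, conf in combo)
            if score > best_score:
                best, best_score = word, score
    return best

print(best_word(candidates, dictionary))  # prints "cat"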

So what if there are other colors on the page? This is where multilevel or adaptive thresholding comes in (aka more math). With multilevel thresholding, our histograms have multiple peaks and we pick a threshold between each pair of peaks; with adaptive thresholding, the threshold changes from one region of the image to the next, which helps with uneven lighting. Either way, we render a cleaner image and then do the whole shebang I just described.
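Here’s a minimal sketch of the adaptive flavor (my own Python/NumPy take, with made-up window and offset parameters; real libraries ship much faster, vectorized versions). Instead of one global cutoff, each pixel gets compared to the average brightness of its own little neighborhood:

import numpy as np

def adaptive_binarize(image, window=15, offset=10):
    # Threshold each pixel against the mean brightness of its own
    # neighborhood instead of one global cutoff.
    h, w = image.shape
    half = window // 2
    # Pad the edges so every pixel has a full window around it.
    padded = np.pad(image.astype(np.float64), half, mode="edge")
    out = np.empty_like(image)
    for y in range(h):
        for x in range(w):
            # Average brightness of the window centered on (y, x).
            local_mean = padded[y:y + window, x:x + window].mean()
            # Darker than the local average (minus a small fudge
            # factor) counts as ink; otherwise it's background.
            out[y, x] = 0 if image[y, x] < local_mean - offset else 255
    return out

That way a word sitting in a shadow still reads as ink, because it only has to be darker than its own shadowy surroundings.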


What uses OCR besides software made for documents? Glad you asked!

Ibotta

There’s a great app named Ibotta that pairs with grocery stores to give you rebates (FREE CASH!) on what you buy. The app uses OCR technology to read your receipt and validate that the rebate they offered matches what you bought. It also takes the date into account, so put your old receipts back in that drawer you never use. Use my referral code rmevhkm and you’ll get a 10-dollar bonus when you sign up!

Google Translate

Google Translate has an additional feature where you can take pictures of text and/or signs in one language and translate them into another. It does all of that good old OCR stuff I mentioned, in addition to comparing word shapes to a library of words in 27 different languages!


A super cool and maybe terrifying note from my classmate Cory Pavitt:

You know those (not annoying at all) CAPTCHA boxes you fill out when you have to convince a website that you’re not a robot?

OCR technology is getting better as time goes on, so bots are actually able to transcribe and fill out those forms sometimes. Spooky!

#watchout

Thus concludes the first edition of ‘I-Barely-Know-Anything-About-This-But-Here-Goes-Nothing’!

✌🏽