An Introduction to Optical Character Recognition (OCR)

Oct 21, 2022

My last note shared how models can learn from real-time data with online learning. This note shares how a system of machine learning models can be used to identify and classify text within an image, a process known as optical character recognition (OCR).

What OCR Does

For example, OCR can take a picture of the handwritten words “hello world” and give the computer the actual text “hello world” instead of just a picture that the computer can’t understand.

System of Models

OCR uses a series of machine learning models to turn a picture with text in it into text characters that a computer can read and manipulate. It does this with a three-step process, where each step uses different models to make different predictions:

1. Text Detection: Identifies where there is text in a photo.
2. Character Segmentation: Separates out the distinct characters in the identified text.
3. Character Recognition: Translates images of characters into their intended character.
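
To make the data flow concrete, here is a minimal sketch of how the three steps could be chained. The function name `run_ocr` and its three callable parameters are hypothetical, not taken from any library, and the toy stand-ins at the bottom exist only so the sketch runs. A real system would plug a trained model into each slot.

```python
from typing import Callable, List


def run_ocr(image, detect_text: Callable, segment_characters: Callable,
            recognize_character: Callable) -> List[str]:
    """Chain the three steps and return one string per detected text region."""
    results = []
    for region in detect_text(image):                # step 1: find text regions
        characters = segment_characters(region)      # step 2: split into characters
        text = "".join(recognize_character(c) for c in characters)  # step 3
        results.append(text)
    return results


# Toy stand-ins (assumptions for illustration): the "image" is already a list
# of strings, so each "region" is a string and each "character image" is a letter.
toy_image = ["hello", "world"]
recognized = run_ocr(
    toy_image,
    detect_text=lambda image: image,
    segment_characters=list,
    recognize_character=str.upper,
)
print(recognized)
```

The point of the sketch is only the shape of the pipeline: the output of each step is the input of the next, so an error in an early step carries through to the final text.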

Sliding Windows

Steps 1 & 2 make use of a concept called “sliding windows”. This is where a “sliding window classifier” moves across the image, one position at a time, and gives a prediction of what is inside the window at each position.
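
As a rough illustration of the idea, the sketch below slides a fixed-size window over a small grayscale image stored as a list of rows and calls a classifier on each patch. The names `sliding_windows`, `classify_windows`, and `mean_intensity` are made up for this example, and `mean_intensity` is only a placeholder for a trained model.

```python
from typing import Callable, Iterator, List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def sliding_windows(image: Image, win_h: int, win_w: int,
                    stride: int = 1) -> Iterator[Tuple[int, int, Image]]:
    """Yield (row, col, patch) for each window position, top-left to bottom-right."""
    height, width = len(image), len(image[0])
    for row in range(0, height - win_h + 1, stride):
        for col in range(0, width - win_w + 1, stride):
            patch = [r[col:col + win_w] for r in image[row:row + win_h]]
            yield row, col, patch


def classify_windows(image: Image, win_h: int, win_w: int,
                     classifier: Callable[[Image], float],
                     stride: int = 1) -> List[Tuple[int, int, float]]:
    """Run the classifier on every window and keep its position with the score."""
    return [(row, col, classifier(patch))
            for row, col, patch in sliding_windows(image, win_h, win_w, stride)]


def mean_intensity(patch: Image) -> float:
    """Placeholder classifier (an assumption): average brightness of the patch."""
    return sum(map(sum, patch)) / (len(patch) * len(patch[0]))


image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
scores = classify_windows(image, win_h=2, win_w=2, classifier=mean_intensity)
for row, col, score in scores:
    print(row, col, score)
```

A larger `stride` means fewer windows to classify, which is faster but gives a coarser picture of where things are in the image.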

Step 1: Text Detection

The “sliding window” moves across the image and, at each position, predicts the probability that the window contains text. This uses a model that’s been trained to classify whether or not a picture includes text. The result is a grayscale representation of the image where the portions that most likely contain text are highlighted. The high-probability text areas are then expanded outward so that nearby areas merge into larger contiguous groups of likely text. Finally, each group is enclosed in a rectangle that is passed to step 2.
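
The post-processing described above (highlight, expand, draw rectangles) could be sketched as follows. This is one possible way to do it, not necessarily how any particular OCR system does: the 0.5 cutoff, the square expansion with `radius=1`, the 4-connected grouping, and all function names are assumptions made for the example, and `prob_map` is made-up data standing in for the classifier's output.

```python
from collections import deque


def threshold(prob_map, cutoff=0.5):
    """Mark each cell True where the text probability reaches the cutoff."""
    return [[p >= cutoff for p in row] for row in prob_map]


def expand(mask, radius=1):
    """Expand every True cell outward by `radius` cells in all directions."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = True
    return out


def bounding_rectangles(mask):
    """Return (top, left, bottom, right), inclusive, for each connected group."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first walk over the group
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Made-up probabilities: two nearby high-probability cells in the middle row.
prob_map = [
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.1, 0.8, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
]
boxes_before = bounding_rectangles(threshold(prob_map))
boxes_after = bounding_rectangles(expand(threshold(prob_map), radius=1))
print(boxes_before)
print(boxes_after)
```

Without the expansion step the two detections stay separate. After expanding, they merge into a single group and therefore a single rectangle, which is the behavior the paragraph describes.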

Step 2: Character Segmentation

The “sliding window” moves across each rectangle of likely text and predicts, at each position, whether or not the window contains a character segmentation (i.e., the space between two characters in a word). This uses a model that’s been trained to identify character segmentations. If the window includes a segmentation, then y=1. If it includes a character or a blank space, then y=0. The positions where y=1 mark where the rectangle should be cut into individual characters.
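
Here is a small sketch of that one-dimensional sweep. The rule inside `toy_is_split` is a hand-written stand-in for a trained model (an assumption for illustration only), and `find_split_columns` and `cut_characters` are hypothetical helper names. The text region is a tiny binary grid where 1 means ink.

```python
def toy_is_split(window):
    """Stand-in for a trained model: y=1 when the center column is empty
    and there is ink on both sides of it, otherwise y=0."""
    width = len(window[0])
    mid = width // 2
    ink = [any(row[col] for row in window) for col in range(width)]
    return 1 if (not ink[mid] and any(ink[:mid]) and any(ink[mid + 1:])) else 0


def find_split_columns(region, win_w, is_split):
    """Slide a window left to right and record the center column where y=1."""
    width = len(region[0])
    splits = []
    for left in range(width - win_w + 1):
        window = [row[left:left + win_w] for row in region]
        if is_split(window) == 1:
            splits.append(left + win_w // 2)
    return splits


def cut_characters(region, splits):
    """Cut the region at each split column, dropping the split column itself."""
    width = len(region[0])
    pieces, start = [], 0
    for split in splits + [width]:
        if split > start:
            pieces.append([row[start:split] for row in region])
        start = split + 1
    return pieces


# Two "characters" separated by one empty column, followed by blank space.
region = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 0],
]
splits = find_split_columns(region, win_w=3, is_split=toy_is_split)
pieces = cut_characters(region, splits)
print(splits)
print(len(pieces))
```

Note that the window over the trailing blank columns gets y=0, matching the rule that blank space is not a segmentation.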

Step 3: Character Recognition

Now that step 1 has identified the likely text in the image, and step 2 has separated each character, the final step in our system uses a classification model that has been trained to classify a picture of a single character as the character it represents.
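
To show the shape of this last step, the sketch below classifies tiny 3x3 character images by comparing them with stored templates and picking the closest one. Template matching is a deliberately simple stand-in for a trained classification model, and `TEMPLATES`, `pixel_differences`, `recognize_character`, and `recognize_text` are names invented for this example.

```python
# Hypothetical 3x3 templates (1 = ink) for a three-letter alphabet.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}


def pixel_differences(a, b):
    """Count the pixels at which two same-sized images disagree."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def recognize_character(patch):
    """Return the label of the template that differs from the patch the least."""
    return min(TEMPLATES, key=lambda label: pixel_differences(patch, TEMPLATES[label]))


def recognize_text(patches):
    """Classify each character image in order and join the predictions."""
    return "".join(recognize_character(patch) for patch in patches)


# An "L" with one missing pixel still lands closest to the "L" template.
noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
text = recognize_text([noisy_l, TEMPLATES["I"], TEMPLATES["T"]])
print(text)
```

A trained model plays the same role as `recognize_character` here: it takes one character image and returns one label, so the character images from step 2 can be turned into a string one at a time.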

This is how machine learning can be used to recognize text in an image and turn it into text that a computer can understand!

Up Next

The next note in this series will look at how you can assess the performance of different parts of a machine learning system using ceiling analysis.