Introduction to OCR

Note: This is simply a summary with many quotes directly from those articles.

Tesseract was created by HP and released as open source in late 2005. Since HP had independently developed page layout analysis, Tesseract did not include this process, and it assumes its input is a binary image with optional polygonal text regions defined.

Stage 1: Connected Component Analysis

The first step Tesseract takes is a connected component analysis in which outlines of the components are stored. By inspecting the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text. Afterwards, the outlines are gathered together, purely by nesting, into Blobs.
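To make the idea of a connected component concrete, here is a minimal sketch in Python. It is not Tesseract's implementation: Tesseract stores outlines and inspects their nesting, while this sketch only groups touching foreground pixels with a flood fill and reports a bounding box for each group. The function names `connected_components` and `bounding_box`, the use of 8-connectivity, and the convention that 1 means foreground are all assumptions made for illustration.

```python
from collections import deque


def connected_components(image):
    """Group 8-connected foreground pixels (value 1) of a binary image.

    `image` is a list of rows of 0/1 values. Returns a list of components,
    each a list of (row, col) pixel coordinates.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (left, top, right, bottom) of a component, inclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical 3x5 binary image with two separate shapes.
image = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
components = connected_components(image)
boxes = [bounding_box(p) for p in components]
```

The bounding boxes are the kind of per-component summary that later stages could use to group components into lines; detecting inverse text would need the outline nesting that this sketch leaves out.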

Stage 2: Find Text Lines and Words

Blobs are organized into text lines, and the lines and regions are analyzed for fixed pitch or proportional text. Text lines are broken into words differently according to the kind of character spacing. Fixed pitch text is chopped immediately by character cells. Proportional text is broken into words using definite spaces and fuzzy spaces.
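The distinction between definite and fuzzy spaces can be sketched as a simple gap test. The summary does not say how Tesseract measures gaps or picks its thresholds, so everything below is an assumption for illustration: the function name `split_into_words`, the two thresholds `definite_gap` and `fuzzy_gap`, and the choice to keep blobs together across a fuzzy gap while recording it for a later decision.

```python
def split_into_words(blobs, definite_gap, fuzzy_gap):
    """Split a text line into words using horizontal gaps between blobs.

    `blobs` is a list of (left, right) x-extents sorted left to right.
    A gap of at least `definite_gap` starts a new word. A gap between
    `fuzzy_gap` and `definite_gap` is left unresolved: the blobs stay in
    the same word and the index of the blob before the gap is recorded.
    Both thresholds are hypothetical parameters, not Tesseract settings.
    """
    if not blobs:
        return [], []
    words = [[blobs[0]]]
    fuzzy_after = []
    for i in range(1, len(blobs)):
        gap = blobs[i][0] - blobs[i - 1][1]
        if gap >= definite_gap:
            words.append([blobs[i]])      # definite space: new word
        else:
            if gap >= fuzzy_gap:
                fuzzy_after.append(i - 1)  # fuzzy space: decide later
            words[-1].append(blobs[i])
    return words, fuzzy_after


# Hypothetical line with gaps of 1, 4 and 8 pixels between blobs.
line = [(0, 4), (5, 9), (13, 17), (25, 29)]
words, fuzzy_after = split_into_words(line, definite_gap=6, fuzzy_gap=3)
```

In this example the 8-pixel gap splits the line into two words, and the 4-pixel gap after the second blob is flagged as fuzzy rather than resolved on the spot.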

Stage 3: Recognize Word Pass 1

In the first pass, an attempt is made to recognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as training data, which helps in recognizing text lower down the page.

Stage 4: Recognize Word Pass 2

Recognition is a two-pass process. Since the adaptive classifier may have learned something useful too late to make a contribution near the top of the page, a second pass is run over the page, in which words that were not recognized well enough are recognized again.
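The control flow of the two passes can be sketched as follows. This is only an illustration of the idea described above, not Tesseract code: `recognize_page`, the `threshold` parameter, the list used as the adaptive classifier's training data, and the toy recognizer `toy_recognize` with its made-up confidence values are all assumptions.

```python
def recognize_page(words, recognize, threshold):
    """Run two recognition passes over `words`.

    `recognize(word, training)` is a caller-supplied function returning
    (text, confidence). `training` stands in for the adaptive classifier's
    state and grows as satisfactory words are found in pass 1.
    """
    training = []
    results = []
    # Pass 1: recognize each word in turn; satisfactory words become training data.
    for word in words:
        text, confidence = recognize(word, training)
        results.append((text, confidence))
        if confidence >= threshold:
            training.append((word, text))
    # Pass 2: retry only the words that were not recognized well enough,
    # now with everything the adaptive classifier learned from the whole page.
    for i, word in enumerate(words):
        if results[i][1] < threshold:
            results[i] = recognize(word, training)
    return results


def toy_recognize(word, training):
    """Toy stand-in for a classifier: clean words are always easy, noisy
    words are only read confidently once some training data exists."""
    if word["clean"]:
        return word["text"], 0.9
    return word["text"], 0.8 if training else 0.4


# Hypothetical page: the first word is noisy and appears before any training data.
page = [
    {"text": "the", "clean": False},
    {"text": "quick", "clean": True},
    {"text": "fox", "clean": False},
]
results = recognize_page(page, toy_recognize, threshold=0.7)
```

Here the first word fails in pass 1 because nothing has been learned yet, the third word benefits from the training data gathered just before it, and pass 2 rescues the first word.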

Stage 5: Post-processing

The final phase resolves fuzzy spaces and checks alternative hypotheses for the x-height to locate small-cap text.

Now let's delve a bit deeper into these stages.

Line and Word Finding

Fixed Pitch Detection and Chopping

Fixed pitch refers to fonts in which every character has the same width. In contrast, proportional fonts give different characters different widths.

Tesseract tests the text lines to determine whether they are fixed pitch. Where it finds fixed pitch text, Tesseract chops the words into characters using the pitch, and disables the chopper and associator on those words for the word recognition step. See Fig. 2 in the first article.
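A rough sketch of the two ideas, testing for fixed pitch and chopping by pitch, is given below. The summary does not describe how Tesseract actually performs the test, so `is_fixed_pitch` is a crude stand-in that assumes a single word with one blob per character and checks whether the spacing between blob left edges is nearly constant. The function names, the `tolerance` parameter, and the pixel coordinates are all hypothetical.

```python
from statistics import median


def is_fixed_pitch(lefts, tolerance=0.1):
    """Crude test: are the steps between blob left edges nearly constant?

    `lefts` holds the left x-coordinate of each blob in one word, sorted.
    `tolerance` is the allowed relative deviation from the median step.
    """
    steps = [b - a for a, b in zip(lefts, lefts[1:])]
    if not steps:
        return False
    pitch = median(steps)
    return all(abs(step - pitch) <= tolerance * pitch for step in steps)


def chop_fixed_pitch(left, right, pitch):
    """Chop the x-range [left, right] into character cells of width `pitch`."""
    if pitch <= 0:
        raise ValueError("pitch must be positive")
    cell_count = max(1, round((right - left) / pitch))
    return [(left + i * pitch, left + (i + 1) * pitch) for i in range(cell_count)]


# Hypothetical word spanning x = 10..50 with a pitch of 10 pixels.
fixed = is_fixed_pitch([10, 20, 30, 40])
proportional = is_fixed_pitch([10, 18, 33, 40])
cells = chop_fixed_pitch(10, 50, 10)
```

Once a word has been cut into cells this way, each cell can be classified as a single character, which is why no further chopping or joining is needed for such words.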

Works Cited