Tesseract OCR: Understanding the Contents of Documents, Beyond Their Text
A few days ago, I stumbled upon a question on Reddit. The commenter wanted to know if it was possible to extract text from a rough location on an image, so basically a screenshot or an image of a standard-form document. They also wanted to know if this could be achieved with Tesseract OCR (the image-to-text toolkit sponsored by Google).
The topic of the question is so close to two of my current projects — as well as the topic of my Master’s thesis — that I just had to write a decent answer.
In this write-up, I’ll introduce a demo use case, go through the use of Tesseract, introduce a few usable heuristics, and show a few code examples. After reading this post, you should be able to do some basic text extraction from images of simple documents (e.g. forms, driver’s licenses, etc.).
I have written the code examples in Python, but I’ve tried to keep them as simple as possible. They’re not supposed to be pretty or “best practice”, just understandable for everyone regardless of their favorite programming language. I have also used the standard output of Tesseract in our analysis. For real applications, you’d want to use a wrapper such as pytesseract (and its image_to_data() method, which returns the same word-level table).
Problem: Understanding the Layout of a Driver’s License
So, let’s imagine we’re building a mobile app that knows when the user’s driver’s license is scheduled to expire, and reminds the user to renew it when the time comes. The UI of our imaginary app is simple: the user takes a photo of their driver’s license, and the app does the rest!
How should we go about building this kind of an application?
Well, assuming that the layout of the document/screenshot/etc. in our photo is more or less static (as is the case with our driver’s licenses; let’s assume they are from one single country/state), the following three steps should be enough:
- Extract text and the position of the words and paragraphs using Tesseract OCR
- Apply some heuristics to get the data you need
- Build the rest of the app (the logic etc., outside the scope of this post)
Now, let’s go through each step in a little bit more detail.
Extracting Text and its Position with Tesseract OCR
Before starting, make sure you have Tesseract OCR 4 installed. As there are countless installation guides for it online (e.g. this one for Windows 10), I won’t go through it here.
We’re also assuming that the language of the text in the image is English, as that’s installed by default.
So, we want to get the text from our image of a driver’s license. We’ll be using this image for the demo:
Tesseract has certain tolerances for what it sees as letters and what as background. So, you might want to play around and find the correct pre-processing steps for your input images. I had to up the brightness and contrast a little bit (resulting in Image 2) to get any decent results.
We’ll be using Tesseract OCR through its command-line interface. Open your terminal (or for Windows, your command prompt), and type in the following:
tesseract -l eng FILENAME_OF_YOUR_IMAGE.jpg out tsv
What does this mean? Let’s break it into pieces:
- tesseract: Call for the Tesseract OCR application.
- -l eng: This tells Tesseract that you’re trying to detect English.
- FILENAME_OF_YOUR_IMAGE.jpg: Path to the image you’re trying to analyze. Replace this with path to your image.
- out: The filename of the output. If you wish to just output the results in the terminal, replace this with a dash (“-”).
- tsv: Output type, meaning tab-separated values. As opposed to just the default plaintext, this option gets you a nice file that contains all of the words and their pixel-positions in the image.
Please note! If you’re using Windows, you need to call the tesseract.exe executable. You’ll need to either add it to your PATH environment variable, or call the executable directly, so:
C:\some\path\tesseract.exe -l eng FILENAME_OF_YOUR_IMAGE.jpg out tsv
After pressing Enter, the output should look something like this:
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Estimating resolution as 231
Detected 24 diacritics
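If you’d rather trigger the same command from a script, here is a minimal sketch using Python’s standard subprocess module. The function names and the file name license.jpg are made up for illustration, and the sketch assumes the tesseract executable is on your PATH (on Windows, pass the full path to tesseract.exe as executable instead).

```python
import subprocess


def build_tesseract_command(image_path, output_base="out", lang="eng",
                            executable="tesseract"):
    """Build the same command line as above, as a list of arguments."""
    return [executable, "-l", lang, image_path, output_base, "tsv"]


def run_tesseract(image_path, output_base="out", lang="eng",
                  executable="tesseract"):
    """Run Tesseract and return the name of the TSV file it writes."""
    command = build_tesseract_command(image_path, output_base, lang, executable)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
    return output_base + ".tsv"


# Only prints the command; call run_tesseract("license.jpg") to execute it.
print(build_tesseract_command("license.jpg"))
```

Passing the arguments as a list avoids shell quoting problems when the image path contains spaces.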
If you open the file out.tsv that’s been saved to your current working directory, its contents should look something like this, but less pretty:
“That looks spooky. What are all those numbers?! What the hell??”
Do not worry. Next, I’ll explain the contents as thoroughly as possible.
Each row in Tesseract’s TSV output represents a word, or a layout structure (such as a page or a paragraph), found within our input document.
Let’s start from the easiest and the most important part: the text. The rightmost column text contains a word recognized by Tesseract. If the row represents a layout structure, the value for this column will be empty.
The column conf contains — in percentages — Tesseract’s confidence level for the word being correctly recognized. The higher the percentage, the more likely it is that Tesseract got it right. E.g. the value “96” on row 20 (in Table 1 above) means that Tesseract is 96% sure the word “MALAYSIA” is correctly recognized. A confidence of “-1” would mean that the row doesn’t contain a word, but an aforementioned layout structure representation.
Columns left and top represent the x and y coordinates, respectively, of the word in the image. Columns width and height mean just that: the width and height of the word in the image. The units of all these values are pixels.
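To make the columns a bit more concrete, here is a rough sketch that reads the TSV with Python’s built-in csv module, keeps only the rows that contain an actual word, and turns left/top/width/height into a bounding box. The sample rows are made up in the shape of Tesseract’s output (they are not the demo data), and read_words and min_conf are names invented for this example.

```python
import csv
import io

COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
           "left", "top", "width", "height", "conf", "text"]

# Made-up rows in the shape of Tesseract's TSV output (not the real demo data).
SAMPLE_ROWS = [
    [1, 1, 0, 0, 0, 0, 0, 0, 1000, 600, -1, ""],          # page row, no text
    [5, 1, 1, 1, 1, 1, 40, 30, 120, 28, 96, "MALAYSIA"],  # confident word
    [5, 1, 1, 1, 1, 2, 170, 30, 60, 28, 12, "Ee"],        # low-confidence noise
]
SAMPLE_TSV = "\n".join(
    ["\t".join(COLUMNS)] + ["\t".join(str(v) for v in row) for row in SAMPLE_ROWS]
)


def read_words(tsv_file, min_conf=0.0):
    """Return the word rows of a Tesseract TSV file as a list of dicts."""
    # QUOTE_NONE: recognized text may contain quote characters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row["text"] or "").strip()
        conf = float(row["conf"])
        # Layout rows have conf -1 and no text; skip them and weak words.
        if not text or conf < min_conf:
            continue
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        words.append({
            "text": text, "conf": conf,
            "left": left, "top": top,
            "right": left + width, "bottom": top + height,
        })
    return words


for word in read_words(io.StringIO(SAMPLE_TSV), min_conf=50):
    print(word)
```

For a real file, you would pass open("out.tsv", newline="", encoding="utf-8") instead of the in-memory sample.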
Then, let’s jump to the first column, level. Its value (1 to 5) tells us whether the row in the output is…
- 1: a page
- 2: a block
- 3: a paragraph
- 4: a line
- 5: a word.
Columns page_num, block_num, par_num, line_num and word_num identify the individual instances of these layout structures within the hierarchy of the document. Let’s go through these in a bit more detail.
- page_num is the page on which the word was found. Please note! In this demo, we’re not analyzing multi-page PDF documents, so our page_num only gets the value 1. If we had multiple pages in our document (i.e. we weren’t analyzing singular photos), we’d see higher indices here as well.
- block_num tells the detected layout block within the current page.
- par_num tells the paragraph number within the layout block.
- line_num tells the line number within the paragraph.
- word_num tells the word number within the line.
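The numbering columns can be combined into a dotted ID per row (block.paragraph.line.word), and words that share the same page, block, paragraph and line can be joined back into a line of text. The following is only a sketch on made-up rows; structure_id and group_lines are hypothetical helper names.

```python
LEVEL_NAMES = {1: "page", 2: "block", 3: "paragraph", 4: "line", 5: "word"}

# Made-up rows: (level, page_num, block_num, par_num, line_num, word_num, text)
ROWS = [
    (2, 1, 3, 0, 0, 0, ""),
    (3, 1, 3, 1, 0, 0, ""),
    (4, 1, 3, 1, 1, 0, ""),
    (5, 1, 3, 1, 1, 1, "LESEN"),
    (5, 1, 3, 1, 1, 2, "MEMANDU"),
]


def structure_id(row):
    """Dotted ID such as '3.1.1.2'. Page rows (level 1) get an empty ID."""
    level, _page, block, par, line, word, _text = row
    parts = [block, par, line, word][: level - 1]
    return ".".join(str(part) for part in parts)


def group_lines(rows):
    """Join word rows into text lines, keyed by (page, block, par, line)."""
    lines = {}
    for level, page, block, par, line, word, text in rows:
        if level != 5 or not text.strip():
            continue
        lines.setdefault((page, block, par, line), []).append((word, text))
    # Sort by word_num so the words come out in reading order.
    return {key: " ".join(text for _, text in sorted(words))
            for key, words in lines.items()}


for row in ROWS:
    print(LEVEL_NAMES[row[0]], structure_id(row))
print(group_lines(ROWS))
```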
Below in Table 2, you can see the different rows within the output file (i.e. Table 1) highlighted with different colors, based on the “level” of the row:
In Table 2, there are three blocks (1–3), with one paragraph each (1.1, 2.1 and 3.1). Each paragraph contains just one line (1.1.1, 2.1.1 and 3.1.1). The first two paragraphs contain single words that contain just whitespace (1.1.1.1, 2.1.1.1), whereas the last paragraph contains the words “LESEN” (3.1.1.1), “MEMANDU” (3.1.1.2) and “Ee” (3.1.1.3).
Visualized on the input image, the blocks, paragraphs, lines and words would look like this:
As you can see, the layout structures identified by Tesseract are far from perfect. But hey, that’s the nature of OCR.
Back to our demo! Below are the rows that we’d be interested in for our analysis. In the demo, we’ll concentrate on the last row, as it contains the driver’s license’s expiration date — and that was the point of our app!
I hear you ask: “So, can we just take the input image, run Tesseract, and select the rows where page_num=1, block_num=7, par_num=2, and line_num=2? Could it be as simple as that?!”
Sadly, no. :(
As mentioned before, the layout structures identified by Tesseract aren’t perfect.
The block, paragraph, line, and even word numbers may change a lot if the analyzed image changes ever so slightly. Different lighting, a different aspect ratio, or any of dozens of other alterations to the input image can change these values considerably. So what’s page 1 and block 7 here could be page 1 and block 3 if the image was changed even slightly.
Unless you’re analyzing very simple black-on-white screenshots with an absolutely static layout structure, you shouldn’t trust these numbers at all. Maybe not even then. They are, however, somewhat logical, so they can be useful in some instances. We’ll get to that in the next chapter.
So, what should we do, then? How would you go about extracting this one specific piece of text from this general area of the document? There are countless possibilities, but we’ll now go through a couple of them.
Understanding the Image
For these examples, I’ll be using Python. If you want to follow the examples, please make sure you have that installed as well.
It should be quite straightforward to install, so just head on to https://www.python.org/downloads/ and install it.
We’ll be using the Python packages re (for regex) and pandas.
Okay, so now we’ve managed to get Tesseract’s output. Next, we need to build something that lets us take any image of a driver’s license, and extract the validity dates from it. Broadly speaking, there are two possible approaches for this: using heuristics, or using machine learning.
Using heuristics, so basically rules-of-thumb, is what you need to do if you don’t have lots of training data available. It’s also the preferred option, if your data is (and hence, your model would be) simple enough to base rules on. In the absence of a large dataset, a simple rule-of-thumb can be a lot more robust than machine learning models.
Coming up with rules of thumb from your data is a lot more of an art than a science. You’re basically just making hypotheses from your data, and it’s up to your use case how much you need to test and validate them.
Machine learning models, on the other hand, can be a bit more context-specific. A model that works well for Malaysian driver’s licenses might be completely off-topic for e.g. analyzing screenshots of an app, so further discussion on ML models will be the premise for another blog post. Maybe by then I’ll come up with a better demo use case than driver’s licenses.
Regarding the use of heuristics, we’ll now go through the following approaches:
- Position-based identification of text
- Trusting Tesseract’s hierarchy detection
By position-based identification of text, I mean that we’re identifying the relevant words in the document based on their x and y coordinates relative to the image.
If we want to identify strings solely based on their position in the document, we need to be sure that the layout of the document doesn’t change. Depending on the rules we build, we can maybe let the aspect ratio or the resolution of the image change, but for any more than that, we need to use more hints from the data.
Now, let’s assume that the driver’s license photos are always somewhat neatly cropped (e.g. cropped manually by the app’s user, or with some simple edge detection algorithm). This way, we’ll be able to just go through each word that Tesseract has recognized, and check whether its position and size in the image fall within the ranges we expect for the text we’re after.
Below we can see the coordinates relevant to this exercise.
Based on these values, we can calculate the relative size and position of each word that was found in the image.
# Relative size of the word within the image
relative_width = width_word / width_image
relative_height = height_word / height_image
# Relative position of the word within the image
relative_x = x_word / width_image
relative_y = y_word / height_image
Here’s what this would look like as a Python script:
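A minimal sketch of such a script is below. It assumes the words have already been parsed into dictionaries with left, top, width and height in pixels. The numbers in MIN_VALUES and MAX_VALUES, the sample words and the 1000x600 image size are placeholders made up for illustration, not measurements from the demo image, and words_in_region is a hypothetical helper name.

```python
# Allowed ranges as fractions of the image size, in the order
# (relative_x, relative_y, relative_width, relative_height).
# Placeholder numbers for illustration only; tune them on your own images.
MIN_VALUES = (0.45, 0.60, 0.05, 0.02)
MAX_VALUES = (0.80, 0.75, 0.30, 0.08)


def relative_box(word, width_image, height_image):
    """Position and size of a word relative to the image dimensions."""
    return (
        word["left"] / width_image,
        word["top"] / height_image,
        word["width"] / width_image,
        word["height"] / height_image,
    )


def words_in_region(words, width_image, height_image,
                    min_values=MIN_VALUES, max_values=MAX_VALUES):
    """Return the text of every word whose relative box is within bounds."""
    matches = []
    for word in words:
        box = relative_box(word, width_image, height_image)
        if all(lo <= value <= hi
               for lo, value, hi in zip(min_values, box, max_values)):
            matches.append(word["text"])
    return matches


# Made-up words on a 1000x600 px image. The image size can be taken from
# the page row (level 1) of the TSV, whose width and height cover the image.
WORDS = [
    {"text": "MALAYSIA", "left": 40, "top": 30, "width": 120, "height": 28},
    {"text": "01/01/2020", "left": 350, "top": 400, "width": 110, "height": 24},
    {"text": "-", "left": 470, "top": 400, "width": 10, "height": 24},
    {"text": "31/12/2024", "left": 490, "top": 400, "width": 110, "height": 24},
]
print(words_in_region(WORDS, 1000, 600))
```

With these made-up bounds, only the second date falls inside the region: the first date starts too far to the left, and the dash is too narrow.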
For our driver’s license data, we get the following output:
And that’s just the value we want! Nice!
This detection approach relies completely on the lists MIN_VALUES and MAX_VALUES. Their contents define the allowed ranges for the words’ X and Y coordinates, widths and heights, relative to the image.
But where did those values come from?
In this case, they were manually tweaked to cover the rough area of the desired text in the image. So, if the layout of your images is super static, and you know how much the bit of text you’re trying to identify can move around between different images, then you can just guesstimate these values.
But what if the position of the text we’re interested in changes a lot, but it’s always close to some known keyword? Like in this case, “Validity”?
Let’s change the aforementioned code a bit. Let’s assume we’ve gone through all the words outputted by Tesseract, and we’ve found the anchor word “Validity”. Its X and Y coordinates are therefore known, and are represented below with x_anchor and y_anchor.
# Relative size of the word within the image
relative_width = width_word / width_image
relative_height = height_word / height_image
# Relative position of the word to the anchor word, within the image
relative_x_distance = (x_word - x_anchor) / width_image
relative_y_distance = (y_word - y_anchor) / height_image
With these changes, the code would look like this:
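Here is a rough sketch of the anchor-based variant. The bounds, the sample words and the helper names find_anchor and words_near_anchor are made up for illustration; the only thing taken from the demo is the anchor word “Validity”.

```python
# Allowed ranges as fractions of the image size, in the order
# (relative_x_distance, relative_y_distance, relative_width, relative_height).
# Placeholder numbers for illustration only.
MIN_VALUES = (0.10, 0.02, 0.05, 0.02)
MAX_VALUES = (0.40, 0.10, 0.30, 0.08)


def find_anchor(words, anchor_text):
    """Return the first word matching the anchor text, ignoring case and ':'."""
    for word in words:
        if word["text"].strip(":").lower() == anchor_text.lower():
            return word
    return None


def words_near_anchor(words, anchor_text, width_image, height_image):
    """Return words whose offset from the anchor and size are within bounds."""
    anchor = find_anchor(words, anchor_text)
    if anchor is None:
        return []
    matches = []
    for word in words:
        if word is anchor:
            continue
        features = (
            (word["left"] - anchor["left"]) / width_image,
            (word["top"] - anchor["top"]) / height_image,
            word["width"] / width_image,
            word["height"] / height_image,
        )
        if all(lo <= value <= hi
               for lo, value, hi in zip(MIN_VALUES, features, MAX_VALUES)):
            matches.append(word["text"])
    return matches


# Made-up words on a 1000x600 px image.
WORDS = [
    {"text": "Validity", "left": 350, "top": 370, "width": 90, "height": 24},
    {"text": "01/01/2020", "left": 350, "top": 400, "width": 110, "height": 24},
    {"text": "-", "left": 470, "top": 400, "width": 10, "height": 24},
    {"text": "31/12/2024", "left": 490, "top": 400, "width": 110, "height": 24},
    {"text": "Class", "left": 350, "top": 470, "width": 60, "height": 24},
]
print(words_near_anchor(WORDS, "Validity", 1000, 600))
```

Because every distance is measured from the anchor, shifting all of the words by the same number of pixels leaves the result unchanged, which is the point of this approach.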
And after running the code, we get the familiar result we were waiting for:
These are the kinds of approaches you can take based on positional data.
Trusting Tesseract’s Layout Hierarchy
Okay. Remember when I mentioned that the level information outputted by Tesseract shouldn’t be trusted? Yeah, we’re addressing that now.
The final approach I want to cover assumes that Tesseract is able to categorize the anchor word (“Validity”) and the word-of-interest (the date) consistently into the same layout block. For some use cases, this approach might work better than the previous approaches.
In Table 4, we have the result for the following pandas query:
As we cannot trust the exact IDs of the layout structures, we need to, again, find our anchor word (“Validity”), and look for other words in the same paragraph. Then, we’ll iterate through the rows, and look for strings that look like dates (using regular expressions).
Our little script would look like this:
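A minimal sketch of this idea follows. It assumes the dates are printed as DD/MM/YYYY, which you should adjust to your own documents, and it runs on made-up rows that only carry the hierarchy columns and the text. dates_near_anchor is a hypothetical helper name.

```python
import re

# Assumed date format DD/MM/YYYY; adjust the pattern to your documents.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

# Made-up word rows (not the real demo output).
ROWS = [
    {"page_num": 1, "block_num": 4, "par_num": 1, "line_num": 1, "text": "05/06/1990"},
    {"page_num": 1, "block_num": 7, "par_num": 2, "line_num": 1, "text": "Validity"},
    {"page_num": 1, "block_num": 7, "par_num": 2, "line_num": 2, "text": "01/01/2020"},
    {"page_num": 1, "block_num": 7, "par_num": 2, "line_num": 2, "text": "-"},
    {"page_num": 1, "block_num": 7, "par_num": 2, "line_num": 2, "text": "31/12/2024"},
]


def paragraph_key(row):
    """Identify the paragraph a row belongs to, without hard-coding its IDs."""
    return (row["page_num"], row["block_num"], row["par_num"])


def dates_near_anchor(rows, anchor_text="Validity"):
    """Return date-like strings found in the same paragraph as the anchor."""
    anchor_keys = {
        paragraph_key(row) for row in rows
        if row["text"].strip(":").lower() == anchor_text.lower()
    }
    dates = []
    for row in rows:
        if paragraph_key(row) in anchor_keys:
            dates.extend(DATE_PATTERN.findall(row["text"]))
    return dates


print(dates_near_anchor(ROWS))
```

Note that the block and paragraph numbers are never written into the code; they are looked up from wherever the anchor word happens to land.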
Running that, we should get the following output:
But wait, that’s two dates?!
Well, we know the validity period of our driver’s license falls between these two dates, so we can safely assume that the expiration date is the latter. You know, ’cause it’s later.
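Picking the later date is easiest once the strings are parsed into real dates. The sketch below assumes the DD/MM/YYYY format again and uses made-up date strings; expiration_date is a hypothetical helper name.

```python
from datetime import date, datetime


def expiration_date(date_strings, date_format="%d/%m/%Y"):
    """Parse the date strings and return the latest one, or None if empty."""
    parsed = [datetime.strptime(s, date_format).date() for s in date_strings]
    return max(parsed) if parsed else None


print(expiration_date(["01/01/2020", "31/12/2024"]))
```

Comparing the raw strings instead would be a mistake with day-first dates: as plain text, "31/12/2019" sorts after "01/01/2024".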
And there, now you have the expiration date!
That was everything about understanding the layout of your images! Now you should be able to handle text that comes with positional metadata.
Next, we’ll briefly discuss a few topics you might want to research further before building your production application.
Where to Next?
This has been a very quick overview of the world of layout analysis, OCR, and machine learning. The demo application in itself is clearly not usable for like 99.7% of the true use cases out there, but it’s a good starting point for further exploration.
Here are a couple of things you might need to research a bit more, depending on your use case.
Detecting Image Manipulation
There are many aspects that need to be considered when building a system like this. If we were building our demo application for e.g. the identification of our online customers, and we were planning to rely solely on Tesseract OCR, our driver’s license authentication step could be bypassed in three minutes, with some arts and crafts on Microsoft Paint. Hell, our application wouldn’t think twice if we presented it with this:
For some use cases, that doesn’t matter. If it does matter, you might need to train a more complex model to recognize potential manipulation artefacts in the image.
If you have access to a lot of examples of valid images and manipulated images (even if manually manipulated), you might be able to train an image classifier that outputs the probability of the image being manipulated.
Document Layout Classification
In the case of driver’s licenses, you will absolutely have more different layouts than in our demo. Each country in the world has their own driver’s licenses, and for example in the US, every state has their own driver’s licenses! Our simple heuristics will never understand what’s going on when you apply them to a Florida driver license.
In essence, to be able to use our heuristics in a real-world setting, we would have to know which layout the image represents. Is it a Florida driver license, a Malaysian lesen memandu, a Swedish körkort, or a Finnish ajokortti?
This information could be extremely important for other reasons as well. For example, how we parse and understand dates is dependent on the locale, and that depends on the country-of-origin of the driver’s license! So this is an important step for many use cases.
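To illustrate why the locale matters, the sketch below parses the same string with two different date formats. The layout names and the format assigned to each are assumptions for this example (day-first for Malaysia, month-first for Florida); check the actual documents before relying on them.

```python
from datetime import date, datetime

# Hypothetical mapping from layout name to date format.
DATE_FORMATS = {
    "malaysia": "%d/%m/%Y",  # day first (assumed)
    "florida": "%m/%d/%Y",   # month first (assumed)
}


def parse_license_date(text, layout):
    """Parse a date string using the format assumed for the given layout."""
    return datetime.strptime(text, DATE_FORMATS[layout]).date()


print(parse_license_date("03/04/2025", "malaysia"))
print(parse_license_date("03/04/2025", "florida"))
```

The same characters end up a month apart depending on the layout, which is enough to send a renewal reminder on the wrong day.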
There’s no one correct method to identify the correct underlying layout. You could use user input, or some keywords found in the document (such as “LESEN MEMANDU” in our demo). Or, again, with enough data, you could train an image classifier that recognizes the correct layout for you.
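The keyword idea can be sketched in a few lines. Only “LESEN MEMANDU” comes from the demo image; the other keywords, the layout names and the function guess_layout are hypothetical and would need to be checked against real documents.

```python
# Hypothetical keyword lists; only "LESEN MEMANDU" comes from the demo image.
LAYOUT_KEYWORDS = {
    "malaysia": ["LESEN MEMANDU"],
    "florida": ["FLORIDA", "DRIVER LICENSE"],
    "sweden": ["KÖRKORT"],
    "finland": ["AJOKORTTI"],
}


def guess_layout(ocr_text):
    """Return the layout with the most keyword hits, or None if none match."""
    # Upper-case and collapse whitespace so line breaks don't hide keywords.
    normalized = " ".join(ocr_text.upper().split())
    scores = {
        layout: sum(keyword in normalized for keyword in keywords)
        for layout, keywords in LAYOUT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_layout("LESEN  MEMANDU\nMALAYSIA"))
```

OCR noise can easily break exact keyword matches, so a real application would probably want fuzzy matching or several keywords per layout.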
If you got this far, you should now know how to:
- use Tesseract OCR to extract text from image-based documents
- interpret Tesseract’s outputs and understand the logic behind its layout structure
- build simple heuristics that allow you to analyse Tesseract’s outputs further.
Hopefully you got something out of this, and I hope you’ll be able to use some of this in your work. :) I’m also more than happy to give my opinion on some enterprise OCR applications and cloud services that tackle these same issues.
As mentioned earlier in the post, I’ve been building a couple of tools for basically this use case for a while now. If this is a topic you find yourself to be struggling with, please follow me here or message me on Twitter (https://twitter.com/waltteri_v/). Any criticism regarding this blog post is also more than welcome!