Smartly Detect & Classify Scanned Documents

Chirav Dave
Intel Student Ambassadors
4 min read · Aug 14, 2019

As part of an ongoing project for my summer internship, I’ve been working on detecting and classifying scanned documents. The goal is to detect the individual documents present in any given scanned image and assign each detected document to its class. The documents include several kinds of forms, driving licenses, insurance cards, and drawings. I can’t show you the actual data, but they look like typical scanned documents. Here’s a similar example:

Think of these as the text inside two insurance cards and two driving licenses

A common approach to problems like this is to train a deep neural network (DNN), typically a convolutional network since we are dealing with images. DNNs have proved tremendously successful for a variety of computer vision applications, including image classification, object detection, and face detection. However, DNNs are expensive in both computation time and memory usage, so they may not be practical in every case. Transfer learning is another option, but it tends to work best when the data you want to train on is similar to the data the network was originally trained on. As the image above suggests, the documents I want to detect are not rich in features compared with the images in ImageNet, so simple computer vision (CV) algorithms can do the job.

I decided to develop a custom computer vision algorithm that relies on a series of well-studied fundamental components rather than the “black box” of machine learning algorithms such as DNNs. My first observation was that these documents are usually rectangular in physical space, with text inside them, while the rest of the image is blank or contains some noise. The goal therefore became to form groups of text and draw a rectangle around each group to serve as its bounding box. Here’s the outline of my detection algorithm; I discuss each component in more detail next.

Document Detection Pipeline
1. Edge Detection: First, I convert the input image to grayscale, apply a bilateral filter, which is highly effective at removing noise while keeping edges sharp, and finally apply the Canny edge detector. This produces white pixels wherever there’s an edge in the original image. It yields something like this:
Edge Detection

2. Bounding Box: To form groups of text (one per document) and draw bounding boxes around them, I perform dilation after edge detection. Dilation enlarges the white regions and is a good way to accentuate features, so nearby edge pixels merge into a white blob covering each document while the rest of the image stays black. This helps with the final step, where I run contour detection to find the rectangles around the documents. It yields something like this:

White-Colored Blobs & Bounding Boxes around the documents

3. OCR with Rotations: In this stage, I run OCR to extract the text from each detected document. For OCR I use Tesseract, a remarkably long-lived open source OCR engine that originated at HP in the 1980s and was later sponsored by Google. Because the documents are not necessarily well aligned, I rotate them until they are horizontal so that OCR yields sensible text.

4. Bag of Words: Once I have the text from OCR, I transform it into numeric features using scikit-learn’s CountVectorizer, since machine learning models operate on numbers rather than raw text.
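
As a small illustration of what the bag-of-words step produces, here is a sketch with two made-up OCR strings (the texts are invented for this example). Each row of the result is one document and each column holds the count of one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented stand-ins for OCR output from two detected documents
ocr_texts = [
    "DRIVER LICENSE class C driver",
    "insurance card member id",
]

vectorizer = CountVectorizer()  # lowercases and keeps tokens of 2+ characters
features = vectorizer.fit_transform(ocr_texts)  # sparse matrix of word counts

# vocabulary_ maps each word to its column index in the matrix
driver_column = vectorizer.vocabulary_["driver"]
driver_count = features[0, driver_column]  # how often "driver" occurs in text 0
```

Note that the values are word counts, not just 0s and 1s, and that with default settings single-character tokens such as "C" are dropped from the vocabulary.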

5. ML Classifier: Finally, I train scikit-learn’s Random Forest model on the features from the bag-of-words model. Once the model is trained, I use it to classify new images.
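
Steps 4 and 5 can be chained in a single scikit-learn pipeline. The sketch below uses a tiny invented training set and assumed hyperparameters (`n_estimators=100`, `random_state=0`); the class labels are placeholders modelled on the document types mentioned earlier, and a real model would need many more labelled examples.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Invented OCR texts and labels, for illustration only
train_texts = [
    "driver license class expires",
    "driver license birth date",
    "state driver license number",
    "insurance card member group",
    "health insurance card copay",
    "insurance card policy number",
]
train_labels = ["driving_license"] * 3 + ["insurance_card"] * 3

# Bag of words feeds directly into the random forest
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(train_texts, train_labels)

# Classify the OCR text of a new, unseen document
predicted = model.predict(["driver license class"])[0]
```

Keeping the vectorizer inside the pipeline ensures new documents are mapped onto the same vocabulary that was learned during training.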

In conclusion, I found this to be a surprisingly tricky problem, but I’m happy with the solution. There’s still work to be done on the bounding box detection, which requires tuning the step that generates the white blobs. I will release my code soon so that others can benefit from it and help me improve it.
