Detecting Geometrical Shapes Using OpenCV + ConvNets

KT · Published in The Startup · 5 min read · Aug 6, 2020

A simple yet powerful pipeline for detecting shapes in scanned documents


What is this about?

One of the most rapidly growing subfields of Artificial Intelligence is Natural Language Processing (NLP). It deals with the interactions between computers and human (natural) languages, in particular how to program computers to process and make sense of large amounts of natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation, among others. Of these, information extraction problems such as NER (Named Entity Recognition) are fast becoming cornerstone applications of NLP. In this post, I am going to share a solution for one of the trickiest problems that comes up while performing NER.

Why do we need a custom solution?


Recent developments in Deep Learning have led to an explosion of sophisticated techniques available for entity extraction and other NLP-related tasks. More often than not, enterprise-grade OCR software (ABBYY, Adlib, etc.) is used to transform massive volumes of unstructured and image-based documents into fully searchable PDF and PDF/A assets. Subsequently, one can use state-of-the-art models (BERT, ELMo, etc.) to build highly contextual language models over the extracted text and achieve the NLP objective.

In reality, though, not all documents consist solely of language-based data. A document can have a lot of other non-linguistic elements, such as radio buttons, a signature block, or some other geometrical shape, that may contain useful information but cannot be easily interpreted by either OCR or any of the aforementioned models. So there exists a need for a specialized solution to identify and interpret such elements, and that's our Why.

An example of check boxes and radio buttons in a document

How do we do it?

Now, this is where things get interesting. How do we perform extraction and identification of such elements from a scanned document? To answer this, I propose a 3-step architecture that can potentially be used to detect any shape (a universal shape detector? maybe). It's a pretty straightforward approach, and one that promises good accuracy.

Step 1: Convert the documents (PDFs, etc.) to image files. Write heuristic code based on OpenCV APIs to extract all potential image segments. This code should be optimized for coverage rather than accuracy.

Step 2: Label the images extracted in Step 1 accordingly. Create a CNN-based Deep Learning network and train it on the labelled images. This step will take care of the accuracy.

Step 3: Create a Sklearn pipeline integrating both of the above steps, so that when a document is ingested, all of the potential images are extracted and the trained CNN model is then used to predict images of the desired shape.

A high-level overview of the solution

Design Considerations

It's important that the OpenCV code is able to identify as many image segments of the desired shape as possible. Essentially, we need a wide detection range; don't worry about the false positives, as they will be taken care of by the subsequent ConvNet model. We also need to parameterize the classes/functions to the brim, which will enable easy configuration for a variety of documents going forward. I have chosen a CNN for image classification because it is easy and quick to model, but one can use any other algorithm of choice as long as performance and accuracy are within acceptable limits.

Pipelining plays a pivotal role in structuring ML code. It helps streamline the workflow and enforce the order of step execution. Moreover, production-level code should always be piped.

Let's take the 3 steps

Step #1: The OpenCV code

This code serves a dual purpose: 1) creating training/test data (when executed standalone) and 2) extracting image segments when integrated into the pipeline.

The extraction code can currently detect 2 types of elements (radio buttons and check-boxes), but additional objects can easily be supported by adding new methods to the ShapeFinder class. Below is the code to identify squares/rectangles, aka check-boxes. (Go here to see the complete code base.)

*Use pdf2image to convert the PDF to images. I have not included this in the Git repo since my data was already in image format.

import os

import numpy as np
from pdf2image import convert_from_path

def Pdf2Img(dirname):

    images = []

    # get each pdf file in the directory and convert its pages to images
    for x in os.listdir(dirname):
        if x.split('.')[-1] == 'pdf':
            images_from_path = convert_from_path(
                os.path.join(dirname, x), dpi=300,
                poppler_path=r'C:\Program Files (x86)\poppler-0.68.0_x86\poppler-0.68.0\bin')
            for image in images_from_path:
                images.append(np.array(image))

    return images
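
As for the detection itself, the exact ShapeFinder implementation lives in the repo; the snippet below is only a minimal sketch of the contour-based idea, and the function name and size/tolerance thresholds are illustrative assumptions rather than the repo's API.

# A minimal sketch of the square/check-box detection idea, assuming the
# page is already an OpenCV BGR image. Names and thresholds here are
# illustrative, not the exact ShapeFinder API from the repo.
import cv2
import numpy as np

def find_checkboxes(image, min_side=10, max_side=60, tolerance=0.25):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # inverted Otsu threshold makes the dark box outlines the foreground
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(thresh, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        # approximate each contour with a coarser polygon
        peri = cv2.arcLength(cnt, True)
        approx = cv2.approxPolyDP(cnt, 0.04 * peri, True)
        if len(approx) != 4:
            continue  # keep only quadrilaterals
        x, y, w, h = cv2.boundingRect(approx)
        # keep shapes that are box-sized and roughly square
        if (min_side <= w <= max_side and min_side <= h <= max_side
                and abs(w / h - 1.0) <= tolerance):
            boxes.append((x, y, w, h))
    return boxes

Note how loose the filters are: per the design considerations above, this stage is tuned for coverage, and the ConvNet cleans up the false positives.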

Now let's talk about step #2, i.e. the Convolutional Neural Network.

Since the extracted image segments will have relatively small dimensions, a simple 3-layer CNN will do for us, but we still need to throw in some regularization and an Adam optimizer to optimize the output.
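
For reference, here is a minimal Keras sketch of such a network; the input size, filter counts and dropout rate are illustrative choices, not the exact configuration from the repo.

# A minimal Keras sketch of the 3-layer ConvNet described above.
# Input size, filter counts and dropout rate are illustrative choices.
from tensorflow.keras import layers, models

def build_model(input_shape=(32, 32, 1)):
    model = models.Sequential([
        layers.Conv2D(16, (3, 3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dropout(0.5),                    # the regularization bit
        layers.Dense(1, activation='sigmoid')   # binary: desired shape or not
    ])
    model.compile(optimizer='adam',             # the Adam mentioned above
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model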

The network should be trained separately on each type of image sample for better accuracy. You may create a new network in case a new image shape is added, but for now I have used the same one for both checkboxes and radio buttons. It currently performs only binary classification, but further categorization can also be done, like:

  • Ticked checkbox
  • Empty checkbox
  • Others

Finally, in step #3, we will stitch everything together in a single Sklearn pipeline and expose it through the predict function.
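
As a sketch of what that stitching could look like (ShapeExtractor, CNNClassifier and extract_candidates below are hypothetical names, not the repo's exact API):

# A sketch of the stitching: a custom transformer wraps the OpenCV
# extraction, and a thin wrapper exposes the trained CNN to sklearn.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class ShapeExtractor(BaseEstimator, TransformerMixin):
    """Runs the OpenCV heuristics and returns candidate image segments."""
    def fit(self, X, y=None):
        return self  # nothing to learn, heuristics only

    def transform(self, X):
        # X: list of page images -> flat list of cropped candidate segments
        return [crop for page in X for crop in extract_candidates(page)]

class CNNClassifier(BaseEstimator):
    """Thin wrapper exposing the trained Keras model through predict()."""
    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        return self  # the CNN is trained offline, nothing to do here

    def predict(self, X):
        # assumes the crops have been resized to the CNN's input shape
        return (self.model.predict(np.array(X)) > 0.5).astype(int)

pipeline = Pipeline([
    ('extract', ShapeExtractor()),
    ('classify', CNNClassifier(model)),  # model: the trained CNN from step #2
])
# predictions = pipeline.predict(page_images)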

One important piece of functionality that I have not covered is associating the checkbox or radio button with its corresponding text in the document. Just detecting elements without this association is frankly useless in real-world applications. I will leave this as an open challenge to you, but think of it as a text proximity problem.

Final thoughts

‘One size doesn’t always fit all’, and this is especially true here, so tend to think of this code as a kind of template. As-is, it is not intended to work for everyone, and that’s perfectly fine, but the approach will work for the given documents/shapes provided some effort is put into fine-tuning the parameters and creating the training data.

Link to Git

Drop your feedback in the comments!
