Document Details Parsing using OCR

By — Sourabh Jajoria (Engineer, Partner Ecosystem)

UC Blogger
Urban Company – Engineering

--

Forms are an integral part of any on-boarding system. At UrbanClap, our partners are required to go through strict background verification. Partners have to upload photos of their ID cards, fill in the details printed on those cards, and provide their permanent and local addresses. Most of our partners come from backgrounds with limited exposure to technology and are prone to making mistakes while filling in details on the partner app. A lot of the operations team's bandwidth was consumed in correcting filled details and validating uploaded images. We reduced this bandwidth by incorporating OCR and document parsers into the flow. In this blog, we discuss how this change improved user convenience, increased the accuracy of details and reduced manual effort.

The following section demonstrates the old data entry flow and the new improved flow.

How has the data entry flow changed?

Previous Flow

Previous Data Entry Flow
  • Partners and the operations team used to fill the forms manually.
  • ID images were validated by the central teams.

New Flow

New Data Entry Flow
  • System checks for ID image correctness and rejects incorrect images.
  • System automatically populates information from the uploaded ID images.

The following section explains the automated details filling system.

How does the automated details filling system work?

Automated Details Filling System

How does the system extract details from images?

The system extracts the details from an ID image using the following steps:

  1. Extract raw text from document using OCR
  2. Validate document based on raw text
  3. Parse relevant information from raw text using document parser
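As a rough sketch, the three steps can be wired together like this (the function names and the OCR stub are illustrative, not our production code):

```python
import re
from typing import Optional


def run_ocr(image_bytes: bytes) -> str:
    """Step 1: send the image to an OCR engine and return the raw text.
    Stubbed here; in production this would call an OCR API."""
    raise NotImplementedError


def is_valid_document(raw_text: str) -> bool:
    """Step 2: cheap text-based validation, e.g. a PAN card must
    mention the issuing authority somewhere in the OCR output."""
    return "INCOME TAX DEPARTMENT" in raw_text.upper()


def parse_fields(raw_text: str) -> dict:
    """Step 3: pull structured fields out with document-specific regexes
    (PAN number format: 5 letters, 4 digits, 1 letter)."""
    pan = re.search(r"[A-Z]{5}[0-9]{4}[A-Z]", raw_text)
    return {"pan_number": pan.group(0) if pan else None}


def extract_details(raw_text: str) -> Optional[dict]:
    """Validate first, then parse; reject invalid documents early."""
    if not is_valid_document(raw_text):
        return None
    return parse_fields(raw_text)
```

Validating before parsing lets the app reject a wrong upload immediately instead of populating the form with garbage.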

What is OCR?

OCR (Optical Character Recognition) is used to convert text present in images into a machine-encoded format. In our case, the images are ID documents. The general steps involved in OCR are:

  1. Image pre-processing: This step includes techniques like image de-skewing, noise removal, binarization, line detection, character segmentation and scaling.
  2. Character classification: Machine learning models are used to classify each segmented character.
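To illustrate the pre-processing step, here is a naive global-threshold binarization over a grayscale pixel grid — a deliberately simplified stand-in for adaptive techniques like Otsu's method:

```python
def binarize(pixels, threshold=128):
    """Map each grayscale value (0-255) to pure black (0) or white (255).
    Real OCR pipelines pick the threshold adaptively rather than
    using a fixed value like this."""
    return [[0 if p < threshold else 255 for p in row] for row in pixels]
```

Binarization strips away lighting and colour variation so that later stages only see "ink" vs "paper".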

We evaluated the performance of several OCR solutions, both proprietary and open-source, on our dataset of PAN card ID images. The dataset had a good mix of high-, medium- and low-quality images based on the sharpness, noise, exposure and size of the image.

The OCRs were tested on their ability to correctly detect the characters present in the name and the identification number. Given our use case, we were expecting mostly medium- or low-quality images in production, so good performance in those categories was important. More information on our analysis can be found here.

OCR Performance

To summarise, Google Vision performed best across the board, even on the low-quality images where other models struggled. Vision's pricing also seemed reasonable, as we expected fewer than 10,000 images per month. Since the ID photos clicked by partners were expected to be of medium to low quality, we went forward with Vision as our first choice of OCR.

How does the system validate documents?

Document type checking is generally done by training a classification model over the expected set of document images. This approach requires training the model on images of every document type to be classified. We went with a simpler approach: using the text present in the document to check whether it is valid. We implemented regex-based validation for the document text. For instance, a PAN card has the title "INCOME TAX DEPARTMENT", a field with the heading "PERMANENT ACCOUNT NUMBER", and text matching the PAN number format. This simple technique has helped us identify invalid documents in production.
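A minimal sketch of that regex-based check for a PAN card (the rules we run in production are richer than this, but the shape is the same):

```python
import re

# PAN number format: 5 uppercase letters, 4 digits, 1 uppercase letter
PAN_NUMBER = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def looks_like_pan_card(raw_text: str) -> bool:
    """Accept the OCR text only if it contains the expected title,
    the expected field heading, and a string in PAN number format."""
    text = raw_text.upper()
    return (
        "INCOME TAX DEPARTMENT" in text
        and "PERMANENT ACCOUNT NUMBER" in text
        and PAN_NUMBER.search(text) is not None
    )
```

Because the check runs on the OCR output rather than on pixels, adding support for a new document type means writing a few new rules, not training a new model.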

What is Document Parsing?

Every standard ID document has a defined format: the document title, field headings, field formats, photo position, barcode position, document number format and so on. We developed regex-based rules to filter the relevant text out of the document. These rules are specific to a document type, as most documents differ in format.

The common steps used to parse fields are:

  1. Remove noise from text
  2. Find field heading line numbers
  3. Process field values based on heading line numbers
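The steps above can be sketched for a single-line field as follows (the noise rules and the "value sits on the next line" convention are illustrative simplifications):

```python
def parse_field(ocr_lines, heading):
    """Find the line carrying the field heading, then return the first
    line after it as the field value."""
    # Step 1: remove noise - drop blank lines and punctuation-only lines
    lines = [l.strip() for l in ocr_lines if l.strip(" .-|")]
    # Step 2: find the heading's line number
    for i, line in enumerate(lines):
        if heading.upper() in line.upper():
            # Step 3: on PAN cards the value conventionally
            # sits on the line below the heading
            return lines[i + 1] if i + 1 < len(lines) else None
    return None
```

The same skeleton is reused per document type; only the headings and the value-extraction rules change.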

The complete document parsing process can be found here.

We were able to operate at an average response time of 7 secs for the automated details filling system. The following section explains how we reduced this to 2 secs.

How did we make the system fast?

The response time of the automated details filling system depended on the size of the captured image. For images larger than 6 MB, the response time used to increase significantly, in some cases even crossing 30 secs. The upload time from the partner app was also high in such cases. This made for a broken user experience and needed fixing. We experimented with image compression on the partner app, and it worked.

  • We scaled the image down to a fixed number of pixels.
  • We compressed the image with the maximum quality possible.
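The scaling step boils down to picking a pixel budget and shrinking both dimensions by the same factor so the aspect ratio is preserved. A sketch of that calculation (the 1-megapixel budget is illustrative; the actual client-side code uses the platform's image APIs):

```python
import math


def scaled_dimensions(width, height, max_pixels=1_000_000):
    """Return (new_width, new_height) such that new_width * new_height
    stays under max_pixels, preserving the aspect ratio.
    Images already under budget are returned untouched."""
    if width * height <= max_pixels:
        return width, height
    # Scale both sides by sqrt(budget / current) so the area hits the budget
    factor = math.sqrt(max_pixels / (width * height))
    return max(1, int(width * factor)), max(1, int(height * factor))
```

Capping pixels rather than file size keeps the characters large enough for OCR while making upload and processing time predictable.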

With these efforts we were able to bring the average response time of the automated details filling system down to 2 secs without affecting the accuracy of OCR. The following section presents the overall results of our new system.

How is the system performing in production?

Before integrating the details parsing system with the app, we ran an analysis to measure the accuracy of the existing manual entries. The analysis was done on one month of data, comprising 432 PAN card images, checking the name, date of birth and document number. We found that in 30% of cases the details stored in the database by manual entry didn't match those present on the document image, i.e. an effective accuracy of 70%. On top of that, there were cases where the correct document image wasn't uploaded at all.

The automated details filling system has been able to improve correctness.

  • In production, with over 1,000 scans in the past month, the automated details filling system has achieved an average response time of 2 secs.
  • In 142 cases, a wrong document was uploaded, and the system blocked every one of those attempts.
  • We've been able to fetch field values accurately in 94% of cases. In the remaining cases, either the fields were not fetched or partners had to edit them.
  • The system has significantly reduced the manual effort required by partners and the operations team to fill in ID details.

Future Scope

  • More document parsers can be integrated to solve the remaining on-boarding form-filling use cases, e.g. Voter Card, Driving Licence, bank cheques and debit cards.
  • The system works fine for single-line fields, but for multi-line fields like the address on an Aadhaar Card, its accuracy is not up to the mark due to the noise introduced by multiple languages. This is still something we're working on.
  • Document checking system needs to be made more robust to handle fraud cases.
  • To increase the accuracy of the system, we are working on building visual cues to guide partners in the process of capturing the images correctly.
  • We plan to release this system as an open-source library in the near future.

About the author
An engineer on weekdays, an artist by weekends, Sourabh works in the Partner Ecosystem Team and loves to use technology in creative ways to make our Partners’ lives easier.

Sounds like fun?
If you enjoyed this blog post, please clap 👏 (as many times as you like) and follow us (UrbanClap Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc.).

You can read up more about us on our publications —
https://medium.com/urbanclap-design
https://medium.com/urbanclap-engineering
https://medium.com/urbanclap-culture
https://www.urbanclap.com/blog/humans-of-urbanclap

If you are interested in finding out about opportunities, visit us at http://careers.urbanclap.com
