Attending a Microsoft One Commercial Partner (OCP) Virtual Hackathon

Sumit Kumar
Version 1
Published Mar 9, 2021 · 6 min read

Microsoft hosted a virtual hackathon from 2nd to 4th March 2021, which generated excellent ideas. The One Commercial Partner (OCP) Program has been accelerating growth by powering new markets and solutions that transform the way enterprises work. As an endeavour to innovate and lead the new, Acclaim, in collaboration with Microsoft, launched the Microsoft OCP Program, a journey that promises plenty of learning and cool rewards.


Version 1 — Innovation Team:

A team of 4 from the Version 1 Innovation Team participated in the Hackathon. The Innovation Team has been working on a Form Review and Input Tool with several of its government clients. The tool is used in the processing of applications consisting of numerous PDF/scanned forms. The main aim of attending the hackathon was to explore the possibilities for further post-processing and validation of the data before final review and submission.

Use Case: Smart Data Capture

We chose a common problem our clients face: they spend a huge amount of time and resources on manual data entry. Form Recognizer does an excellent job of extracting printed text from forms, but handwritten forms can be very tricky. We therefore wanted to build a solution on top of Form Recognizer's capabilities by automatically suggesting corrections when it falters on handwritten text.

Cleaning up Form Recognizer output for handwritten forms by highlighting suggested corrections and building validations (using vector spaces).

The add-on feature could be used on fields such as:

  • Name (First & Last)
  • Address
  • County
  • Country
  • Email
  • Eircode

Text Auto Suggestion/Correction on poorly recognised data from Azure Form Recognizer

Azure Form Recognizer:

Form Recognizer is an AI-powered document extraction service that understands forms and extracts key-value pairs, tables, and text from documents such as W-2 tax statements, completion reports, invoices, and purchase orders. It supports printed, handwritten, and mixed-mode (printed and handwritten) forms.

In our solution, we developed a Flask application that used a custom Form Recognizer model to extract the aforementioned fields from scanned forms that were mostly handwritten.

To read more about setting up Azure Form Recognizer and building a custom model, read our previous blog.

Challenges:

While analysing the documents, we found that Form Recognizer had difficulty processing documents with poor handwriting or that weren't scanned properly: a few characters weren't recognised correctly.

Our Solution:

Add an extra layer of validation that uses predefined regex patterns (specific to each field) to identify whether the extracted data is in a valid format. If any field value fails validation, we send that data to our auto-correction model (explained below) to generate suggestions close to the actual answer.

Field Validation

For the Hackathon, we focused on validating only some of the fields, such as MPRN, Eircode, and RECI Number. We considered the following parameters for validating each field.

  1. Data Type: String, Date, Integer.
  2. Pattern: Regular Expression check, including allowed characters.
  3. Rule: Logic-based, e.g., number range, date range, maximum value, etc.
  4. Checks: Internal or external lookups; e.g., to confirm that a recognised MPRN actually exists, we need to look it up in a database or external service.

We created a Python script that accepted the Form Recognizer results and validated them against a lookup JSON file containing each field's regex pattern. If a value passed the validations defined above (Pattern, Rule, Data Type, and Checks), we moved on to the next step in the workflow. If validation failed, we attempted to auto-correct the value and suggest potential options to the user using machine learning and vector-space algorithms.
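As an illustration, the validation step might look like the following sketch. The field names and regex patterns here are simplified, hypothetical stand-ins for our lookup JSON file, not the production rules:

```python
import re

# Hypothetical per-field patterns, standing in for the lookup JSON file.
FIELD_PATTERNS = {
    "eircode": r"^[A-Za-z]\d{2}\s?[A-Za-z0-9]{4}$",  # simplified, e.g. "D02 X285"
    "mprn": r"^\d{11}$",                             # 11-digit meter point number
    "email": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
}

def validate_fields(results: dict) -> dict:
    """Split Form Recognizer output into valid fields and failures.

    Fields that fail their regex are collected so they can be routed
    to the auto-correction model for suggestions.
    """
    valid, failed = {}, {}
    for field, value in results.items():
        pattern = FIELD_PATTERNS.get(field)
        if pattern and re.match(pattern, value.strip()):
            valid[field] = value
        else:
            failed[field] = value
    return {"valid": valid, "needs_correction": failed}

checked = validate_fields({"eircode": "D02 X285", "mprn": "12345"})
```

Here the Eircode passes its pattern, while the truncated MPRN fails and would be handed to the auto-correction step described next.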

Text Auto-Correction

To auto-correct fields, we measured the similarity between correct and incorrect words. Misspelt words are often closely related to their correct counterparts, so when these words are visualised on a 2-D plane, similarly spelled words sit close to each other.

We developed a similar model to auto-correct Irish names, built on the following three concepts.

  1. Word Embedding:

A word embedding is a vector representation of a word where each element of the vector carries some weight. It is a learned representation of text in which words with similar meanings have similar representations. The benefit of this technique is that the vectors are low-dimensional and dense (most values in the vector are non-zero).

The main objective is for words with similar context to occupy close spatial positions. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e., the angle should be close to 0. Intuitively, we introduce some dependence of one word on the others.
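The cosine measure above can be sketched in a few lines. The three toy "embeddings" below are made-up illustrative vectors, not output from any trained model:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm

# Toy 3-dimensional "embeddings": similar words point in similar directions.
sean  = [0.90, 0.10, 0.20]
shaun = [0.85, 0.15, 0.25]
table = [0.10, 0.90, 0.40]

print(cosine_similarity(sean, shaun))  # close to 1 (small angle)
print(cosine_similarity(sean, table))  # noticeably smaller
```

Vectors pointing in nearly the same direction score close to 1, which is exactly the property the embedding training tries to induce for related words.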

2. Word2Vec:

Word2Vec is a method for constructing the embeddings mentioned above. The embeddings can be learned using two approaches, both involving neural networks.

  • Continuous Bag of Words (CBOW): takes each word's context (one or more context words) as input and tries to predict the word corresponding to that context.
  • Skip Gram: uses the target word (whose representation we want to generate) as input to predict its context.

For more details on Word2Vec and its approaches, read this link.

3. Char2Vec:

The Word2Vec model operates with a fixed vocabulary, which usually doesn't cover rare words or typos. To address this limitation, we needed word embeddings based only on spelling, so that similar vectors correspond to similarly spelt words.

The Char2Vec model represents each symbol sequence of arbitrary length with a fixed-length vector; the distance metric between vectors represents the similarity in the words' spelling.

For the Hackathon, we created a custom Char2Vec model covering Irish and English characters and trained the LSTM model. After training, we created a vector space of all the available names. The misspelt input name was vectorised, and the cosine distance was calculated against the rest of the vector space; the word with the minimum cosine angle (i.e., the highest similarity) was returned.
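The trained Char2Vec/LSTM model can't be reproduced here, but the nearest-name lookup can be sketched with a crude stand-in: character-bigram count vectors give the same "similar spelling → nearby vectors" behaviour. The name list is purely illustrative:

```python
import math
from collections import Counter

def char_bigram_vector(word):
    """Crude spelling embedding: counts of character bigrams.
    (A simple stand-in for the trained Char2Vec/LSTM vectors.)"""
    w = f"^{word.lower()}$"  # boundary markers so first/last letters matter
    return Counter(w[i:i + 2] for i in range(len(w) - 1))

def cosine(u, v):
    """Cosine similarity between two sparse bigram-count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def suggest(misspelt, vocabulary):
    """Return the vocabulary name whose spelling vector is closest."""
    target = char_bigram_vector(misspelt)
    return max(vocabulary, key=lambda name: cosine(target, char_bigram_vector(name)))

names = ["Saoirse", "Siobhan", "Aoife", "Niamh"]  # illustrative Irish names
print(suggest("Saorise", names))  # prints "Saoirse"
```

The real model replaces the bigram counts with learned LSTM vectors, but the final step (maximising cosine similarity over the vector space of known names) is the same.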

Learnings and Conclusion

Overall, the Hackathon was a great experience, meeting ambitious developers, getting positive feedback on our strategy from Microsoft, and spreading the word about Version 1.

We also identified a few Azure Form Recognizer constraints, raised the issues with Microsoft's Product Team, and received a quick response from them. It is worth considering future hackathons they run as an opportunity to overcome technical challenges: Microsoft provides great support, with a much quicker turnaround and better access to people (and product teams) than usual.

About the Authors

Sumit Kumar is currently a Data Scientist in Version 1's Innovation Labs, developing innovative solutions and proofs of value for customers to ensure Version 1 remains at the forefront of disruptive technology.

Ankit Kumar is currently an Innovation Data Scientist Consultant, working within Version 1’s Innovation Labs.
