Never underestimate Data Collection!

Varishu Pant
Published in Analytics Vidhya · 4 min read · Jul 9, 2020

Hey there, fellow data scientists! This is a fun little story about my first end-to-end capstone project and how it made me realize the importance of a step we usually ignore: collecting data. Happy reading!

The story begins

This is where it all began…

Let me start by giving you a little context. This was the start of 2020, before the world knew what was coming (psst… the pandemic), and I was a student at Praxis Business School, Bangalore. My team and I had just started what was going to be our capstone project for the final term. After going back and forth on ideas, and after multiple brainstorming sessions with different faculty members, we settled on a Computer Vision project: converting scanned handwritten forms (for example bank forms, or in our case, student admission forms) into an Excel sheet containing all the filled-in information (name of student, contact, address, and so on).

An example form would look something like this (but filled in):

Student Admission Form

Since libraries like Tesseract do not work well on handwritten text, we had to build our own workflow, starting from, you guessed it, data collection. Yes, there are datasets like NIST's, which contain images of the letters A–Z and the digits 0–9, but our project also required special characters like '!', '@', and so on.

https://www.nist.gov/data

So we decided to create our own data, which would let us prove the concept to ourselves by trying it out at a small scale first.

Challenges

Now we needed a variety of handwritings to train a good AI architecture, so we took the help of our fellow Praxis students. The first challenge arose:

Creating a standard framework that minimizes confusion and mistakes, making it easier to scan the forms and actually collect the digital images in the system.

With the help of our Professor Gourab Nath, we came up with this:

Data Collection Form Template

Standardized forms made up of 20 rows, each containing 31 blocks. The information to be filled in, and the order in which it was to be filled, was fixed beforehand. To be precise, the first 4 rows spanned the following sentences:

THE QUICK BROWN FOX JUMPS OVER LAZY DOG/"'*&():;.,@-0123456789

THE FIVE BOXING WIZARDS JUMP QUICKLY /"'*&():;.,@-0123456789

These were chosen because they encompass the 51 classes we were targeting. These 4 rows were repeated 5 times over in the whole form.
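As a quick, hypothetical sanity check (this is an illustration, not the project's code), the label set can be enumerated directly from the two template rows. Note that the two strings as printed above yield 49 distinct symbols once spaces are dropped, so the full 51-class target presumably included a couple of characters beyond these rows:

```python
import string

# The two template rows as printed on the form (spaces separate words
# but are not themselves a character class).
ROW_A = "THE QUICK BROWN FOX JUMPS OVER LAZY DOG/\"'*&():;.,@-0123456789"
ROW_B = "THE FIVE BOXING WIZARDS JUMP QUICKLY /\"'*&():;.,@-0123456789"

classes = sorted(set(ROW_A + ROW_B) - {" "})

# Every letter and digit is covered by the pangrams alone.
assert set(string.ascii_uppercase) <= set(classes)
assert set(string.digits) <= set(classes)
print(len(classes), "distinct symbols")
```

This is also why pangrams were used: two short sentences plus one symbol run cover the entire alphabet, all digits, and the special characters in just 4 rows.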

Empty templates, along with instructions for filling them in, were handed out to our helpful friends, and in the end we had 120 forms, each in a unique handwriting.

Time to scan them and cut out each box as a separate character image. The second challenge arose:

Tilt in the forms!

Broad white margin given to handle tilt of up to 5 degrees

If the form had tilted while the template was being printed, or if we accidentally tilted the filled-in form while scanning it, detecting contours and extracting the images was a nightmare. Our initial response was to redo the whole collection and make sure nothing was tilted this time around, but we soon realized that if this is going to be a product, it needs to be resistant to a bit of tilt: we cannot expect everyone in the world to hold their phone cameras perfectly parallel to the document!
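One common way to build in that tilt resistance (sketched here with Pillow; the project's actual training pipeline may have differed) is to apply small random rotations and shears to each character crop during training:

```python
import random
from PIL import Image

def augment(img, max_rot=5.0, max_shear=0.1):
    """Apply a random rotation (degrees) and horizontal shear to a crop."""
    angle = random.uniform(-max_rot, max_rot)
    shear = random.uniform(-max_shear, max_shear)
    # Rotate, filling the exposed corners with white, like blank paper
    out = img.rotate(angle, resample=Image.BILINEAR, fillcolor=255)
    # Affine shear: x' = x + shear * y, again with a white fill
    out = out.transform(out.size, Image.AFFINE,
                        (1, shear, 0, 0, 1, 0),
                        resample=Image.BILINEAR, fillcolor=255)
    return out

# Usage: augment a grayscale ('L' mode) character crop
crop = Image.new("L", (32, 32), 255)
aug = augment(crop)
```

Because the fill color is white, the augmented crops still look like characters on paper rather than characters with black wedges in the corners.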

To achieve that, we used data augmentation techniques like shear and rotation while training, and to handle tilt during collection, a broad white margin on all four sides was incorporated into the standard form template. Also, during character extraction, contours were sorted according to their pixel coordinates with some error margin. Character extraction had multiple steps, but that's a story for another time. After all that, we got this:

Character Images Post Extraction
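The "contour sorting with some error margin" step mentioned above can be sketched as follows. The function and tolerance here are illustrative, not the project's code; it operates on (x, y, w, h) bounding boxes such as those a contour detector would return:

```python
def sort_boxes(boxes, row_tol=15):
    """Return boxes in reading order despite small vertical jitter from tilt."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):   # top to bottom
        for row in rows:
            # Same row if y is within the tolerance of the row's first box
            if abs(row[0][1] - box[1]) <= row_tol:
                row.append(box)
                break
        else:
            rows.append([box])                      # start a new row
    # Within each row, sort left to right, then flatten
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]

# Boxes from a slightly tilted scan: boxes on one printed row
# sit a few pixels higher or lower than their neighbors
boxes = [(120, 52, 30, 30), (20, 48, 30, 30), (70, 50, 30, 30),
         (20, 110, 30, 30), (70, 113, 30, 30)]
ordered = sort_boxes(boxes)
```

Without the tolerance, a tilt of even a couple of degrees shuffles characters between rows, because boxes on the same printed row no longer share an exact y coordinate.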

Conclusion

Did you know that 300 dpi is the standard recommended resolution for OCR projects? Scanning at anything less creates a whole lot of problems, and we learned that the hard way! Anyway, I could go on and on…
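A handy back-of-the-envelope check (an illustration, not project code): the pixel dimensions a page should have when scanned at 300 dpi. A scan that comes out much smaller than this was probably made at a lower resolution.

```python
MM_PER_INCH = 25.4

def page_pixels(width_mm, height_mm, dpi=300):
    """Pixel dimensions of a page scanned at the given dpi."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))

a4 = page_pixels(210, 297)   # A4 paper at 300 dpi -> roughly 2480 x 3508
```

So an A4 form scanned at 300 dpi should be about 2480 × 3508 pixels; half that in each dimension means the scan was made at roughly 150 dpi.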

I want to conclude with this: we have all seen that flowchart of an ideal end-to-end Data Science project; it goes something like this:

But I think it would be more educational and informative if we add the percentage of time taken in each step!

Data Acquisition and Understanding may well be the toughest part of the project. Thanks for reading!

Here are the code files for the project on my GitHub:
