CPAR-Hindi Digit and Character Dataset

An MNIST-style Hindi dataset containing handwritten digits and characters is here

Gagandeep Singh
Sep 29 · 3 min read
CPAR Dataset

Where did this dataset come from?

The CPAR dataset was collected by Rajiv Kumar and two other people. Luckily, Rajiv Kumar joined my college as a faculty member in the same year I was finishing my undergraduate degree. He gave me this dataset and asked me to convert it into MNIST form and upload it. His paper is listed in the references at the bottom of the article.

How was this dataset collected?

According to Rajiv Kumar, it took almost 3 years to collect this dataset. He collected data from writers belonging to diverse population strata. They differed in age (from 6 to 77 years), gender, educational background (from 3rd grade to postgraduate level), profession (software engineers, professors, students, accountants, housewives and retired persons) and region (Indian states: Bihar, Uttar Pradesh, Haryana, Punjab, National Capital Region (NCR), Madhya Pradesh, Karnataka, Kerala, Rajasthan, and countries: Nigeria, China and Nepal). All this was possible because he was a professor at a university which had students from all over India.

If you go through his paper, you’ll see that he created a form which had to be filled in by hand. These forms were then processed using software. More information about how he collected the data is in his paper.

How does this dataset look?

The dataset consists of digits and characters.

Character Dataset
Digit Dataset

Let’s install the package and import it to take a look:

$ pip install cpar

Let’s look at the character dataset first:

from cpar.char import load_data
# load the character images (x) and their labels (y), already split into train and test sets
train_x, test_x, train_y, test_y = load_data()
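
Once load_data() has returned, a quick sanity check is to count the samples and see how many of them fall into each class. The sketch below is only an illustration: summarize_split is a hypothetical helper name, and it assumes the labels come back as a flat sequence with one label per image, which is an assumption about the package rather than something stated in this article.

```python
from collections import Counter


def summarize_split(images, labels):
    """Return basic statistics for one split (train or test) of a dataset.

    Assumes `images` and `labels` are equally long sequences and that
    `labels` is flat: one hashable label (int or str) per image.
    """
    if len(images) != len(labels):
        raise ValueError("images and labels have different sample counts")
    # .item() turns NumPy scalars into plain Python values;
    # ordinary Python labels are passed through unchanged.
    plain_labels = [lab.item() if hasattr(lab, "item") else lab for lab in labels]
    counts = Counter(plain_labels)
    return {
        "num_samples": len(plain_labels),
        "num_classes": len(counts),
        "samples_per_class": dict(sorted(counts.items())),
    }


if __name__ == "__main__":
    # Synthetic stand-in data so the sketch runs without the cpar package.
    fake_images = [[[0] * 4 for _ in range(4)] for _ in range(6)]
    fake_labels = [0, 1, 1, 2, 2, 2]
    print(summarize_split(fake_images, fake_labels))
```

With the real data, the same helper could be called as summarize_split(train_x, train_y) and summarize_split(test_x, test_y), provided the assumption about the label format holds.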

My first impression of this dataset was that it is tough. Why? Let’s look at an example:

Few Variations of Aa
Similar looking characters

Such small differences make these characters very difficult to tell apart.
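
To put a rough number on how small such a difference can be, one option is to binarize two images of the same size and count the pixel positions where they disagree. This is only an illustrative sketch, not a measure used by the dataset or the paper; the function name differing_pixel_fraction, the threshold of 128 and the tiny 4x4 example images are all assumptions made for the example.

```python
def differing_pixel_fraction(image_a, image_b, threshold=128):
    """Fraction of pixel positions at which two grayscale images differ.

    Each image is a sequence of rows of pixel intensities (0-255).
    A pixel counts as "ink" when its value is >= threshold.
    """
    if len(image_a) != len(image_b):
        raise ValueError("images have different heights")
    total = 0
    differing = 0
    for row_a, row_b in zip(image_a, image_b):
        if len(row_a) != len(row_b):
            raise ValueError("images have different widths")
        for a, b in zip(row_a, row_b):
            total += 1
            # Compare the binarized values, not the raw intensities.
            if (a >= threshold) != (b >= threshold):
                differing += 1
    if total == 0:
        raise ValueError("images are empty")
    return differing / total


if __name__ == "__main__":
    # Two made-up 4x4 "glyphs" that differ in a single pixel.
    glyph_a = [
        [255, 255, 255, 255],
        [0, 255, 0, 0],
        [0, 255, 0, 0],
        [0, 255, 0, 0],
    ]
    glyph_b = [row[:] for row in glyph_a]
    glyph_b[3][2] = 255  # one extra stroke pixel
    print(differing_pixel_fraction(glyph_a, glyph_b))  # 1 of 16 pixels -> 0.0625
```

Two different characters whose images disagree in only a few percent of their pixels leave a classifier very little signal to separate them.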

Now, let’s look at the digit dataset:

Devanagari digits

What are the challenges in handwriting recognition for the Hindi language?

Hindi is much more complex to recognize than English, because even a single character can have many variations, which we refer to as ‘matra’. Here are a few examples:

homonyms in Hindi

Given these small differences between words, reading them correctly becomes a very difficult task.

Conclusion

A lot of research on handwriting recognition for the Hindi language is going on. Going through this dataset gave me a feel for the challenges involved. Hindi is much more complex than English, and some words can be written in several forms, which makes recognition even harder.

The GitHub repo for this project is https://github.com/gaganmanku96/CPAR

References

  1. A Benchmark Dataset for Devanagari Document Recognition Research

Written by Gagandeep Singh, Data Scientist at Zykrr (https://www.linkedin.com/in/gaganmanku96/). Published in Analytics Vidhya (https://www.analyticsvidhya.com).
