CPAR-Hindi Digit and Character Dataset
A Hindi MNIST-style dataset containing digits and characters is here
Where did this dataset come from?
The CPAR dataset was collected by Rajiv Kumar and two other people. Luckily, Rajiv Kumar joined my college as a faculty member in the same year I was about to finish my undergraduate degree. He gave me this dataset and asked me to convert it into MNIST form and upload it. I’ve put a link to his paper at the bottom of the article.
How was this dataset collected?
According to Rajiv Kumar, it took almost 3 years to collect this dataset. He collected data from writers belonging to diverse population strata. They differed in age group (from 6 to 77 years), gender, educational background (from 3rd grade to postgraduate level), profession (software engineers, professors, students, accountants, housewives and retired persons) and region (Indian states: Bihar, Uttar Pradesh, Haryana, Punjab, National Capital Region (NCR), Madhya Pradesh, Karnataka, Kerala, Rajasthan; and countries: Nigeria, China and Nepal). All this was possible because he was a professor at a university which had students from all over India.
If you go through his paper you’ll see he created a form which had to be filled in by hand. These forms were then processed using software. More info about how he collected the data is in his paper.
How does this dataset look?
The dataset consists of digits and characters.
Let’s install the package and import it to take a look:
$ pip install cpar
Let’s look at the character dataset first:
from cpar.char import load_data
train_x, test_x, train_y, test_y = load_data()
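Once the arrays are loaded, a quick sanity check is to see how many samples a split has, what size each image is, and how the classes are distributed. The helper below is a minimal sketch and not part of the cpar package: `summarize_split` is a hypothetical name, and it assumes each image is a 2-D grid and each label is an integer. It runs on a tiny synthetic stand-in so it works without the dataset installed.

```python
from collections import Counter


def summarize_split(images, labels):
    """Return sample count, per-image shape and per-class counts for one split.

    Hypothetical helper: assumes 2-D images and integer labels.
    Works with nested lists as well as NumPy arrays.
    """
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    if len(images) == 0:
        raise ValueError("split is empty")

    first = images[0]
    image_shape = (len(first), len(first[0]))  # (rows, columns)
    class_counts = Counter(int(label) for label in labels)

    return {
        "samples": len(images),
        "image_shape": image_shape,
        "class_counts": dict(sorted(class_counts.items())),
    }


# Synthetic stand-in: five blank 4x4 "images" with made-up labels.
demo_images = [[[0] * 4 for _ in range(4)] for _ in range(5)]
demo_labels = [0, 1, 1, 2, 0]

summary = summarize_split(demo_images, demo_labels)
print(summary)
```

With the real data, the same call could be made on the training images and training labels returned by `load_data()`, provided they follow the layout assumed here.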
My first impression of this dataset was that it is tough. Why? Let’s look at a case.
Such small differences can make the characters very hard to tell apart.
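To put a rough number on "small differences", one option is to count the fraction of pixels at which two images disagree. This is only an illustrative sketch: `pixel_difference` is a hypothetical helper, and the two 4x4 glyphs are toy patterns made up for the example, not samples from the CPAR dataset.

```python
def pixel_difference(a, b):
    """Fraction of pixel positions at which two equally sized images differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same shape")
    total = sum(len(row) for row in a)
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


# Toy binary glyphs (made up): they differ in a single pixel.
glyph_a = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
glyph_b = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],  # one extra stroke pixel
]

ratio = pixel_difference(glyph_a, glyph_b)
print(f"{ratio:.2%} of pixels differ")
```

Two classes whose images differ in only a few percent of their pixels leave a classifier very little signal to work with, which is the difficulty described above.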
Now, let’s look at the digit dataset.
What are the challenges in handwriting recognition for the Hindi language?
Hindi is very complex compared to English because even a single character can appear in many variations, produced by the vowel signs we refer to as ‘matra’. Here are a few examples:
Given these small differences between words, reading them correctly becomes very difficult.
A lot of research on handwriting recognition for the Hindi language is going on. After going through this dataset, I got a feel for what the challenges are. Hindi is much more complex than English, and a few words can be written in several forms, which makes the task even harder.
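The effect of a matra can also be seen directly in text. The sketch below attaches several Devanagari vowel signs to the consonant क (ka) using their Unicode code points and prints the resulting syllables with the official Unicode name of each sign. It is only an illustration of how many visually similar forms one base character can take, and it does not use the dataset.

```python
import unicodedata

BASE = "\u0915"  # क, DEVANAGARI LETTER KA

# Dependent vowel signs (matras) from the Devanagari Unicode block.
MATRAS = [
    "\u093e",  # aa
    "\u093f",  # i
    "\u0940",  # ii
    "\u0941",  # u
    "\u0942",  # uu
    "\u0947",  # e
    "\u0948",  # ai
    "\u094b",  # o
    "\u094c",  # au
]

# Each syllable is the base consonant followed by one vowel sign.
syllables = [BASE + matra for matra in MATRAS]

for matra, syllable in zip(MATRAS, syllables):
    print(syllable, unicodedata.name(matra))
```

Every line of output is a different syllable built on the same consonant, and in handwriting several of them differ by a single short stroke.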
The GitHub repo for this project is https://github.com/gaganmanku96/CPAR