Aditya Mishra · Published in ProgrammersClub · Jan 27, 2018 · 5 min read

Training a Deep Learning Model on Handwritten characters using Keras

This is the first part of a two-part series on training your own model on handwritten characters. This part focuses on generating your own character dataset. In the next part, we will discuss how to create and train our own model on the dataset generated here and test it on our own handwritten sentences.

I had been searching for labelled handwritten character datasets for ages. However, there aren’t many publicly available. The datasets that do exist are either derived from natural images, too coarse to be usable, or available only for a fee.

One day, while surfing the web, I came across https://www.crowdflower.com/data-for-everyone/, which had a CSV file containing 4 columns (unit_id, image_url, transcription, first_or_last) and about 130,000 rows. The image_url column contained URLs to images, and the transcription column contained the word written in each image.

Some of the initial rows in the CSV file. The second column contains the link to the image, and the third column contains the name present in the image.

Downloading Images

The first step was to read the CSV file, extract each URL from it, and download the corresponding images. This is easily done with the Pandas and Requests libraries in Python.

Pandas lets us read the CSV file into a DataFrame and navigate rows and columns using the iloc and loc indexers. We then use the get function from the Requests library to fetch each URL, and iter_content together with write to stream the received data into a file.

import pandas as pd
import requests
import os

dataframe = pd.read_csv('first_and_last_names.csv')
out_dir = 'downloaded_images/'
os.makedirs(out_dir, exist_ok=True)

for i, url in enumerate(dataframe['image_url']):
    print('Downloading Image\t{0}'.format(i))
    # stream=True lets us write the response to disk in chunks
    image = requests.get(url, stream=True)
    if image.status_code == 200:
        with open(out_dir + 'image_' + str(i) + '.jpg', 'wb') as f:
            for chunk in image.iter_content(1024):
                f.write(chunk)
Downloaded Image

Cropping Images

As you can see, there is still an unwanted section in the above image (the leftmost corner), which we would like to get rid of. On investigation, I found that the unwanted section (RENOM with some number) occupied only a (75, 36) region of the original (388, 36) image. Since OpenCV loads an image as a 2D array, we can simply slice away the unwanted columns and rewrite the image.

import cv2
import glob
import os

base_directory = './downloaded_images/'
os.makedirs('./cropped_images', exist_ok=True)
total_files = len(glob.glob(base_directory + '*.jpg'))

for i in range(total_files):
    img = cv2.imread(base_directory + 'image_' + str(i) + '.jpg')
    if img is None:
        # imread returns None for missing or corrupt files
        print('Image {0} not downloaded'.format(i))
        continue
    # Keep all 36 rows, drop the first 74 columns (the unwanted stamp)
    img = img[0:36, 74:388]
    cv2.imwrite('./cropped_images/img_' + str(i) + '.jpg', img)
Cropped Image

Segmenting characters from Images

The next task is to segment each character from the image. In the above example, ALICE will be segmented into 5 characters (‘A’, ‘L’, ‘I’, ‘C’, ‘E’). For this purpose, we (Jawad Shaikh and I) used histograms as the basis. We created density bins using the vertical histogram: wherever a character is present in the image there is a spike, and wherever there is blank space there is a valley. This tells us where the characters lie along one axis. The same is then applied horizontally, and the intersection of the two individual (horizontal and vertical) histograms detects each character in the image.

The code will be uploaded soon.
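
In the meantime, here is a minimal sketch of the idea, assuming a cropped grayscale image with dark ink on a light background; the Otsu threshold and the file names are illustrative choices, not our exact implementation.

import cv2
import numpy as np

img = cv2.imread('./cropped_images/img_0.jpg', cv2.IMREAD_GRAYSCALE)
# Binarise so that ink pixels are 1 and the background is 0
_, binary = cv2.threshold(img, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Vertical histogram: ink density per column; spikes are characters,
# valleys (zero columns) are the blank spaces between them
col_density = binary.sum(axis=0)
spans, in_char, start = [], False, 0
for x, density in enumerate(col_density):
    if density > 0 and not in_char:
        in_char, start = True, x
    elif density == 0 and in_char:
        in_char = False
        spans.append((start, x))
if in_char:
    spans.append((start, len(col_density)))

# Horizontal histogram inside each vertical span: intersecting the two
# profiles gives a tight bounding box around every character
for n, (left, right) in enumerate(spans):
    row_density = binary[:, left:right].sum(axis=1)
    rows = np.nonzero(row_density)[0]
    top, bottom = rows[0], rows[-1] + 1
    cv2.imwrite('character_{0}.jpg'.format(n), img[top:bottom, left:right])

With real scans you may also need to merge spans that are only a few pixels wide, since a broken stroke can split one letter into two.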

Notice the vertical and horizontal histograms around the image. Their intersection is used to detect the individual characters in the entire word.
The segmented letters: A, L, I, C, E.

Matching between Segmented Characters and Transcription

Now that we have segmented the characters, we need to match them with the transcription column (the name) in the CSV file. There is a chance that the segmentation does not work properly, or that there is a mismatch between the name in the file and the corresponding downloaded image, e.g. segmentation yields 4 characters but the transcription column contains a five-letter name. We need to handle those cases. We also need to store each character image in the directory for its letter.

import os
import glob
import cv2
import pandas as pd

dataframe = pd.read_csv('./first_and_last_names.csv')
dataframe = dataframe.dropna(subset=['transcription'])
total_files = len(glob.glob('./cropped_images/*.jpg'))
total_names = dataframe.iloc[0:total_files, :-1]

for i, name in enumerate(total_names['transcription']):
    word = list(name)
    # img is the cropped image for row i, and all_digit_rects holds the
    # (top, bottom, left, right) box of every character found in it by
    # the histogram segmentation above
    # Skip mismatches between segmentation and transcription
    if len(word) == len(all_digit_rects):
        for idx, rect in enumerate(all_digit_rects):
            current_character = img[rect[0]:rect[1], rect[2]:rect[3]]
            # One directory per letter, e.g. './A/'
            if not os.path.exists(word[idx]):
                os.makedirs(word[idx])
            cv2.imwrite('./' + word[idx] + '/' + str(i) + '.jpg',
                        current_character)

This way, I ended up with 26 directories, one for each letter (A-Z), each containing the images of that character extracted from all the downloaded images.

Directory A containing images of character ‘A’ segmented from all the downloaded images
Directories A-Z containing images corresponding to each character
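
As a small preview of how this layout will be consumed, Keras can read a class-per-directory dataset directly. Here is a minimal sketch, assuming the 26 letter directories have been moved under a hypothetical ./characters/ parent folder, with an illustrative 32×32 grayscale input size.

from keras.preprocessing.image import ImageDataGenerator

# flow_from_directory expects one subdirectory per class, which is
# exactly the A-Z layout produced above
datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = datagen.flow_from_directory(
    './characters/',          # hypothetical parent of the A-Z directories
    target_size=(32, 32),     # illustrative size, not fixed by this part
    color_mode='grayscale',
    class_mode='categorical',
    batch_size=32)

flow_from_directory infers the 26 class labels from the directory names, so no separate label file is needed.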

In the next part, we will:

  1. Create our own model and test it on our own handwritten images.
  2. Use transfer learning with VGG16 to achieve over 95% accuracy on the test and validation data.

Thank you for reading.
