Devanagari Handwritten Digits Classifier — fast.ai

Apr 17, 2019

abhik jha (@abhikjha) | Twitter


Pre-Cursor

After finishing the Deeplearning.ai specialization a few weeks ago, I decided to step up my quest for knowledge in the field of Deep Learning and joined Fast.ai (Practical Deep Learning for Coders, Part 1). Those who know this course know that it is one of the best available for learning the concepts and practical skills of Deep Learning. After the huge success of the course, Jeremy Howard and Rachel Thomas, the founders of fast.ai, decided to launch a brand new version for 2019.

Jeremy has a unique teaching style: a top-down approach. In the first lesson we simply write code without worrying about the complex math behind it. In subsequent lessons he slowly opens up the code and shows the theory behind it. I found his teaching style amazingly refreshing and look forward to completing both parts in the first half of this year. A big shout-out to both of them for their amazing work for the data science community.

This article is inspired by the 1st lesson, which teaches Convolutional Neural Networks (CNNs). I am planning to write more articles inspired by this and later lessons, so stay tuned.

The Problem Statement

The problem we are looking to solve here is an interesting one. We will create a CNN model that tries to predict handwritten characters of the Devanagari script. You can consider this a desi version of MNIST. However, it is a somewhat harder problem than MNIST, mainly for two reasons:

The number of classes is much larger (10 in MNIST vs 46 here).
A few characters in the Devanagari script are so similar to each other that even humans sometimes find them difficult to tell apart. For example, look at these two characters (you may notice a dot after the character in the second image; that is the only difference):

this is called “da”

this is called “kna”

Data Source

The dataset contains mixed categories of Devanagari numerals (10 classes) and consonants (36 classes). It is explicitly separated into a train set and a test set. The train set contains 78,200 samples in total, with 1,700 samples for each of the 46 classes, and the test set contains 13,800 samples in total, with 300 samples for each of the 46 classes.

The dataset was collected from school-level students in Nepal and can be found here
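As a quick sanity check, the totals quoted above follow directly from the per-class counts. The snippet below is only a small arithmetic sketch; the variable names are mine and are not part of the dataset.

```python
# Class and per-class sample counts as stated in the dataset description.
NUM_NUMERAL_CLASSES = 10
NUM_CONSONANT_CLASSES = 36
TRAIN_PER_CLASS = 1700
TEST_PER_CLASS = 300

num_classes = NUM_NUMERAL_CLASSES + NUM_CONSONANT_CLASSES
train_total = num_classes * TRAIN_PER_CLASS
test_total = num_classes * TEST_PER_CLASS

print(num_classes, train_total, test_total)  # 46 78200 13800
```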

Software

We needed a GPU to run the model, and since the data was readily available on Kaggle, I used Kaggle Kernels. Other available choices were Google Colab, Paperspace, AWS, etc.
The fast.ai library, which is a wrapper around PyTorch, an open-source deep learning framework by Facebook.
And, of course, Python.

Code and Workings

After loading the data, the code below transforms it. In this step, with two lines of code, we do image augmentation, cropping, padding and other things, then prepare and normalize the data.

# Default augmentations, with horizontal flips turned off
tfms = get_transforms(do_flip=False)
# Read the 'train' and 'test' folders, resize images to 32x32, and normalize with ImageNet statistics
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, train='train', valid='test', size=32, bs=32).normalize(imagenet_stats)
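To make the normalization step less of a black box, here is a minimal plain-Python sketch of what normalizing with ImageNet statistics does to a single RGB pixel. The mean and standard deviation values are the standard ImageNet ones; the function name `normalize_pixel` is my own and is not part of fast.ai.

```python
# Per-channel ImageNet mean and standard deviation (RGB, pixel values scaled to [0, 1]).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Return (value - mean) / std for each channel of one RGB pixel."""
    return [(v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]

# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(IMAGENET_MEAN))  # [0.0, 0.0, 0.0]

# A pure white pixel lands between 2 and 3 standard deviations above the mean.
white = normalize_pixel([1.0, 1.0, 1.0])
print(white)
```

Using the same statistics that the pretrained model saw during its original training keeps our inputs in the range its weights expect.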

Model architecture: we will use ResNet50, a pretty powerful CNN model pretrained on the ImageNet dataset, and train it using Transfer Learning. This essentially means that we download the weights of the ResNet50 model trained on ImageNet and use them as the starting point for our problem. The underlying assumption is that ImageNet is so large and varied that the features a model learns from it carry over well to other, unseen kinds of images.

learn = cnn_learner(data, models.resnet50, metrics=accuracy, model_dir="/tmp/model/")
# 4 here is the number of epochs
learn.fit_one_cycle(4)

The great thing about the fast.ai library is that writing code is hassle free: in a few lines of code we can build world-class models.

Here, you can see that instead of “fit” we are using “fit_one_cycle”, a recent development inspired by Leslie Smith’s paper (and the cool thing is that it is already implemented in the fast.ai library). Using “fit_one_cycle” gives better results and trains faster than the usual “fit” method. Based on his experiments, he recommends one cycle of learning rate made of 2 steps of equal length. The max learning rate can be found with the fast.ai library’s learning rate finder (which we will see later), and the min learning rate is usually set at 1/10th of the max. In the first step of the cycle, we increase the learning rate from the minimum to the maximum; in the second step, we come back down to the minimum. The motivation is that, during the middle of training, the high learning rate works as a regularization method and keeps the network from overfitting. It prevents the model from settling in a steep area of the loss function, so that it prefers a flatter minimum.

He further adds that while the learning rate goes from min to max and back to min in 2 steps, results are better if momentum goes the opposite way: from max to min and then back to max. The intuition is that when the learning rate is at its peak and momentum is low, SGD can move more freely in new directions and find better, flatter areas.
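The schedule described above can be sketched in a few lines of plain Python. This is only an illustration of the two-step idea as I have described it (linear ramps, min learning rate at 1/10th of the max); it is not the fast.ai implementation, and the function name `one_cycle` and the momentum bounds 0.95 and 0.85 are assumptions I picked for the example.

```python
def one_cycle(step, total_steps, lr_max, mom_max=0.95, mom_min=0.85):
    """Return (learning_rate, momentum) at a given step of a linear one-cycle schedule."""
    lr_min = lr_max / 10          # minimum set to one tenth of the maximum
    half = total_steps / 2        # two phases of equal length
    # frac rises 0 -> 1 in the first phase and falls 1 -> 0 in the second.
    frac = step / half if step <= half else (total_steps - step) / half
    lr = lr_min + frac * (lr_max - lr_min)
    mom = mom_max - frac * (mom_max - mom_min)   # momentum moves opposite to the learning rate
    return lr, mom

# Start, middle and end of a 100-step cycle.
for s in (0, 50, 100):
    print(s, one_cycle(s, total_steps=100, lr_max=1e-3))
```

At the midpoint the learning rate is at its peak while momentum is at its lowest, which is exactly the combination the intuition above relies on.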

More details on this can be found at an amazing blog.

The results of this training are as follows:

In just 4 epochs, we achieved over 95% accuracy and it took just 16 minutes

We will now see how the model has performed by plotting the confusion matrix. We will also look at a few top losses and at one of the handiest tools in the fast.ai library, “most_confused”, which basically tells us which classes the model confused with each other most often:

interp = ClassificationInterpretation.from_learner(learn)
losses, idxs = interp.top_losses()
# Sanity check: one loss and one index per validation image
len(data.valid_ds) == len(losses) == len(idxs)
interp.plot_top_losses(9, figsize=(10,10))

Top 9 losses (some of them even we can’t read)

Let’s see confusion matrix:

interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

Pretty neat! Most of our predictions are correct on diagonal

Now, let’s see where the model was most confused:

interp.most_confused(min_val=2)
[('27', '23', 20),
 ('14', '22', 16),
 ('13', '28', 12),
 ('28', '26', 12),
 ('42', '27', 12),
 ('33', '34', 10),
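To show what a list like this represents, here is a minimal sketch of how such pairs can be derived from a confusion matrix: take every off-diagonal cell with at least `min_val` mistakes and sort by count. This is my own plain-Python illustration with a made-up 3-class matrix, not the library's code, and it assumes rows are actual labels and columns are predictions.

```python
def most_confused(confusion, classes, min_val=1):
    """List (actual, predicted, count) for off-diagonal cells with count >= min_val."""
    pairs = [
        (classes[i], classes[j], confusion[i][j])
        for i in range(len(classes))
        for j in range(len(classes))
        if i != j and confusion[i][j] >= min_val
    ]
    # Largest number of mistakes first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Hypothetical 3-class confusion matrix: rows are actual labels, columns are predictions.
classes = ["a", "b", "c"]
confusion = [
    [50, 4, 0],
    [1, 45, 7],
    [0, 2, 48],
]
print(most_confused(confusion, classes, min_val=2))
# [('b', 'c', 7), ('a', 'b', 4), ('c', 'b', 2)]
```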

One of these confused pairs is 27 and 23 (indeed confusing, isn’t it? In the 27th letter, which is called “da”, the small curly part in the lower part of the image should pass through the overall character boundary, whilst in the 23rd letter, which is called “dha”, the curly part should stop at the boundary):

Another one is 28 and 13 (this one is confusing as well: in the 28th letter, which is called “dhaa”, the line on top should not touch the left curly part, which looks like a mirror image of 3, whilst in the 13th letter, which is called “ghaa”, the line should touch the left curly part):

Until now, only the newly added final layer was trained on our data, while the rest of the model kept its ImageNet weights. Can we make the model better by training all of it on our own dataset?

Yes, we can. For this, we first unfreeze our model, which means we open up all the layers of our ResNet50 model so that they can be trained and new weights learnt. Remember, we were using a frozen model until now (all layers were frozen except the last one, where we replaced the 1000 classes of ImageNet with our 46 classes).

One of the important aspects of finalizing the model is finding a good learning rate. For this, the fast.ai library has a nice feature called “lr_find” (learning rate finder). A learning rate that is too low makes training very slow, while one that is too high may make training diverge instead of converging to a minimum.

fast.ai library’s lr_find (learning rate finder) does a very interesting thing:

Over an epoch, begin your SGD with a very low learning rate (like 10^-8) but change it at each mini-batch (by multiplying it by a certain factor, for instance) until it reaches a very high value (like 1 or 10). Record the loss at each iteration and, once you’re finished, plot those losses against the learning rate. Then we pick a suitable range of learning rates: the area of the plot where the downward slope is steepest, ending just before the slope turns upwards. This plot looks something like this:

Here, the min learning rate can be 1e-6 and the max learning rate can be 1e-4. After 1e-4, we can see that the slope turns upward
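The procedure above can be sketched roughly as follows. This is a simplified illustration, not the library's lr_find: the function names are mine, the recorded losses are made-up numbers, and picking the single steepest point is just one simple way to read the plot.

```python
import math

def lr_finder_schedule(lr_start, lr_end, num_batches):
    """Learning rates that grow by a constant factor from lr_start to lr_end."""
    factor = (lr_end / lr_start) ** (1 / (num_batches - 1))
    return [lr_start * factor ** i for i in range(num_batches)]

def steepest_descent_lr(lrs, losses):
    """Return the learning rate at which loss falls fastest per unit of log10(lr)."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log10(lrs[i]) - math.log10(lrs[i - 1]))
        if slope < best_slope:
            best_lr, best_slope = lrs[i], slope
    return best_lr

# One learning rate per mini-batch, from 1e-7 up to 1e-2.
lrs = lr_finder_schedule(1e-7, 1e-2, num_batches=6)
# Hypothetical recorded losses: nearly flat, then falling, then blowing up.
losses = [2.30, 2.28, 2.10, 1.60, 1.55, 3.90]
print(f"steepest descent near lr = {steepest_descent_lr(lrs, losses):.0e}")
```

In practice the plot is noisy, so it is usually read by eye rather than with a strict rule like this one.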

We pass this range to the model as slice(1e-6, 1e-4). This basically means that although we have unfrozen all the layers of our CNN model, the initial layers, which learn the basic features of an image, get a smaller learning rate (so we retain most of the ImageNet weights, as the basic features of any image won’t be too different), and as we progress towards the later layers we gradually increase the learning rate and learn more features of our own image dataset.

Finally, we fit the model again (with fit_one_cycle) and see whether our accuracy has improved after unfreezing the model and finding a good range of learning rates:
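As a rough picture of how a range of learning rates gets spread over the network, the sketch below assigns geometrically spaced learning rates to groups of layers, from the earliest group to the last. The number of groups (three) and the geometric spacing are assumptions I made for the illustration, and `spread_learning_rates` is a hypothetical helper, not a fast.ai function.

```python
def spread_learning_rates(lr_low, lr_high, num_groups):
    """Geometrically spaced learning rates, one per layer group, earliest group first."""
    if num_groups == 1:
        return [lr_high]
    factor = (lr_high / lr_low) ** (1 / (num_groups - 1))
    return [lr_low * factor ** i for i in range(num_groups)]

# Assumption: the model is split into three layer groups.
rates = spread_learning_rates(1e-6, 1e-4, num_groups=3)
for group, lr in enumerate(rates):
    print(f"layer group {group}: lr ~ {lr:.0e}")
```

With these inputs the earliest layers train at about 1e-6, the middle ones at about 1e-5, and the final layers at about 1e-4.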

learn.fit_one_cycle(5, max_lr=slice(1e-6,1e-4))
preds, y, loss = learn.get_preds(with_loss=True)
# get accuracy
acc = accuracy(preds, y)
print('The accuracy is {0} %.'.format(acc*100))

The accuracy is 98.99 %
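For readers wondering what the accuracy number measures, it is simply the share of validation images whose highest-scoring class matches the true label. Below is a minimal plain-Python sketch with made-up predictions; `top1_accuracy` is my own illustrative name, not the library function used above.

```python
def top1_accuracy(preds, targets):
    """Fraction of rows whose highest-scoring class equals the target label."""
    correct = 0
    for scores, target in zip(preds, targets):
        predicted = max(range(len(scores)), key=lambda k: scores[k])  # argmax
        correct += predicted == target
    return correct / len(targets)

# Hypothetical class probabilities for four images over three classes.
preds = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
targets = [0, 1, 2, 1]
print(top1_accuracy(preds, targets))  # 0.75
```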

Simply awesome!

In just a few lines of code, we were able to train a world-class CNN model which predicts 46 classes with almost 99% accuracy.

I hope you found this project interesting. My Kaggle kernel can be found at: