Optical Character Recognition Using TensorFlow

Kamlesh Solanki
Published in Analytics Vidhya
8 min read · Aug 8, 2021

In this article we’ll learn how to build an OCR (Optical Character Recognition) system using TensorFlow, and we’ll also deploy the deep learning model with the Flask framework.

Table of Contents

  1. What is OCR?
  2. Data collection
  3. Building OCR Model
  4. Model Deployment
  5. Adding more data
  6. Limitations
  7. Further Extensions

Let’s get started by introducing OCR.

1. What is OCR?

Standard definition of OCR from Wikipedia

Optical character recognition or optical character reader is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo or from subtitle text superimposed on an image.

Wait, isn’t that a bit too technical?

In simple terms, OCR is a system that recognises text in images and scanned documents. It’s as simple as that.

We all know that data is everything in deep learning. So, let’s find some datasets for solving this problem.

2. Data Collection

We’re building a character-based OCR model in this article. For that we’ll be using two datasets.

The standard MNIST dataset comes built into many deep learning frameworks such as TensorFlow, PyTorch and Keras. MNIST covers the digits 0–9; each sample is a 28 × 28 grayscale image containing a single digit.

MNIST doesn’t include the letters A–Z, so for those we’re using a dataset released by Sachin Patel on Kaggle. It takes the capital letters A–Z from NIST Special Database 19 and rescales them to 28 × 28 grayscale images, the same format as our MNIST data.

Here is an example of the images present in these datasets.

Datasets for solving the OCR problem.

Now let’s jump straight into the coding part.

3. Building OCR Model

  1. Loading Datasets

Let’s write the code for loading the MNIST dataset.

Each line in the code above is self-explanatory, so let’s go further.

Now we need one more function to load the A–Z dataset.
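As a sketch, a loader for the Kaggle CSV (each row is a label followed by 784 pixel values; the file path is whatever you downloaded it to):

```python
import numpy as np

def load_az_dataset(csv_path):
    # Each row of the Kaggle CSV is: label, then 784 pixel values
    data, labels = [], []
    for row in open(csv_path):
        row = row.split(",")
        labels.append(int(row[0]))
        # Reshape the 784 flat pixels back into a 28 x 28 image
        data.append(np.array([int(x) for x in row[1:]],
                             dtype="uint8").reshape(28, 28))
    return np.array(data, dtype="float32"), np.array(labels, dtype="int")
```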

Again, this block of code is easy to understand with the help of the comments.

Let’s call both functions, and our datasets are ready.

2. Combining datasets and dataset preparation

Now we need to combine both datasets to feed into the model. This can be done with a few lines of code.

Here we add 10 to each label in the A–Z dataset because we are going to stack it on top of the MNIST dataset, whose labels 0–9 are already taken. Then we stack the data and labels. Our model architecture expects 32 × 32 pixel images, so we resize them; we also add a channel dimension to every image and scale the pixel intensities from [0, 255] down to [0, 1].
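A minimal sketch of these steps (here I pad 28 × 28 up to 32 × 32 with a zero border instead of interpolating; the article resizes, e.g. with `cv2.resize`):

```python
import numpy as np

def combine_datasets(digits_data, digits_labels, az_data, az_labels):
    # Shift A-Z labels by 10 so they don't collide with digits 0-9
    az_labels = az_labels + 10
    data = np.vstack([digits_data, az_data]).astype("float32")
    labels = np.hstack([digits_labels, az_labels])
    # Grow 28x28 images to 32x32 (a 2-pixel zero border here;
    # interpolated resizing is another option)
    data = np.pad(data, ((0, 0), (2, 2), (2, 2)))
    # Add a channel dimension and scale intensities to [0, 1]
    data = np.expand_dims(data, axis=-1) / 255.0
    return data, labels
```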

Next we convert the labels from integers to vectors for ease of model fitting, count the occurrences of each character in the dataset, and compute a class weight for each label.
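A NumPy sketch of both steps (the article likely uses scikit-learn’s `LabelBinarizer`; this one-hot encoding is equivalent, and each label’s class weight is the largest class count divided by that label’s count, so rare characters weigh more):

```python
import numpy as np

def binarize_and_weight(labels, num_classes=36):
    # One-hot encode integer labels (10 digits + 26 letters = 36)
    one_hot = np.eye(num_classes, dtype="float32")[labels]
    # Count examples per class; weight = max count / class count
    counts = one_hot.sum(axis=0)
    class_weights = {i: counts.max() / c
                     for i, c in enumerate(counts) if c > 0}
    return one_hot, class_weights
```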

3. Performing Data Augmentation

We can improve the results of our ResNet classifier by augmenting the training data with ImageDataGenerator. We apply random rotations, scaling, horizontal and vertical translations, and tilts to the images. Here is a block of code through which we perform data augmentation.
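A sketch of such a generator (the exact parameter ranges below are my assumptions, not the article’s):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations, zooms, shifts and shears for training images
aug = ImageDataGenerator(
    rotation_range=10,       # small random rotations
    zoom_range=0.05,         # random scaling
    width_shift_range=0.1,   # horizontal translations
    height_shift_range=0.1,  # vertical translations
    shear_range=0.15,        # tilts
    fill_mode="nearest",
)
```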

Now our data is ready, so let’s build the heart of our project, i.e. the ResNet architecture.

4. Building ResNet Architecture

Here is a custom implementation of the ResNet architecture. I’m not explaining the entire architecture in this post.
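The full implementation isn’t reproduced here, but as an illustration, a sketch of a single residual block in Keras (the identity shortcut is swapped for a 1 × 1 convolution whenever the output shape changes):

```python
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    # Shortcut branch: identity, or a 1x1 conv when the shape changes
    shortcut = x
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    # Main branch: two 3x3 convolutions with batch norm and ReLU
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Add the shortcut back, then a final ReLU
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```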

5. Compiling model

Let’s initialise certain hyper-parameters for fitting our model.

EPOCHS = 50
INIT_LR = 1e-1
BS = 128

So we’ll fit the model for 50 epochs with an initial learning rate of 1e-1 and a batch size of 128.

We are using the stochastic gradient descent optimiser with categorical cross-entropy loss, and we’ll evaluate our model on accuracy. So, finally, let’s fit the model.
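A sketch of the compile-and-fit step with these settings (`compile_model` is a hypothetical helper, and `aug`, `train_x`, `class_weights` in the commented fit call are illustrative names):

```python
from tensorflow.keras.optimizers import SGD

# Hyper-parameters from the article
EPOCHS = 50
INIT_LR = 1e-1
BS = 128

def compile_model(model):
    # SGD optimiser, categorical cross-entropy loss, accuracy metric
    model.compile(loss="categorical_crossentropy",
                  optimizer=SGD(learning_rate=INIT_LR),
                  metrics=["accuracy"])
    return model

# Fitting would then look roughly like:
# model.fit(aug.flow(train_x, train_y, batch_size=BS),
#           validation_data=(test_x, test_y),
#           epochs=EPOCHS, class_weight=class_weights)
```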

After about 3 hours of training on Google Colab with a GPU, I got 0.9679 accuracy on the training set and 0.9573 accuracy on the test set.

6. Model Evaluation

Evaluation metrics

Training history

Training History for ResNet

The graph above looks pretty good, which is a sign that our model is performing well on this task.

Let’s save this model so that we can load it afterwards.

model.save('OCR_Resnet.h5', save_format="h5")

Before jumping into model deployment, let’s check how our model performs on actual images.

Here is a block of code that randomly picks some images from the test set, predicts them, and visualises the results.
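A sketch of the sampling-and-prediction part (the matplotlib visualisation is omitted; `predict_random_samples` and the label string are my own names, not the article’s):

```python
import numpy as np

# Class index -> character: 10 digits followed by 26 capital letters
LABEL_NAMES = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def predict_random_samples(model, test_x, test_y, n=10, seed=None):
    # Pick n random test images and pair predicted vs. true character
    rng = np.random.default_rng(seed)
    idxs = rng.choice(len(test_x), size=n, replace=False)
    probs = model.predict(test_x[idxs], verbose=0)
    preds = probs.argmax(axis=1)
    truth = test_y[idxs].argmax(axis=1)
    return [(LABEL_NAMES[p], LABEL_NAMES[t]) for p, t in zip(preds, truth)]
```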

Take a look at the output I obtained.

Output of ResNet Model

4. Model Deployment

In the end, we want our model to be available to end users so that they can make use of it. Model deployment is one of the last stages of our project. For it we used the Python web framework Flask to deploy our model as a web application.

Wait, what is Flask?

Flask is a web application framework written in Python. It has multiple modules that make it easier for a web developer to write applications without having to worry about the details like protocol management, thread management, etc.

Flask gives us a variety of choices for developing web applications, and it provides the necessary tools and libraries to build one.

To build a successful Flask web app, first of all we create a simple website using HTML5, CSS3 and JavaScript. We then converted the model, which is a Python object, into a character stream using pickling. The idea is that this character stream contains all the information necessary to reconstruct the object in another Python script.

The next part was to make an API that receives the user’s image through the website and computes the predicted output using our model.
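A hedged sketch of such a Flask endpoint (the route name, form field and response format are assumptions; the segmentation and prediction steps are elided):

```python
import io

import numpy as np
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Receive the uploaded image from the website's form
    file = request.files["image"]
    # Decode to a grayscale array scaled to [0, 1]
    img = Image.open(io.BytesIO(file.read())).convert("L")
    arr = np.array(img, dtype="float32") / 255.0
    # ...segment characters, resize each to 32x32, run the model...
    return jsonify({"prediction": "..."})
```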

Below is the flow diagram of the entire application.

Flow Diagram

Let’s now build the algorithm to successfully deploy our model in the web app.

How can this be implemented?

  • Our model is trained to recognise one character at a time, i.e. it is a character-based model. But in real life a user can upload an image containing entire words or even sentences. So, in order to make precise predictions, we need to separate every character in the image and feed them to the model one at a time.
  • Example
Extracting characters
  • The image above makes it clear exactly what we need to do.
  • Once we have the list of extracted characters, we can resize each of them and feed them one by one to the model to get predictions.
  • Our web app will then display the output.

Below is the entire code which performs this task.
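The embedded code isn’t reproduced here; as an illustration, here is a simple blank-column character split in NumPy (contour detection with OpenCV, as typically used for this step, is more robust):

```python
import numpy as np

def split_characters(binary_img, min_width=2):
    # binary_img: 2-D array, nonzero where ink is present.
    # Split on fully blank columns between characters.
    ink_cols = binary_img.any(axis=0)
    chars, start = [], None
    for x, has_ink in enumerate(ink_cols):
        if has_ink and start is None:
            start = x                      # a character begins
        elif not has_ink and start is not None:
            if x - start >= min_width:     # ignore specks
                chars.append(binary_img[:, start:x])
            start = None
    if start is not None and ink_cols.shape[0] - start >= min_width:
        chars.append(binary_img[:, start:])  # character at right edge
    return chars
```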

You can get the entire source code for this project from my GitHub repository.

Now let’s look at the final web app.

Web app

After clicking the prediction button, the app.py file will run and the prediction will be shown on screen.

Google prediction image

Let’s see the performance on one more image.

Author’s content

5. Adding more data

From the predictions made by our model, it’s clear that it is still quite poor at recognising some characters. So we need to do something about it.

We’ve done a ton of experiments to improve the performance of this project; I’ll cover all of them.

Our model does not know that the lowercase letters a–z exist, so we need to add that data to our training dataset. Apart from this, we can add more data for other characters as well.

In search of this, we found a dataset here containing everything we needed. So what we did was combine all 3 datasets, i.e.

  1. Standard MNIST
  2. A-Z from kaggle
  3. English Chars74K dataset

By combining all of these datasets, our dataset became vast, and the a–z characters were added. After that we performed all the same steps shown above, fit the model, and got 85% accuracy on the test set; the model’s performance also improved.

Combined dataset model performance

We can still increase the model’s performance by fitting for more epochs.

6. Limitations

  • Our model can fail if the image is very complex, e.g. cursive handwriting or images with connected characters.
  • Currently our model is trained only on the English language and digits, so if a user uploads an image in some other language it gives wrong predictions.

We can address these limitations by extending this project.

7. Further Extensions

  • In order to overcome the limitations, we can experiment with other neural network architectures, including a combination of CNN and RNN (a CRNN) for predicting continuous characters.
  • We can also train on a larger dataset to increase performance.
  • For languages other than English, we can train our model on datasets of those languages.
  • We can also experiment with a word-based OCR technique, which may be more effective than character-based OCR.

This work was performed during our internship at the Bhaskaracharya Institute For Space Applications and Geo-Informatics, in a team consisting of myself, Prince Ajudiya and Yagnik Bhavishi.

I hope you got a lot of useful information from this article.

Thanks for reading. 😃
