OCR (Optical Character Recognition) from Scratch Using Deep Learning

Sanket Magodia
10 min read · Jan 16, 2022

Hey folks, ever wondered how Google, Instagram, etc. read the text in your images? Ever wondered how a computer understands what is written in an image and shows you results accordingly? The answer to that question is a technique called Optical Character Recognition, or OCR.

Optical Character Recognition: optical character recognition, or an optical character reader, is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image.

After reading this article you will be able to build your own OCR from scratch.

We will feed in an image containing some text, and our script will give us a prediction of the text it finds.

Prerequisites-

  • Basic knowledge of Python and its libraries (NumPy, pandas, skimage)
  • Basic knowledge of deep learning

Note: I will not be providing a setup tutorial for the code; however, the full source code is provided at the end. The aim of this tutorial is to give you the idea, or the algorithm, behind developing an OCR, not to teach you how to code.

Results

Input image:

Bounding boxes showing how characters are detected:

Extracted text:

So, let's start!

FLOW DIAGRAM

Here's the flow diagram for reference, in case you get lost in the steps below.

Let's start with the dataset.

DATASET

The dataset used is the Chars74K dataset with 62 classes (0–9, A–Z, a–z). For the sake of simplicity, I've chosen the computer-generated characters subset, which has 62,992 characters synthesized from computer fonts. Here's the link.

Preprocess the Dataset and Train our Deep learning Model

As the images in the dataset are already binary (i.e., pixel values are either 0, black, or 255, white), we don't need to binarize them. So, in pre-processing, we take each image, add some white margin to it, and finally resize it to 32x32 pixels, so the inputs look alike and the model can generalize better.

Note: I was not using a GPU to train, which is why I trained the model with only 10–15 samples per class. This is why I didn't get the best accuracy, and also why I didn't do a train-test split and instead used the whole dataset for training.

So let's start.

1- Imports

import numpy as np
from skimage import io
import os
from PIL import Image
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

2- Preprocessing and Dataset Preparation

This function adds the white margin:

def add_margin(pil_img, top, right, bottom, left, color):
    width, height = pil_img.size
    new_width = width + right + left
    new_height = height + top + bottom
    result = Image.new(pil_img.mode, (new_width, new_height), color)
    result.paste(pil_img, (left, top))
    return result

The list dataset will hold the pixel values of all images in the dataset, and the list labels will hold a class index, where each number corresponds to a specific class (folder). We also add some margin and resize every image to 32x32.

dataset = []
labels = []
count = 0
folders = os.listdir(r"S:/Directory/English/Fnt/")
for i in folders:
    for j in os.listdir(r"S:/Directory/English/Fnt/" + str(i)):
        im_new = add_margin(Image.open(r"S:/Directory/English/Fnt/" + str(i) + '/' + str(j)),
                            10, 10, 10, 10, (255))
        resized_image = im_new.resize((32, 32))
        dataset.append(np.array([np.asarray(resized_image) / 255.0]))
        labels.append(count)
    count += 1  # one class label per folder

Finally, reshape the dataset and labels lists into the shapes required by the deep learning model:

dataset = np.array(dataset).reshape(-1, 32, 32, 1)  # -1 infers the total number of samples
labels = np.array(labels).reshape(-1, 1)

We will convert the labels into a one-hot encoding:

from sklearn.preprocessing import OneHotEncoder
type_encoder = OneHotEncoder()
labels=type_encoder.fit_transform(labels).toarray()
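Alternatively, since to_categorical is already imported above, a one-line sketch that should produce the same one-hot matrix (assuming the integer labels run 0–61):

labels = to_categorical(labels, num_classes=62)  # same one-hot result via Keras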

3- Build the model

The model's input is 32x32 because we feed it images of size 32x32, and the output layer has 62 units because our data falls into 62 classes.

model = Sequential()
# Convolution
model.add(Conv2D(32, (3, 3), input_shape = (32,32,1), activation = 'relu'))
# Pooling
model.add(MaxPooling2D(pool_size = (2, 2)))
# Convolution
model.add(Conv2D(32, (3, 3), activation = 'relu'))
# Flattening
model.add(Flatten())
# Full connection
#model.add(Dense(units = 512, activation = 'relu'))
#model.add(Dropout(0.5))
model.add(Dense(units = 256, activation = 'relu'))
# Add Dropout to prevent overfitting
model.add(Dropout(0.5))
model.add(Dense(units = 128, activation = 'relu'))
model.add(Dense(units = 62, activation = 'softmax'))
# Compiling the CNN
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.summary()

4- Train and save the model for later use

model.fit(dataset, labels, batch_size = 4, epochs = 100)
model.save('Model.h5')

After this you will find the Model.h5 file in the directory.
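Later, in the prediction script, the saved model can be loaded back. A minimal sketch, assuming Model.h5 sits in the working directory:

from tensorflow.keras.models import load_model

model = load_model('Model.h5')  # restores architecture and weights for prediction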

HURRAY!!! We've finished the machine learning part!

User inputs and Prediction

(Here I am just showing snippets; don't worry, I'll provide the whole code.)

1- Convert any color image to black and white (note: the input image should consist of only two colors).

We take the edge (corner) pixel value and convert every pixel matching that value to white (i.e., 255) and every other pixel to black (i.e., 0), to get our standard output.

from PIL import ImageOps

logo = Image.open(UPLOAD_FOLDER + '/' + patho)  # input image location (UPLOAD_FOLDER and patho come from the Flask app)
logo = ImageOps.grayscale(logo)  # RGB to grayscale conversion
logo = np.asarray(logo)  # PIL image to array
# color to black and white
a = logo.copy()
for i in range(len(logo)):
    for j in range(len(logo[0, :])):
        if logo[i][j] == logo[0][0]:
            a[i][j] = 255
        else:
            a[i][j] = 0
# variable 'a' is the output
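As a side note, the same corner-pixel trick can be written as one vectorized NumPy line (my own sketch, not part of the original script):

a = np.where(logo == logo[0][0], 255, 0).astype(np.uint8)  # background -> white, everything else -> black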

2- Convert grayscale image to binary

Note: there are more sophisticated methods to implement this.

def black_and_white(a):  # takes an np array image
    m = a.copy()
    for i in range(len(m)):
        for j in range(len(m[0])):
            if m[i][j] > 200:
                m[i][j] = 255
            else:
                m[i][j] = 0
    return m
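One such more sophisticated method, since skimage is already a prerequisite, is Otsu's thresholding, which picks the cutoff from the image histogram instead of the hard-coded 200. A sketch (not the article's code):

from skimage.filters import threshold_otsu

def black_and_white_otsu(a):  # takes an np array image
    t = threshold_otsu(a)  # threshold derived from the image histogram
    return np.where(a > t, 255, 0).astype(np.uint8)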

3- Text detection

We traverse the image row by row along the horizontal axis and set a checkpoint whenever the rows switch from all white pixels to containing some black pixels along with the white, and vice versa. This algorithm lets us detect the coordinates of the text lines.

We get the output as the 4 coordinates of each bounding rectangle.

coords = []
xycoords = []

def line_coords(coords):
    xmin = coords[0][0][0]
    xmax = coords[-1][0][0]
    ymin = 20000
    ymax = 0
    for i in coords:
        for j in i:
            if j[1] > ymax:
                ymax = j[1]
            if j[1] < ymin:
                ymin = j[1]
    xycoords.append([xmin, xmax + 2, ymin + 1, ymax])

for i in range(len(logo[1:-1, 1:-1])):
    coo = []
    flag = 0
    for c in logo[1:-1, 1:-1][i]:
        if c < 200:
            flag = 1
    if flag == 1:
        # row contains ink: record black/white transitions within it
        for b in range(len(logo[1:-1, 1:-1][i])):
            if logo[1:-1, 1:-1][i][b] > 200:
                try:
                    if logo[1:-1, 1:-1][i][b + 1] < 200:
                        coo.append([i, b + 1])
                except:
                    pass
            if logo[1:-1, 1:-1][i][b] < 200:
                try:
                    if logo[1:-1, 1:-1][i][b + 1] > 200:
                        coo.append([i, b])
                except:
                    pass
    else:
        # all-white row: a text line (if any) has just ended
        if len(coords) > 0:
            line_coords(coords)
            coords = []
    if len(coo) > 0:
        coords.append(coo)
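For intuition, the same row-scanning idea can be sketched compactly with NumPy (my own illustration, using a hypothetical find_text_lines helper, not the article's code):

def find_text_lines(img, thresh=200):
    # 1 where a row contains at least one dark (ink) pixel, else 0
    has_ink = (img < thresh).any(axis=1).astype(int)
    # +1 marks where an ink run starts, -1 one past where it ends
    d = np.diff(np.concatenate(([0], has_ink, [0])))
    starts = np.where(d == 1)[0]
    ends = np.where(d == -1)[0]
    return list(zip(starts, ends))  # (row_start, row_end) per text line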

The list xycoords now contains the coordinates of all detected text lines.

4- Space and Line Detection

The idea behind space detection is this: we look through the cutouts given by the coordinates above (this time vertically) and count the number of consecutive all-white vertical lines between characters, then pass that array to k-means clustering. This creates 2 clusters and yields an array with one entry per gap, e.g. [0,1,0,0,0,1], from which we can tell there is a space at the 2nd position. After every line we append a 2 to the array as a checkpoint for line breaks; e.g., the text "I will love you\nForever" gets an encoding like [0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0].

import ckwrap  # library for k-means

spaces = np.array([0])
ctr = 0
for y in xycoords:
    sp = []
    a = black_and_white(logo[y[0]:y[1], y[2]:y[3]])
    for i in range(len(a[0, :])):
        f = 0
        for j in a[:, i]:
            if j == 0:
                f = 1
        if f != 1:
            ctr += 1      # count consecutive all-white columns
        if f == 1:
            sp.append(ctr)
            ctr = 0
    nums = np.array([jj for jj in sp if jj != 0])
    if len(nums) == 0:
        spaces = np.concatenate((spaces, np.array([2])), axis=None)
    else:
        print('nums are - ' + str(nums))
        km = ckwrap.ckmeans(nums, 2)
        print('labs are - ' + str(km.labels))
        spaces = np.concatenate((spaces, km.labels, np.array([2])), axis=None)
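To see what the clustering step returns, here is a toy run with made-up gap widths (illustrative values only):

import ckwrap
import numpy as np

gaps = np.array([2, 3, 12, 2, 2, 11, 3])  # widths of white gaps between characters
km = ckwrap.ckmeans(gaps, 2)              # 2 clusters: narrow vs. wide gaps
print(km.labels)                          # expected [0 0 1 0 0 1 0] -> the 1s mark spaces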

5- Dump Pieces of characters from detected lines

Now we crop out images of the detected lines from the previous section and traverse the pixel columns vertically, looking for the first column whose pixels are all white (255).

I've made a function that returns a flag indicating whether a vertical line contains a black pixel or not. While traversing the vertical lines one by one, if the flag changes, we consider it the end of a character in the image.

As you can see in the image, the red lines are the start and end X coordinates of the characters, so we can now crop out these pieces and save them to a folder, named with integers from 0 to the number of characters to maintain the order.

col = []
# dump pieces of characters
count = 0
for hoe in finalXY:
    newC = [0]

    def flagCalc(i):
        # returns 1 if vertical line i of the crop contains no black pixel
        flag = 1
        for j in range(len(logo[hoe[0]:hoe[1], hoe[2]:hoe[3]][:, i])):
            if logo[hoe[0]:hoe[1], hoe[2]:hoe[3]][:, i][j] < 150:
                flag = 0
        return flag

    for i in range(len(logo[hoe[0]:hoe[1], hoe[2]:hoe[3]][0, :])):
        try:
            if flagCalc(i) < flagCalc(i + 1):
                newC.append(i + 1)  # character ends where ink gives way to white
        except:
            pass
    newC.append(hoe[3])
    col.append(newC)
    for i in range(len(newC) - 1):
        A = black_and_white(logo[hoe[0]:hoe[1], hoe[2]:hoe[3]][:, newC[i]:newC[i + 1]])
        im = Image.fromarray(A)
        im.save("dump/" + str(count) + ".png")
        count += 1

6- Preprocessing the pieces of characters before feeding the model

Now we have the pieces of characters, but one thing to notice: they are not consistent. For a little more clarity, take this snip from the previous example:

Here the character 't' is in the corner of the image and occupies very little space, while the character 'h' is in the middle and occupies much more. Also, the aspect ratios of the two cropped pieces are different. This makes prediction difficult, because, as you may recall, the model was trained on images centered in a white 32x32 square. So our next task is to transform these cutouts so they look close to the images in our dataset.

Steps-

After removing the borders, the character 't' will look something like this:

It is then pasted at the center of a 32x32 white image, and the same margins are added as we did for the dataset.

def borderRemoval(path):
    a = io.imread(path)
    # binarize first
    for i in range(len(a)):
        for j in range(len(a[0])):
            if a[i][j] > 200:
                a[i][j] = 255
            else:
                a[i][j] = 0

    def flagCalc(i):
        # returns 1 if the given row/column contains a black pixel
        flag = 0
        for j in range(len(i)):
            if i[j] == 0:
                flag = 1
        return flag

    y1 = 0
    y2 = a.shape[0]
    x1 = 0
    x2 = a.shape[1]
    for i in range(len(a) - 1):
        if flagCalc(a[i]) < flagCalc(a[i + 1]):
            y2 = a.shape[0]
            if (i + 1) < y2:
                y1 = i + 1
        elif flagCalc(a[i]) > flagCalc(a[i + 1]):
            if (i - 1) > y1:
                y2 = i - 1
    for i in range(len(a[0, :]) - 1):
        if flagCalc(a[:, i]) < flagCalc(a[:, i + 1]):
            if (i + 1) < x2:
                x1 = i + 1
        elif flagCalc(a[:, i]) > flagCalc(a[:, i + 1]):
            if (i - 1) > x1:
                x2 = i - 1
    im = Image.fromarray(a[y1:y2, x1:x2])
    im.save(path)

def PasteImage(path):
    a = io.imread(path)
    # scale the longer side down to 28 pixels, preserving aspect ratio
    if a.shape[0] > a.shape[1]:
        f = 28 / a.shape[0]
    else:
        f = 28 / a.shape[1]
    b = Image.fromarray(a, mode='L').resize(
        (int(a.shape[1] * f), int(a.shape[0] * f)), Image.BICUBIC)
    c = Image.fromarray(np.full((32, 32), 255).astype('uint8'), mode='L')
    img_w, img_h = b.size
    bg_w, bg_h = c.size
    offset = ((bg_w - img_w) // 2, (bg_h - img_h) // 2)
    c.paste(b, offset)
    c.save(path)

for mm in os.listdir(r"dump/"):
    borderRemoval(r"dump/" + str(mm))
    PasteImage(r"dump/" + str(mm))
    # re-binarize after the bicubic resize; black_and_white (implemented above) takes an array, not a path
    m = black_and_white(io.imread(r"dump/" + str(mm)))
    Image.fromarray(m).save(r"dump/" + str(mm))

This preprocesses all the cropped character images and saves them in place.

NOTE: we will add the margin at prediction time, to save on conversion computation.

7- Label creation

Our model outputs a one-hot encoded label, which looks like [0,1,0,…] for a prediction of the class at position 2. The argmax function gives us the index of the highest value in a list; in this case it outputs 1 (list indices start from 0). Since there are only 62 classes, we can create our label conversions manually.
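For instance, a toy illustration:

import numpy as np

one_hot = [0, 1, 0, 0]     # pretend 4-class model output
print(np.argmax(one_hot))  # 1 -> look this index up in the label map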

labs = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9,
        10: 'A', 11: 'B', 12: 'C', 13: 'D', 14: 'E', 15: 'F', 16: 'G', 17: 'H', 18: 'I',
        19: 'J', 20: 'K', 21: 'L', 22: 'M', 23: 'N', 24: 'O', 25: 'P', 26: 'Q', 27: 'R',
        28: 'S', 29: 'T', 30: 'U', 31: 'V', 32: 'W', 33: 'X', 34: 'Y', 35: 'Z',
        36: 'a', 37: 'b', 38: 'c', 39: 'd', 40: 'e', 41: 'f', 42: 'g', 43: 'h', 44: 'i',
        45: 'j', 46: 'k', 47: 'l', 48: 'm', 49: 'n', 50: 'o', 51: 'p', 52: 'q', 53: 'r',
        54: 's', 55: 't', 56: 'u', 57: 'v', 58: 'w', 59: 'x', 60: 'y', 61: 'z'}
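The same mapping can also be generated with the string module (an equivalent sketch, relying on the class order of digits, then uppercase, then lowercase, as above):

import string

# digits 0-9, then A-Z, then a-z, matching the class order above
chars = [str(d) for d in range(10)] + list(string.ascii_uppercase) + list(string.ascii_lowercase)
labs = {i: c for i, c in enumerate(chars)}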

8- Model Prediction

The variable ‘STRING’ is our main variable, where we accumulate our predictions, line breaks, and spaces. In this step we walk through the list named spaces to check whether each position is a space, a line break, or a character. For a character, we open the corresponding dumped image, convert it to an array, add the same margin we added to the dataset, and pass it through our model. The model outputs a one-hot encoded value, which the argmax function converts into an index, which in turn is matched against the dictionary ‘labs’ we created manually in the previous step; the predicted character is then appended to STRING. The same goes for spaces (‘ ’) and line breaks (‘<br>’ or ‘\n’). After each prediction is made, we delete the dumped image from the folder.

def add_margin(pil_img, top, right, bottom, left, color):
    # same helper we defined for the dataset preparation
    width, height = pil_img.size
    new_width = width + right + left
    new_height = height + top + bottom
    result = Image.new(pil_img.mode, (new_width, new_height), color)
    result.paste(pil_img, (left, top))
    return result

co = -1
STRING = ''
charPred = []

while True:
    co += 1
    try:
        if spaces[co] == 1:
            STRING = STRING + ' '
        elif spaces[co] == 2:
            STRING = STRING + '<br>'
        image = Image.open('dump/' + str(co) + '.png')
        im_new = add_margin(image, 10, 10, 10, 10, (255))
        resized_image = im_new.resize((32, 32))
        a = np.asarray(resized_image) / 255
        hehe = labs[np.argmax(model.predict([a.reshape(32, 32, 1).tolist()]))]
        charPred.append(hehe)
        os.remove('dump/' + str(co) + '.png')
        STRING += str(hehe)
    except:
        break  # no more dumped images left

The variable STRING will hold the required output.

Full source code, with a Flask web app implementation.

YAY, we completed it!!!

My LinkedIn & Twitter
