Text extraction for unstructured data using EasyOCR

Rohini Vaidya · Published in CodeX · Jun 10, 2022 · 6 min read

What is OCR?

Optical Character Recognition (OCR) is also referred to as text recognition. An OCR program extracts and repurposes data from scanned documents, camera images, and image-only PDFs. OCR software singles out letters in the image, puts them into words, and then puts the words into sentences, thus enabling access to and editing of the original content. It also eliminates the need for manual data entry. You can go through this blog to read more about OCR technology.

Applications of OCR:

Text extraction from images is used in a number of applications: reading information from scanned documents, social media data analysis, and healthcare, where data scientists analyze the extracted text using advanced data science techniques; banks use text extraction to compare bank statements; and there are many more. You can go through this blog to read more real-world use cases of OCR.

Online tools available for text extraction:

  • Google Cloud Vision API: This is one of the most powerful APIs available for text extraction. It uses machine learning to understand your images with industry-leading prediction accuracy. You can try it on some images using this link; a minimal usage sketch is also shown after this list.
  • Amazon Textract: Amazon Textract enables you to detect and analyze text in single-page or multi-page input documents. You can refer to this link to get more insights about Amazon Textract.
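
To make the first option concrete, here is a minimal sketch of calling the Google Cloud Vision API for text detection with the google-cloud-vision Python client. The input file name and the credential setup are my assumptions for illustration; they are not part of the original article.

# Minimal sketch: text detection with the Google Cloud Vision API.
# Assumes the google-cloud-vision package is installed and the
# GOOGLE_APPLICATION_CREDENTIALS environment variable points to a key file.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("floor_plan.png", "rb") as f:   # hypothetical input image
    image = vision.Image(content=f.read())

response = client.text_detection(image=image)

# The first annotation is the full text block; the rest are individual words.
for annotation in response.text_annotations:
    print(annotation.description)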

Different techniques for text extraction (a short usage sketch for each follows the list):

  • Tesseract
  • EasyOCR
  • Keras-OCR
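
The sketches below are not from the article; they only illustrate how each listed library is typically invoked. All three assume an image file named "sample.png" and that pytesseract, easyocr, and keras-ocr (with their model weights) are installed.

# 1) Tesseract, via the pytesseract wrapper
import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open("sample.png")))

# 2) EasyOCR
import easyocr
reader = easyocr.Reader(['en'])
print(reader.readtext("sample.png", detail=0))   # detail=0 returns plain strings

# 3) Keras-OCR
import keras_ocr
pipeline = keras_ocr.pipeline.Pipeline()
images = [keras_ocr.tools.read("sample.png")]
predictions = pipeline.recognize(images)         # list of (word, box) pairs per image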

Text extraction for unstructured data:

All of the above text extraction methods perform very well on structured data. But for unstructured data, they are not able to extract the text properly, because they all scan a document or an image horizontally, line by line.

For such cases, I have implemented a technique that combines a deep learning object detection model with EasyOCR.

Problem statement:

Here is a sample image for text extraction from unstructured data:

sample.png

From this image, I want to extract a list of all room names along with their corresponding dimensions.

Solution:

Step 1: Use an object detection model (YOLOv4) to detect the regions of the floor plan that contain text, and crop those regions into separate images.

Step 2: Use EasyOCR to detect the text in the cropped images.

Step 3: Map each room name to its corresponding dimensions.

Implementation:

Below is the code for detection. I have trained a YOLOv4 model to detect text regions, so detection yields cropped images that contain only the textual parts of the floor plan.

import cv2
import numpy as np

# Load the trained YOLOv4 text detector
net = cv2.dnn.readNet("custom-yolov4-detector.weights",
                      "custom-yolov4-detector.cfg")

# Single text class (class id 0)
classes = [0]

img = IMG  # IMG is the input floor-plan image (a NumPy array)
height, width, _ = img.shape

# Prepare the image as a blob and run a forward pass through the network
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
out_names = net.getUnconnectedOutLayersNames()
layerOutputs = net.forward(out_names)

# Collect detections in the form of bounding boxes, confidences and class ids
boxes = []
confidences = []
class_ids = []
qq = 0  # label drawn on each detected box
j = 0   # index used when saving the cropped images
for out in layerOutputs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5:
            centerX = int(detection[0] * width)
            centerY = int(detection[1] * height)
            # Enlarge the box slightly (factor 1.3) so the text is not clipped
            w = int(detection[2] * width * 1.3)
            h = int(detection[3] * height * 1.3)
            x = int(centerX - (w / 2))
            y = int(centerY - (h / 2))
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)

# Non-maximum suppression to remove overlapping boxes
indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.3, 0.4)
indexes = np.array(indexes)
font = cv2.FONT_HERSHEY_PLAIN
colors = np.random.uniform(0, 255, size=(len(boxes), 3))

for i in indexes.flatten():
    x, y, w, h = boxes[i]
    label = str(classes[class_ids[i]])
    color = colors[i]
    confidence = str(round(confidences[i], 2))
    if label == '0':
        cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)
        cv2.putText(img, str(qq), (x + 10, y + 10), font, 2, (0, 0, 255), 2)
        # Crop the detected text region and save it
        # (the path placeholder should contain a "{}" for the index j)
        obj = img[y: y + h, x: x + w]
        cv2.imwrite("PATH TO SAVE CROPPED IMAGES".format(j), obj)
        j += 1
        qq += 1
cropped_img.png

Now, we will check the angle of the text in each cropped image.

Here is the function to detect the angle of the text in a cropped image:

def detect_angle(image):
    mask = np.zeros(image.shape, dtype=np.uint8)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3, 3), 0)
    adaptive = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY_INV, 15, 4)

    cnts = cv2.findContours(adaptive, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]

    # Draw the text contours onto a blank mask
    for c in cnts:
        area = cv2.contourArea(c)
        if area < 45000 and area > 20:
            cv2.drawContours(mask, [c], -1, (255, 255, 255), -1)

    mask = cv2.cvtColor(mask, cv2.COLOR_BGR2GRAY)
    h, w = mask.shape

    # Horizontal text: compare pixel density of the left and right halves
    if w > h:
        left = mask[0:h, 0:w // 2]
        right = mask[0:h, w // 2:]
        left_pixels = cv2.countNonZero(left)
        right_pixels = cv2.countNonZero(right)
        if left_pixels >= right_pixels:
            return 0
        else:
            return 180
    # Vertical text: compare pixel density of the top and bottom halves
    else:
        top = mask[0:h // 2, 0:w]
        bottom = mask[h // 2:, 0:w]
        top_pixels = cv2.countNonZero(top)
        bottom_pixels = cv2.countNonZero(bottom)
        if top_pixels >= bottom_pixels:
            return 270
        else:
            return 90

Sometimes the text in an image is rotated; in such cases, EasyOCR will not be able to detect the text properly. To handle this, I have used an algorithm that rotates the text in the image back to 0 degrees.

Here is the function that adjusts the angle of the text if it is not at 0 degrees:

# Adjust the angle
def rotation(angle):
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    return angle
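
The rotation() helper above only computes the correction angle; the article does not show the step that actually rotates the image. Below is a minimal sketch of how that could be done. It is my assumption, not the author's exact code: since detect_angle() only returns multiples of 90 degrees, cv2.rotate is enough, and the function name rotate_to_horizontal is hypothetical.

# Minimal sketch: rotate the cropped image so the text is horizontal before OCR.
import cv2

def rotate_to_horizontal(image, angle):
    # detect_angle() returns one of 0, 90, 180, 270 degrees.
    # The direction may need to be flipped depending on how the angle is defined.
    if angle == 90:
        return cv2.rotate(image, cv2.ROTATE_90_COUNTERCLOCKWISE)
    if angle == 180:
        return cv2.rotate(image, cv2.ROTATE_180)
    if angle == 270:
        return cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
    return image  # already at 0 degrees

# Usage:
# angle = detect_angle(cropped_img)
# rotated_img = rotate_to_horizontal(cropped_img, angle)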

After getting a properly oriented image, we can apply EasyOCR to detect the text in it.

import easyocr

def ocr(rotated_img):
    # EasyOCR works directly on the image array (or a file path)
    reader = easyocr.Reader(['en'])
    result = reader.readtext(rotated_img)
    return result

The output of the OCR is given below:

[([[25, 11], [77, 11], [77, 25], [25, 25]], 'KITCHEN', 0.916220680138048), ([[17, 23], [85, 23], [85, 37], [17, 37]], '9\'7" x 12\'5"', 0.36220390889597476)]

Now, I used simple Python code that walks through the result list and maps each room name to the dimension entry that follows it, as sketched below.
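
Here is a minimal sketch of that mapping step, under the assumption that room names and dimension strings alternate in the EasyOCR output. The function name map_rooms and the use of pandas are my illustration, not the article's exact code.

import pandas as pd

def map_rooms(ocr_results):
    # Each EasyOCR result is (bounding_box, text, confidence); keep the text only.
    texts = [text for _, text, _ in ocr_results]
    # Pair each room name with the dimension string that follows it.
    rows = [{"Room": name, "Dimensions": dims}
            for name, dims in zip(texts[0::2], texts[1::2])]
    return pd.DataFrame(rows)

# For the OCR output shown above, this yields one row:
# Room = "KITCHEN", Dimensions = '9\'7" x 12\'5"'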

And finally, my output data frame is as given below:

Output data frame

References:

  1. https://www.analyticsvidhya.com/blog/2020/05/build-your-own-ocr-google-tesseract-opencv/
  2. https://stackoverflow.com/questions/58010660/detect-image-orientation-angle-based-on-text-direction

I hope this content helps you solve your text extraction problems.

Thank you !!
