Extracting structured data from images? OpenCV and an ML model come to the rescue!

Malgorzata Sebastiampillai
Published in DataPebbles
Aug 23, 2021 · 5 min read

We like structured data, especially in tabular format.

“In a tabular presentation, data is arranged in columns and rows, and the positioning of data makes comprehension and understanding of data more accessible. Statistical and logical conclusions are derived from its presentation.”

It definitely makes sense for our applications to also understand this tabular format and use that relational data to create simpler but usable data points. These data points would then help us automate the process of informed decision making.

In the physical world we are surrounded by data, and in a fair majority of cases we find it in a tabular format. However, translating data from the physical world to the digital one is often a time-consuming and tedious process. Therefore, I would like to present a simple way to automate it.

Can we use open source libraries only?

There are quite a few open source libraries on the market, but by far the most commonly used are OpenCV and Tesseract.

OpenCV is an open source computer vision library with a modular structure. It has several different features, such as image and video processing, object tracking, and object detection, to name a few.
More information about OpenCV can be found here.

Despite OpenCV having a lot of native functionality, it misses a key part of the equation: optical character recognition (OCR). Tesseract is a fairly mature OCR engine, and the library we will be talking about throughout this article. To delve into the details of the Tesseract library, please click here.

Let’s jump into the solution.

Tesseract is used for text detection from a digital source. After multiple trials with handwritten text, I came to the conclusion that its accuracy on handwriting was fairly low. Therefore, I wanted to see its efficacy in conjunction with OpenCV on a scanned digital document. The main advantage of using Tesseract is the possibility of tuning its various configuration parameters, such as the OCR engine modes, which may have different performance characteristics, as illustrated below.

Picture 1. OCR engine modes.

In addition to the OCR engine mode, we can also select the page segmentation mode, which changes how the image is split, for example by lines of text or by words. Choosing the right one is a matter of multiple trials and comparing the output with the desired outcome. Some modes require more computing power and can therefore slow down recognition.

Picture 2. Page segmentation modes.
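
Both modes can be passed to Tesseract through pytesseract's config string. A minimal sketch, assuming the image lives in a file called table.png (the file name and the mode values here are just illustrations):

import pytesseract
from PIL import Image

# --oem 3 selects the default engine mode; --psm 6 assumes a single
# uniform block of text (both values are illustrative choices)
text = pytesseract.image_to_string(Image.open('table.png'),
                                   config='--oem 3 --psm 6')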

For the purposes of this article, we will look at how to apply OpenCV and Tesseract to the example table below.

Picture 3. Original image

First, we will transform the original table image into a grayscale image, which will be our foundation for image segmentation. To get there, the original (BGR) image is first converted into an RGB colour image and then into a grayscale image:

import cv2
# 'img' is the original table image loaded with cv2.imread (BGR by default)
rgbImage = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
grayImage = cv2.cvtColor(rgbImage, cv2.COLOR_RGB2GRAY)

Vertical and horizontal kernels will help detect vertical and horizontal lines in the image. This is necessary for the next step, which is segmentation.

Morphological operations like erosion and dilation help detect lines in the image. In other words, erosion and dilation apply a structuring element to the input image and generate an output image.

erode() — erodes an image using a structuring element. It computes a local minimum over the area of the given kernel.

dilate() — dilates an image using a structuring element. It computes a local maximum over the area of the given kernel.

More information about morphological operations can be found here.
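
The kernels referenced below are not defined in the article's snippets; a minimal sketch of how they can be built with cv2.getStructuringElement follows (the kernel length of 40 pixels is an assumption and usually needs tuning to the image size):

import cv2

# a thin, tall kernel preserves only vertical strokes; a wide, flat
# kernel preserves only horizontal strokes (40 is an assumed length)
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))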

# vertical lines
img_template1 = cv2.erode(img_bin, vertical_kernel, iterations=2)
vertical_lines_image = cv2.dilate(img_template1, vertical_kernel, iterations=3)
# horizontal lines
img_template2 = cv2.erode(img_bin, horizontal_kernel, iterations=2)
horizontal_lines_image = cv2.dilate(img_template2, horizontal_kernel, iterations=3)

The binary image img_bin used above is obtained by applying a combination of binary and Otsu thresholding to the grayscale image, so thresholding actually takes place before the morphological operations.

Binary threshold — applies the same threshold value to every pixel. If the pixel value is smaller than the threshold, it is set to 0; otherwise it is set to a maximum value.

Otsu threshold — avoids having to choose a threshold value by determining it automatically.

More information about image thresholding can be found here.

# with the THRESH_OTSU flag set, the threshold is chosen automatically
# and the fixed value of 128 passed here is ignored
(thresh, img_bin) = cv2.threshold(grayImage, 128, 255,
                                  cv2.THRESH_BINARY | cv2.THRESH_OTSU)

Picture 4. Binary image

The next step in our journey is to find contours in the binary image, after which we are ready to crop the table into the desired boxes.
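
The article does not show how the two line images are merged into the binary image used for contour detection; one common approach, sketched here as an assumption about the pipeline, is to blend them and re-binarise the result:

import cv2

# blend the vertical and horizontal line images into one table mask
# (the equal weights are an assumed choice)
table_mask = cv2.addWeighted(vertical_lines_image, 0.5,
                             horizontal_lines_image, 0.5, 0.0)
# re-binarise the blended mask before contour detection
thresh, binaryImage = cv2.threshold(table_mask, 128, 255,
                                    cv2.THRESH_BINARY | cv2.THRESH_OTSU)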

OpenCV already has a function for contour detection; for better accuracy, the input should be a binary image:

contours, hierarchy = cv2.findContours(binaryImage, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)

Picture 5. Find contours function.
Picture 6. Data table with segmented boxes
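
The x, y, w and h values used in the next snippet come from the bounding rectangles of these contours. A minimal sketch of that step (the size filter values are assumptions, used to drop noise and the outer table border):

import cv2

boxes = []
for contour in contours:
    # axis-aligned bounding rectangle of each detected cell
    x, y, w, h = cv2.boundingRect(contour)
    # skip the outer border and tiny specks (the limits are assumed values)
    if w < img.shape[1] and h < img.shape[0] and w > 10 and h > 10:
        boxes.append((x, y, w, h))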

At this stage we are ready to use Tesseract to extract all information from the boxes.

# inside the loop over the detected boxes: crop one box with data
new_img = img[y:y + h, x:x + w]
# run Tesseract on the crop, assuming a single uniform block of text
image_tess_ocr = pytesseract.image_to_string(new_img, config='--psm 6')
# CellData is the article's small helper holding position and text
tabledata.append(CellData(x, y, image_tess_ocr.strip()))

As you can see above, the pytesseract configuration used in this case is psm=6, which means Tesseract assumes a single uniform block of text in each box.

Picture 7. Animation of cropped boxes with data

Working with this tabular data, it became apparent that Tesseract sometimes had difficulty reading it: when the cropped contour box sat too close to the text, the recognised output did not match the image. Changing the page segmentation mode also had an impact on the result.
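
One common remedy, not part of the original pipeline and therefore only a suggestion, is to pad each crop with a white border before handing it to Tesseract:

import cv2

# pad the crop with 10 white pixels on each side so the text does not
# touch the image edge (the padding size is an assumed value)
padded = cv2.copyMakeBorder(new_img, 10, 10, 10, 10,
                            cv2.BORDER_CONSTANT, value=[255, 255, 255])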

Picture 8. Ready-to-use data
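
To turn the flat tabledata list back into rows, the cells can be grouped by their vertical position. A minimal sketch, assuming CellData exposes x and y attributes and that cells in the same row differ by less than 10 pixels vertically (both are assumptions):

# sort cells top to bottom, then gather cells with similar y into rows
# and order each row left to right (the 10 px tolerance is an assumed value)
tabledata.sort(key=lambda cell: cell.y)
rows, current_row = [], [tabledata[0]]
for cell in tabledata[1:]:
    if abs(cell.y - current_row[0].y) < 10:
        current_row.append(cell)
    else:
        rows.append(sorted(current_row, key=lambda c: c.x))
        current_row = [cell]
rows.append(sorted(current_row, key=lambda c: c.x))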

Conclusion

In summary, OpenCV and Tesseract are great open source libraries for text detection and recognition in scanned documents. Optical character recognition is an enormous field within the broader discipline of machine learning, where the constraints and best use cases are still being explored.

What’s next?

What if the document has both handwritten and digital content?
That is a different challenge: we will need more machine learning models than Tesseract alone to solve it.
To find out more, stay tuned to our DataPebbles publication.
