(A piece of) Machine Learning revolution in practice

mimyllyv
Published in KelaLab
Mar 28, 2022
[Cover image: a starry dark blue night sky where the stars are interconnected by several curved white lines, partly hidden under circles of various shades of gray, somewhat like raindrops on a camera lens.]

Preface

Once upon a time, there was a coder who declared: “I can beat the neural network by using computer vision on this problem…”. This post is the (true) story of that classification attempt: what the challenges were along the way and what the outcome was.

The challenge

The need for this kind of classification emerged during the development of the Document Computer Vision Reader (DocCVR). This piece of software takes a photograph of a document as input and produces the data fields of the document in a structured format as output. It does this with OpenCV’s feature matching, which locates the fields and crops them out of the image. The image of each field can then be “read”, i.e. translated into text, using Tesseract.

DocCVR needs to convert images to text, among other things: a person’s id, dates, names, etc. But there are also trickier challenges, such as converting checkboxes to boolean values. When a field’s type is checkbox, the value is true if the checkbox is checked, false if it is not checked, and null if there is no checkbox in the image.
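As a sketch, this tri-state result maps naturally onto Python’s `Optional[bool]`. The label strings `checked` and `not_checked` are my own naming; `not_cb` follows the class name used later in this post:

```python
from typing import Optional

# The three classifier labels; not_cb = no checkbox found in the image.
# The label names are illustrative, not necessarily DocCVR's own.
LABEL_TO_VALUE = {
    "checked": True,
    "not_checked": False,
    "not_cb": None,
}

def checkbox_to_bool(label: str) -> Optional[bool]:
    """Map a checkbox-classifier label to the field value DocCVR stores."""
    if label not in LABEL_TO_VALUE:
        raise ValueError(f"unknown label: {label}")
    return LABEL_TO_VALUE[label]
```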

Preparation

We extracted about 4500 checkbox images from different documents using DocCVR. One of the first things you need to do is label them. This sounds like a walk in the park, but because the feature matching algorithm does not find the best keypoints every time, OpenCV’s warpPerspective operation cannot always straighten the image. This can be due to using the wrong template image, or to flaws or compression artifacts in the photo.

The rule we came up with is that an image represents a checkbox if at least two of its sides are visible in the image and, when multiple checkboxes are visible, no other checkbox appears in it.

So, now we have a dataset of checkboxes:

  • 1759 images where a checkbox is not checked
  • 948 images where a checkbox is checked
  • 1948 images where it is not possible to say definitively whether there is a checkbox (the not_cb class)

Classifier 1: feature-engineering approach

The first version of the checkbox detector was based on simple rules counting the number of pixels in the image. This attempt gave us about a 70 % success rate, which was far too low for our needs.
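A minimal sketch of such a pixel-counting rule, assuming binarized (0/255) grayscale crops. The ratio thresholds here are made up for illustration, not the ones used in DocCVR:

```python
import numpy as np

def classify_by_pixel_count(binary_img: np.ndarray) -> str:
    """Crude checkbox classifier: count dark (ink) pixels in a 0/255 image.

    The ratio thresholds below are illustrative only; in practice they
    would be tuned against labeled data.
    """
    dark_ratio = np.mean(binary_img < 128)
    if dark_ratio < 0.02:    # almost no ink: probably no checkbox at all
        return "not_cb"
    if dark_ratio < 0.15:    # a thin frame only: an empty checkbox
        return "not_checked"
    return "checked"         # frame plus a mark inside
```

Rules like this fall apart as soon as lighting, scale, or stroke width varies, which is consistent with the roughly 70 % success rate.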

The next attempt was to build the checkbox detector with OpenCV’s GoodFeaturesToTrack (GFTT). The GFTT algorithm finds corners in the image quite well, and from these corners it is possible to calculate candidate checkbox areas, as in the image below, where the green contour represents the detected checkbox area. The algorithm includes a few further steps that reject images which cannot be interpreted as checkboxes and, if multiple checkboxes are found, select the most obvious contour.

Detected checkbox from image (green color)

With this algorithm, the results are:

While this is not actually a bad result, the not_cb class still needs improvement. So the next attempt was to make use of OpenCV’s LineSegmentDetector and write an algorithm around it. This time we detect lines in the image and, again, calculate candidate checkbox contours. In the image below, the selected contour is drawn in magenta.

Detected checkbox from image (magenta color)

Combining these two algorithms, the results were:

This algorithm was much better at not classifying images without a checkbox as checkboxes. Its execution time is about 5 ms per image. Not bad again, but maybe the result can still be improved. For comparison, this combined algorithm was run on the same test set as the fine-tuned ResNet described below:

Classifier 2: pre-trained neural network approach

Then it was time to try ResNet18 with transfer learning on our dataset. We have an in-house framework for estimating how different hyperparameters affect the outcome, so it was fairly simple to run several iterations to determine which parameters to use in the final training. First, the dataset was split into training and test sets: about 2900 checkbox images for training and 931 for testing.

And after dozens of iterations the results look like this:

Now the results are good enough for our purposes, even though they could still be better. The execution time for one prediction was about 50 ms, which is not a problem in our case. When checking the results manually, it became quite obvious that labeling is hard. Some of the “wrong” predictions could be “corrected” by changing the labels, but that depends on who the judge is.

Conclusion

My first thought, at least, was that OpenCV would provide lots of good algorithms to solve this problem (of labeling checkbox images) effectively; and it does. In a perfect world this would be the way to go, and it would take minimal computing resources to label these images. In the real world, however, the images are anything but clean: the lighting or contrast is sometimes terrible, and/or there are shadows in the images. The main difficulty for the computer vision approach is therefore the thresholding the algorithms require; neither Otsu’s method nor adaptive thresholding can handle every situation.

And then there is the time needed to write the necessary code. It took about two weeks to implement the logic around the OpenCV algorithms, versus a few days to get these results from the deep neural network (DNN), mostly thanks to the previously implemented ai-dnn-framework.

So, in the real world, it seems that a DNN adapts more easily to the varying problems in the input material. The next time we face a similar problem, it might be feasible to try the DNN journey first…

Thanks

To AiRo team for the data

To Business Unit for the labeling

To Mika Juuti for AI expertise and the ai-dnn-framework

To Adrian Rosebrock @ pyimagesearch.com for sharing knowledge on OpenCV
