Table Detection Using Layout Parser

Shashank · Published in iReadRx · Sep 9, 2021


Table detection is arguably one of the most important features in any PDF analysis application, especially when analyzing patents. At iReadRx, we have experimented with various table detection models and libraries to meet our needs (effectively detecting tables in chemical patents). This blog post summarizes these experiments: what worked, what didn't, and what we plan to improve.

Table Detection Approaches

Basic Image operations

Tables contain certain features like boundaries that are easily distinguishable from their surroundings.

OpenCV makes it easy to detect edges like these. In this blog post, Hucker Marius demonstrates how to detect tables using this approach. He then uses OCR to extract the text from the detected table cells and store it in a CSV file.

Deep Learning

The previous approach has obvious limitations: not all tables have visible boundaries, and it doesn't do a good job of detecting tables in arbitrary regions of a page.

The task of detecting a table in a page image can be approached by treating it as an object detection problem. Several state-of-the-art models exist for table detection, such as CascadeTabNet and TableNet, and their pre-trained weights were able to detect tables and table cells accurately.

We were pretty satisfied with these results UNTIL… I accidentally stumbled upon Layout Parser.

Layout Parser

Layout Parser is perhaps one of the most underrated libraries when it comes to table detection.

After stumbling upon it, I realized it could do more than just table detection.

I chose the ‘faster_rcnn_R_50_FPN_3x’ model trained on the ‘PubLayNet’ dataset. Apart from detecting tables, this model can also detect titles, paragraphs, lists, and figures. This was basically the all-in-one solution we were looking for: we would no longer have to build a different model every time we needed to detect a new entity in a patent document. This saves time and effort, especially during deployment, because different models require different dependencies, and finding versions of PyTorch and CUDA that support all of them would be a nightmare.

Layout Parser uses Detectron2 under the hood, ensuring that we rely on state-of-the-art detection models.

My favorite part about layout parser, however, would be the ease of running inference.
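To give a feel for that ease of use, here is a sketch of inference with the ‘faster_rcnn_R_50_FPN_3x’ PubLayNet model mentioned above. The `label_map` lists PubLayNet's five standard classes; `"page.png"` is a hypothetical input image, and `pad_box` is a small helper of my own for cropping detected tables with a margin, not part of the Layout Parser API.

```python
import os

def pad_box(x1, y1, x2, y2, pad, width, height):
    """Expand a detected box by `pad` pixels, clamped to the image bounds."""
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(width, x2 + pad), min(height, y2 + pad))

def detect_tables(image_path, score_thresh=0.8):
    # Imported lazily: Layout Parser's Detectron2 backend requires
    # detectron2, which pins specific PyTorch/CUDA versions.
    import cv2
    import layoutparser as lp

    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", score_thresh],
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    )
    image = cv2.imread(image_path)[..., ::-1]  # OpenCV loads BGR; model expects RGB
    layout = model.detect(image)
    # Keep only the blocks classified as tables.
    return [block for block in layout if block.type == "Table"]

if os.path.exists("page.png"):  # hypothetical page image
    for table in detect_tables("page.png"):
        print(table.coordinates)
```

Filtering the returned layout by `block.type` is also how you would pull out titles or figures instead of tables, which is exactly the all-in-one convenience described above.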

Results

OCR limitations

Layout Parser supports two OCR engines: Tesseract and Google Cloud Vision. Both are very good at detecting and extracting the text present in a table. However, tables in chemical patents contain more than just simple text.

Neither OCR engine can recognize the chemical structures drawn inside them, so for tables like these, OCR tools aren't very useful.

If you aren't working with chemical patents, these OCR limitations may not affect you. After detecting and cropping tables, you can use the OCR approach from Hucker Marius's blog to extract text from your tables, or use Layout Parser's built-in OCR support.
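One practical detail of the crop-then-OCR route: Tesseract-style engines typically return word-level boxes, which still need to be grouped back into table rows before being saved as a CSV. The `word_boxes` below are made-up OCR output, and `group_into_rows` with its `y_tol` threshold is my own illustrative helper, not Layout Parser API.

```python
import csv
import io

def group_into_rows(word_boxes, y_tol=10):
    """Group (text, x, y) word boxes into rows of left-to-right cell text."""
    rows = []
    for text, x, y in sorted(word_boxes, key=lambda b: (b[2], b[1])):
        if rows and abs(y - rows[-1][0]) <= y_tol:
            rows[-1][1].append((x, text))  # close in y: same row
        else:
            rows.append((y, [(x, text)]))  # start a new row
    # Sort each row's cells left to right and drop the coordinates.
    return [[t for _, t in sorted(cells)] for _, cells in rows]

# Hypothetical OCR output for a 2x2 table.
word_boxes = [
    ("Compound", 50, 10), ("Yield", 200, 12),
    ("Ex-1", 52, 60), ("87%", 201, 58),
]

table = group_into_rows(word_boxes)
buffer = io.StringIO()
csv.writer(buffer).writerows(table)
print(buffer.getvalue())
# -> Compound,Yield
#    Ex-1,87%
```

A fixed pixel tolerance like `y_tol=10` is a simplification; in practice you would scale it to the detected text height.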

Colab Tutorial
