TableNet: Deep Learning Model for End-to-end Table Detection and Tabular Data Extraction From
Scanned Document Images

Devi Prasad · Published in Analytics Vidhya · Apr 20, 2021

Computer vision is the medium through which computers see and identify objects. The goal of computer vision is to enable computers to analyze objects in images and videos and solve different vision problems. Object segmentation has paved the way for convenient analysis of objects in images and videos, contributing immensely to fields such as medicine, perception in self-driving cars, and background editing in images and videos.

In this blog I will discuss the research paper TableNet, a deep learning model for both table detection and structure recognition from document images, which works by segmenting out the table and column regions.

Table of Contents

  1. Introduction
  2. Dataset Source
  3. Problem Statement
  4. Mapping to ML/DL Problem
  5. Dataset Preparation
  6. Data Preprocessing
  7. Model Development
  8. Table Data Extraction
  9. Deployment
  10. Future Work
  11. Profile
  12. References

1. Introduction

With an increasing number of mobile devices equipped with cameras, more and more customers are uploading documents via these devices, making the need for information extraction from these images more pressing. Currently, these document images are often processed manually, resulting in high labour costs and inefficient data-processing times.

Most existing approaches to tabular information extraction divide the problem into two separate sub-problems, table detection and table structure recognition, and attempt to solve each sub-problem independently.

TableNet takes a single input image and produces two different semantically labelled output images, one for tables and one for columns. The model does this by using a pretrained VGG-19 as the base network, followed by two decoder branches that operate on the features extracted by VGG-19. One decoder branch is responsible for segmenting the table region and the other for segmenting the column region. After the table and column regions are detected, the tabular data can be extracted using Tesseract OCR.

2. Dataset Source

The TableNet model is trained on the Marmot dataset, which contains scanned document images and their corresponding XML files. Each XML file contains information about the location of the table and column regions in the respective scanned document image.

3. Problem Statement

  1. Segment out table regions from an image, if any table-like structure is present.
  2. Extract data from the detected table.

4. Mapping to ML/DL Problem

To extract table information from a given input image, we need to segment out the table and column regions. We can treat the scanned image as the input and the table mask and column mask as the outputs. Thus we need to classify each pixel, i.e. decide whether a pixel belongs to a table (or column) or not. We can therefore interpret this problem as a pixel-wise classification problem, i.e. a segmentation task.

Performance metric: F1-score.
The F1-score weighs precision and recall equally, i.e. false negatives and false positives are penalized equally.
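For intuition, a per-pixel F1 on a predicted binary mask can be computed roughly as follows (a minimal NumPy sketch, not the paper's evaluation code):

```python
import numpy as np

def pixel_f1(pred_mask, true_mask):
    """Per-pixel F1 score for binary segmentation masks (values 0/1)."""
    tp = np.sum((pred_mask == 1) & (true_mask == 1))  # true positives
    fp = np.sum((pred_mask == 1) & (true_mask == 0))  # false positives
    fn = np.sum((pred_mask == 0) & (true_mask == 1))  # false negatives
    precision = tp / (tp + fp + 1e-7)
    recall = tp / (tp + fn + 1e-7)
    return 2 * precision * recall / (precision + recall + 1e-7)
```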

5. Dataset Preparation

From the given Marmot dataset we have scanned document images in .BMP format and corresponding XML files. Each XML file contains the coordinates of every column present in an image.

Sample input image
Sample XML file for the given image

The XML file contains several elements, such as filename, path, size and object. filename gives the name of the corresponding image, size gives the size of the input image, and each object element corresponds to one column in the input image and holds its coordinates.

For every <object> there is a <bndbox> element, and inside <bndbox> we have xmin, ymin, xmax and ymax, which give the column coordinates (xmin, ymin) and (xmax, ymax). From the given XML file we need to create a table mask and a column mask image. Below I have given code for extracting this information from the XML file in order to create the table and column masks. While creating a mask from the XML file, the boxed region is filled with the pixel value 255 and everything outside a box is set to 0.

Sample code for creating masks from an XML file
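In case the embedded gist does not render, here is a minimal sketch of that step. It assumes Pascal-VOC-style XML with <object>/<bndbox> elements, as described above, and uses xml.etree.ElementTree, NumPy and Pillow; the function name is illustrative.

```python
import xml.etree.ElementTree as ET
import numpy as np
from PIL import Image

def create_column_mask(xml_path, height, width):
    """Fill every <bndbox> (column) region with 255 and leave the rest 0."""
    mask = np.zeros((height, width), dtype=np.uint8)
    root = ET.parse(xml_path).getroot()
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        xmin = int(float(box.find('xmin').text))
        ymin = int(float(box.find('ymin').text))
        xmax = int(float(box.find('xmax').text))
        ymax = int(float(box.find('ymax').text))
        mask[ymin:ymax, xmin:xmax] = 255
    # The table mask can be built the same way by filling the single
    # bounding box that encloses all of the column boxes.
    return Image.fromarray(mask)
```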

For every input image we need to generate a table mask and a column mask. We save the input image and the two masks in JPEG format. A pandas dataframe is then created with three columns: original_image_path, table_mask_path and column_mask_path.

Sample Column Mask
Sample Table Mask

6. Data Preprocessing

The research paper suggests resizing the input image to 1024x1024. An image is nothing but a matrix; our input images are RGB, so every pixel has three values (red, green, blue). In order to train our TableNet model we need to load images into memory and then begin the training process. Due to memory constraints, however, all images cannot be loaded into memory at once. Therefore we need a data loader that holds images in batches and passes each batch to the model for training.

A Dataset object is created from the pandas series of file paths.

After creating dataset objects for both train and test, we read each image along with its saved table mask and column mask. Document images are resized to 1024x1024x3 and the mask images are resized to 1024x1024x1, i.e. grayscale. Each pixel value in a document image ranges between 0 and 255, so each image is normalized by dividing by 255.0 to scale the pixel values to the range 0-1.

The dataset objects are given a batch size, so that they emit images in batches of that size.
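A minimal tf.data sketch of such a loader is shown below, assuming the dataframe `df` with the three path columns created earlier (the variable names and the output-key names are illustrative, not the official implementation):

```python
import tensorflow as tf

def load_example(image_path, table_mask_path, column_mask_path):
    """Read and preprocess one (image, table mask, column mask) triple."""
    image = tf.image.decode_jpeg(tf.io.read_file(image_path), channels=3)
    image = tf.image.resize(image, [1024, 1024]) / 255.0
    table_mask = tf.image.decode_jpeg(tf.io.read_file(table_mask_path), channels=1)
    table_mask = tf.image.resize(table_mask, [1024, 1024]) / 255.0
    column_mask = tf.image.decode_jpeg(tf.io.read_file(column_mask_path), channels=1)
    column_mask = tf.image.resize(column_mask, [1024, 1024]) / 255.0
    return image, {'table_mask': table_mask, 'column_mask': column_mask}

# df is the pandas dataframe with the three path columns described above
train_ds = (tf.data.Dataset
            .from_tensor_slices((df.original_image_path.values,
                                 df.table_mask_path.values,
                                 df.column_mask_path.values))
            .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(2)
            .prefetch(tf.data.AUTOTUNE))
```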

7. Model Development

The TableNet model mainly consists of 3 parts:
i. Encoder (VGG-19)
ii. Decoder (Table Mask Generator)
iii. Decoder (Column Mask Generator)

TableNet Model Architecture

The intuition behind the TableNet model is to extract features from the input image using a pre-trained VGG-19 model; the extracted features are then processed through two decoder branches to generate the masked outputs. The encoder downsamples the image and the decoders upsample it back to the input resolution.

i. Encoder

The model takes an input of dimension 1024x1024x3. The input image is passed through the pre-trained VGG-19 model without its fully connected layers, producing a feature map that is passed on to the two decoder branches.

The downsampled feature map is then processed through two 1x1 Conv2D layers.

The intuition behind using 1x1 convolutions is to reduce the number of feature-map channels, which are then used for the per-pixel class prediction.
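As a rough Keras illustration (not the authors' official code), the encoder could look like the sketch below. The choice of block3_pool/block4_pool for the skip connections and the dropout rate follow the paper's description, but treat the exact values as assumptions:

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG19

def build_encoder(input_shape=(1024, 1024, 3)):
    """VGG-19 backbone (no fully connected layers) plus two 1x1 convolutions."""
    base = VGG19(weights='imagenet', include_top=False, input_shape=input_shape)
    pool3 = base.get_layer('block3_pool').output   # high-resolution skip feature
    pool4 = base.get_layer('block4_pool').output   # high-resolution skip feature
    x = base.get_layer('block5_pool').output       # downsampled feature map (32x32)
    x = layers.Conv2D(512, (1, 1), activation='relu')(x)
    x = layers.Dropout(0.8)(x)
    x = layers.Conv2D(512, (1, 1), activation='relu')(x)
    x = layers.Dropout(0.8)(x)
    return Model(inputs=base.input, outputs=[x, pool3, pool4])
```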

ii. Decoder (Table Mask)

The feature map, after passing through the two Conv2D layers, is processed through one more 1x1 Conv2D layer. Then, using the skip-pooling technique, the low-resolution feature maps of the decoder network are combined with the high-resolution features of the encoder network. After upsampling we get an output table mask of shape 1024x1024x2. The output has 2 channels because there are 2 class labels (background, masked region). To predict the output value of a pixel, we pick the class with the higher predicted probability.

iii. Decoder(Column mask)

The feature vector from the 1x1 Conv2D layer is also passed to the column-mask decoder, but unlike the table decoder the input is first processed through two 1x1 Conv2D layers; the skip-pooling technique is then used to upsample the image. The output of the column decoder is also 1024x1024x2.
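A corresponding sketch of the table decoder branch is given below (again an illustration, not the official implementation). The column branch is built the same way except that it starts with two 1x1 Conv2D layers instead of one, as noted above; the layer sizes are assumptions chosen so the spatial dimensions line up for a 1024x1024 input.

```python
from tensorflow.keras import layers

def table_decoder(x, pool3, pool4):
    """Table branch: 1x1 conv, upsample, fuse encoder features, upsample to full size."""
    x = layers.Conv2D(512, (1, 1), activation='relu')(x)
    x = layers.UpSampling2D(size=(2, 2))(x)     # 32x32 -> 64x64
    x = layers.Concatenate()([x, pool4])        # skip connection from block4_pool
    x = layers.UpSampling2D(size=(2, 2))(x)     # 64x64 -> 128x128
    x = layers.Concatenate()([x, pool3])        # skip connection from block3_pool
    x = layers.UpSampling2D(size=(8, 8))(x)     # 128x128 -> 1024x1024
    # 2 channels = 2 class labels (background, table region)
    return layers.Conv2D(2, (1, 1), activation='softmax', name='table_mask')(x)
```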

8. Table Data Extraction

An input image of shape (1, 1024, 1024, 3) is passed to the model, which predicts a column mask and a table mask, each of shape (1, 1024, 1024, 2). We extract the class with the maximum probability for each pixel using argmax and save the resulting masks in PNG format. We then use the table mask as an alpha channel on the input image in order to mask out everything outside the table region.
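A hedged sketch of that step, assuming `model` is the trained two-output Keras model from above and `image` is a normalized 1024x1024x3 array:

```python
import numpy as np

# Predict both masks; each output has shape (1, 1024, 1024, 2)
pred_table_mask, pred_col_mask = model.predict(image[np.newaxis, ...])

# Pick the class with the highest probability per pixel -> binary masks (1024, 1024)
table_mask = np.argmax(pred_table_mask[0], axis=-1).astype(np.uint8)
col_mask = np.argmax(pred_col_mask[0], axis=-1).astype(np.uint8)

# Keep only the pixels that fall inside the predicted table region
masked_image = (image * table_mask[..., np.newaxis] * 255).astype(np.uint8)
```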

Masked-out image

The masked image is then processed with Tesseract OCR in order to extract the tabular data.
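With pytesseract installed (and the Tesseract binary available on the system), that step can be as simple as:

```python
import pytesseract
from PIL import Image

# Run OCR on the table-masked image to pull out the tabular text
text = pytesseract.image_to_string(Image.fromarray(masked_image))
print(text)
```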

9. Deployment

The model was deployed on a local system using Flask. The code can be found here.
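A minimal Flask sketch of such a deployment is shown below; `run_tablenet_and_ocr` is a hypothetical placeholder for the prediction-plus-OCR pipeline described above, not a function from the repository:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/extract', methods=['POST'])
def extract():
    # Hypothetical endpoint: receive an image, run TableNet + OCR, return the text
    uploaded = request.files['image']
    text = run_tablenet_and_ocr(uploaded)  # placeholder for the pipeline above
    return jsonify({'table_text': text})

if __name__ == '__main__':
    app.run(debug=True)
```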

10. Future Work

a. The performance of the model can be increased further, but we are limited by the lack of computational power.

11. Profile

Connect with me on LinkedIn. This is the GitHub repository for the entire code.

12. References

  1. https://arxiv.org/pdf/2001.01469.pdf
  2. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
  3. https://www.tensorflow.org/tutorials/images/segmentation
