Extract Tabular Data from PDF in Spark OCR

Mykola Melnyk
Published in spark-nlp
Jul 21, 2021

Introduction to Table Extraction

The amount of data collected is increasing every day with many applications, tools, and online platforms booming in the current digital age. To make sense of, manage, and access this enormous amount of data quickly and productively, it’s necessary to use effective information extraction tools. One of the sub-areas demanding attention in the Information Extraction field is the fetching and accessing of data in tabular form.

If you have lots of paperwork and documents containing tables whose data you would like to manipulate, you could copy them manually (onto paper) or load them into Excel sheets. With table extraction, however, you can send tables as pictures to the computer, and it extracts all the information and puts it automatically into a new document. This can save a great amount of time with fewer errors.

For organizations, this is a huge benefit because tables are used frequently to represent data in a clean format. A lot of organizations have to deal with millions of tables every day. To save time and automate these laborious manual tasks, we need to resort to faster and more precise tools such as Spark OCR, which can quickly extract tabular data from PDFs.

We have written before about Table Detection & Extraction in Spark OCR, and in this post we cover extracting tabular data from PDFs in more detail.

Spark OCR can work with both searchable and scanned (image) PDF files.

1. Start Spark session with Spark OCR

import os
from sparkocr import start

# Set AWS credentials and the Spark OCR license key (placeholders)
os.environ["AWS_ACCESS_KEY_ID"] = AWS_ACCESS_KEY_ID
os.environ["AWS_SECRET_ACCESS_KEY"] = AWS_SECRET_ACCESS_KEY
os.environ["JSL_OCR_LICENSE"] = "license"

# Start the Spark session using the Spark OCR secret
spark = start(secret=secret, nlp_version="3.1.1")

During Spark session startup, the start function displays the following info:

Spark version: 3.0.2
Spark NLP version: 3.0.1
Spark OCR version: 3.5.0

In order to run the code, you will need a Spark OCR license, for which a 30-day free trial is available here.

2. Read PDF document

As an example, we will process a PDF file with a Budget Provisions table. Let’s read it as a binaryFile into a data frame and display the content using the display_pdf util function:

from sparkocr.transformers import *
from sparkocr.utils import display_pdf

pdf_df = spark.read.format("binaryFile").load(pdf_path)
display_pdf(pdf_df)
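
To sanity-check what was loaded, we can also inspect the standard columns that Spark’s binaryFile data source produces (path, modificationTime, length, content):

# Show the file path and size of each loaded PDF
pdf_df.select("path", "length").show(truncate=False)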

3. Define Spark OCR Pipeline

To convert each page of the PDF to an image we can use the PdfToImage transformer. It is designed for processing both small and big PDFs (up to a few thousand pages). It supports the following features:

  • Splitting big documents into small PDFs to utilize cluster resources effectively, so that processing of one big document can be distributed across all of the cluster’s nodes if needed. It supports a few splitting strategies: SplittingStrategy.FIXED_NUMBER_OF_PARTITIONS and SplittingStrategy.FIXED_SIZE_OF_PARTITION (see the configuration sketch after this list).
  • Repartitioning the data frame after splitting to avoid skew, which prevents the situation where one task of the job takes significantly longer than the others. Resources can thus be utilized more effectively.
  • Binarizing the images as early as possible to reduce memory usage and speed up processing.
  • Repartitioning the data frame again after extracting an image for each page, which further prevents skew in the data frame.
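
As a minimal configuration sketch: the setter names setSplittingStrategy and setPartitionNum, as well as the SplittingStrategy import path, are assumptions based on the enum names above, so check the Spark OCR API docs for the exact spelling:

from sparkocr.enums import SplittingStrategy  # import path assumed

pdf_to_image = PdfToImage()
pdf_to_image.setInputCol("content")
pdf_to_image.setOutputCol("image")
# Hypothetical setters: split a big PDF into a fixed number of partitions
pdf_to_image.setSplittingStrategy(SplittingStrategy.FIXED_NUMBER_OF_PARTITIONS)
pdf_to_image.setPartitionNum(8)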

The whole pipeline for table detection and extraction:

from pyspark.ml import PipelineModel
from sparkocr.enums import CellDetectionAlgos

# Convert pdf to image
pdf_to_image = PdfToImage()

# Detect tables on the page using a pretrained model.
# It can be fine-tuned to get more accurate results on more specific documents.
table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr")
table_detector.setInputCol("image")
table_detector.setOutputCol("region")

# Draw the detected table regions on the page
draw_regions = ImageDrawRegions()
draw_regions.setInputCol("image")
draw_regions.setInputRegionsCol("region")
draw_regions.setOutputCol("image_with_regions")

# Extract table regions to separate images
splitter = ImageSplitRegions()
splitter.setInputCol("image")
splitter.setInputRegionsCol("region")
splitter.setOutputCol("table_image")
splitter.setDropCols("image")

# Detect cells on the table image
cell_detector = ImageTableCellDetector()
cell_detector.setInputCol("table_image")
cell_detector.setOutputCol("cells")
cell_detector.setAlgoType(CellDetectionAlgos.MORPHOPS)

# Extract text from the detected cells
table_recognition = ImageCellsToTextTable()
table_recognition.setInputCol("table_image")
table_recognition.setCellsCol("cells")
table_recognition.setMargin(3)
table_recognition.setStrip(True)
table_recognition.setOutputCol("table")

pipeline = PipelineModel(stages=[
    pdf_to_image,
    table_detector,
    draw_regions,
    splitter,
    cell_detector,
    table_recognition
])
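
Note that every stage here is a transformer, so we can assemble them directly into a PipelineModel with no fit() call; the model is applied with transform() in the next step.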

ImageTableCellDetector detects cells and supports two algorithms:

  • CellDetectionAlgos.MORPHOPS can work with bordered, borderless, and combined tables.
  • CellDetectionAlgos.CONTOURS can work only with bordered tables, but provides more accurate results (see the example after this list).
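
For example, if your documents contain only bordered tables, the cell detector from the pipeline above can be switched to the contour-based algorithm:

# More accurate on bordered tables; not suitable for borderless ones
cell_detector.setAlgoType(CellDetectionAlgos.CONTOURS)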

For more details about the Table Detection & Extraction pipeline, please read here.

4. Run pipeline and show results

Let’s run our pipeline and show the detected table on the page:

from sparkocr.utils import display_images_horizontal

results = pipeline.transform(pdf_df).cache()
display_images_horizontal(results, "image, image_with_regions", limit=10)

We can get the table coordinates together with a probability score from the region field:

results.select("region").show(10, False)+-------------------------------------------------+
|region |
+-------------------------------------------------+
|[0, 0, 83.0, 260.0, 3308.0, 1953.0, 0.9999957, 0]|
+-------------------------------------------------+
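
Reading the region struct fields roughly as (index, page, x, y, width, height, score, …), the single detected table starts at about x=83, y=260, spans about 3308×1953 pixels, and was detected with a confidence score of ~0.9999957. The exact field names can be checked with results.select("region").printSchema().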

And finally, we can show the detected structured data from the table field:

import pyspark.sql.functions as f

exploded_results = results.select("table", "region") \
    .withColumn("cells", f.explode(f.col("table.chunks"))) \
    .select([f.col("region.index").alias("table")] +
            [f.col("cells")[i].getField("chunkText").alias(f"col{i}") for i in range(0, 8)])

exploded_results.show(20, True)
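
From here the extracted table is an ordinary Spark DataFrame, so it can be exported for downstream use, for example to CSV via pandas (the output filename is just an illustration):

# Collect the extracted table to the driver and save it as CSV
exploded_results.toPandas().to_csv("budget_provisions_table.csv", index=False)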
