Extraction of Tables from PDFs using AWS Textract

Apr 17, 2024

Introduction

AWS Textract is a powerful service provided by Amazon Web Services, designed for optical character recognition (OCR) and information extraction from PDFs and scanned documents. Tables, often embedded within PDFs, are critical data structures used across a multitude of domains such as financial reports, bank statements, and academic research papers.

In this guide, we will explore how to use AWS Textract to extract tables from PDF documents. We will also look at capturing and storing the coordinates of the tables, their cells, and the individual text elements inside them, so that the extracted data can be mapped back to its position on the page.

Code Implementation

Step 1: Set Up Your AWS Environment

Create an AWS Account: Sign up or log in to your AWS account at https://aws.amazon.com.
Navigate to the IAM Management Console.
Create a new IAM user or role with access to the Amazon Textract service.
Attach policies such as AmazonTextractFullAccess and AmazonS3ReadOnlyAccess for necessary S3 bucket interaction.
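The managed policies above are broad. As a hedged alternative, a narrower policy could allow only the two actions this guide relies on: the Textract analysis call and reading the document from S3. The sketch below builds such a policy document in Python; the helper name build_minimal_policy is hypothetical, and the bucket name is the same placeholder used later in this guide.

```python
import json

def build_minimal_policy(bucket_name):
    """Return an IAM policy document (as a dict) limited to this guide's needs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Allow the synchronous analysis call used in Step 4.
                "Effect": "Allow",
                "Action": ["textract:AnalyzeDocument"],
                "Resource": "*",
            },
            {   # Allow reading objects (the uploaded document) from the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }

# Print the JSON so it can be pasted into the IAM policy editor.
print(json.dumps(build_minimal_policy("your-bucket-name"), indent=2))
```

If you later call other Textract operations, the Action list would need to grow accordingly.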

Step 2: Prepare Your Document

Upload Document:
Upload the document from which you want to extract tables to an S3 bucket.
Note the bucket name and document path.
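If you prefer to upload from code rather than the console, a minimal sketch using the boto3 S3 client could look like the following. The helper name upload_document and the file name report.pdf are hypothetical. Note that uploading needs write permission on the bucket, which AmazonS3ReadOnlyAccess does not grant.

```python
def upload_document(s3_client, local_path, bucket_name, document_path):
    """Upload a local file to S3 and return the (bucket, key) pair to pass to Textract."""
    # upload_file(Filename, Bucket, Key) is the S3 client's managed upload call.
    s3_client.upload_file(local_path, bucket_name, document_path)
    return bucket_name, document_path

# Usage (requires boto3 and valid credentials):
#   import boto3
#   upload_document(boto3.client("s3"), "report.pdf",
#                   "your-bucket-name", "your-document-path")
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object during testing.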

Step 3: Install Required Libraries

Install AWS SDK:

pip install boto3

Step 4: Writing the Code

Import Libraries: boto3 is the AWS SDK for Python and is used to interact with AWS services.

import boto3

Initialize the Textract Client: The Textract client can be configured using the code below. Make sure to use the right region, as different regions have different rate limits for the analyze_document endpoint that is used to extract the tables.

textract = boto3.client('textract', region_name='your-region')
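Because rate limits vary by region, calls can occasionally be throttled. boto3 applies its own retry behaviour by default, so the wrapper below is only an illustration of the idea: it retries a call with exponential backoff when the error code indicates throttling. The helper name call_with_backoff and its parameters are assumptions made for this sketch.

```python
import random
import time

# Error codes Textract returns when a request is throttled.
THROTTLE_CODES = {"ThrottlingException", "ProvisionedThroughputExceededException"}

def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(**kwargs), retrying with exponential backoff when throttled."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(**kwargs)
        except Exception as exc:
            # botocore's ClientError exposes the service error code in .response.
            response = getattr(exc, "response", None)
            code = response.get("Error", {}).get("Code") if isinstance(response, dict) else None
            if code not in THROTTLE_CODES or attempt == max_attempts:
                raise
            # Wait base_delay * 2^(attempt - 1) seconds plus a little jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))

# Usage with the client created above:
#   response = call_with_backoff(textract.analyze_document,
#                                Document={...}, FeatureTypes=['TABLES'])
```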

Call Textract to Analyze Document: analyze_document is the operation used to extract tables. The code snippet below reads the PDF from S3 using the bucket name and document path noted in Step 2. We also set FeatureTypes to TABLES.

response = textract.analyze_document(
    Document={'S3Object': {'Bucket': 'your-bucket-name', 'Name': 'your-document-path'}},
    FeatureTypes=['TABLES']
)
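The synchronous analyze_document call handles single-page documents. For multi-page PDFs, Textract provides the asynchronous start_document_analysis and get_document_analysis operations. The sketch below shows one way to poll for the result and follow pagination; the helper name analyze_tables_async and the polling interval are assumptions, and it treats any final status other than SUCCEEDED as a failure.

```python
import time

def analyze_tables_async(textract_client, bucket_name, document_path,
                         poll_seconds=5, sleep=time.sleep):
    """Run an asynchronous TABLES analysis and return all Blocks across pages."""
    job = textract_client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket_name, "Name": document_path}},
        FeatureTypes=["TABLES"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = textract_client.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        sleep(poll_seconds)
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {result['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result.get("Blocks", []))
    while "NextToken" in result:
        result = textract_client.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks

# Usage with the client created above:
#   blocks = analyze_tables_async(textract, 'your-bucket-name', 'your-document-path')
```

The returned list has the same Block structure as the synchronous response, so the extraction code later in this guide applies to it as well.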

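Below is a simplified, illustrative sketch of the analyze_document response. It differs from a real response in a few ways: Textract does not return ROW blocks, so a TABLE block lists its CELL blocks directly as children; a CELL block has no Text field of its own and instead points to WORD blocks that hold the text; and real Ids are UUIDs rather than names like table1.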

{
    "DocumentMetadata": {
        "Pages": 1
    },
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {
                "BoundingBox": {
                    "Width": 1.0,
                    "Height": 1.0,
                    "Left": 0,
                    "Top": 0
                },
                "Polygon": [
                    {"X": 0, "Y": 0},
                    {"X": 1, "Y": 0},
                    {"X": 1, "Y": 1},
                    {"X": 0, "Y": 1}
                ]
            },
            "Id": "page1",
            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": ["table1"]
                }
            ]
        },
        {
            "BlockType": "TABLE",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.5,
                    "Height": 0.5,
                    "Left": 0.25,
                    "Top": 0.25
                },
                "Polygon": [
                    {"X": 0.25, "Y": 0.25},
                    {"X": 0.75, "Y": 0.25},
                    {"X": 0.75, "Y": 0.75},
                    {"X": 0.25, "Y": 0.75}
                ]
            },
            "Id": "table1",
            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": ["row1", "row2"]
                }
            ]
        },
        {
            "BlockType": "ROW",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.5,
                    "Height": 0.1,
                    "Left": 0.25,
                    "Top": 0.25
                },
                "Polygon": [
                    {"X": 0.25, "Y": 0.25},
                    {"X": 0.75, "Y": 0.25},
                    {"X": 0.75, "Y": 0.35},
                    {"X": 0.25, "Y": 0.35}
                ]
            },
            "Id": "row1",
            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": ["cell1", "cell2"]
                }
            ]
        },
        {
            "BlockType": "CELL",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.25,
                    "Height": 0.1,
                    "Left": 0.25,
                    "Top": 0.25
                },
                "Polygon": [
                    {"X": 0.25, "Y": 0.25},
                    {"X": 0.5, "Y": 0.25},
                    {"X": 0.5, "Y": 0.35},
                    {"X": 0.25, "Y": 0.35}
                ]
            },
            "Id": "cell1",
            "Text": "Header 1",
            "RowIndex": 1,
            "ColumnIndex": 1
        },
        {
            "BlockType": "CELL",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.25,
                    "Height": 0.1,
                    "Left": 0.5,
                    "Top": 0.25
                },
                "Polygon": [
                    {"X": 0.5, "Y": 0.25},
                    {"X": 0.75, "Y": 0.25},
                    {"X": 0.75, "Y": 0.35},
                    {"X": 0.5, "Y": 0.35}
                ]
            },
            "Id": "cell2",
            "Text": "Header 2",
            "RowIndex": 1,
            "ColumnIndex": 2
        },
        {
            "BlockType": "ROW",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.5,
                    "Height": 0.1,
                    "Left": 0.25,
                    "Top": 0.35
                },
                "Polygon": [
                    {"X": 0.25, "Y": 0.35},
                    {"X": 0.75, "Y": 0.35},
                    {"X": 0.75, "Y": 0.45},
                    {"X": 0.25, "Y": 0.45}
                ]
            },
            "Id": "row2",
            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": ["cell3", "cell4"]
                }
            ]
        },
        {
            "BlockType": "CELL",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.25,
                    "Height": 0.1,
                    "Left": 0.25,
                    "Top": 0.35
                },
                "Polygon": [
                    {"X": 0.25, "Y": 0.35},
                    {"X": 0.5, "Y": 0.35},
                    {"X": 0.5, "Y": 0.45},
                    {"X": 0.25, "Y": 0.45}
                ]
            },
            "Id": "cell3",
            "Text": "Data 1",
            "RowIndex": 2,
            "ColumnIndex": 1
        },
        {
            "BlockType": "CELL",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.25,
                    "Height": 0.1,
                    "Left": 0.5,
                    "Top": 0.35
                },
                "Polygon": [
                    {"X": 0.5, "Y": 0.35},
                    {"X": 0.75, "Y": 0.35},
                    {"X": 0.75, "Y": 0.45},
                    {"X": 0.5, "Y": 0.45}
                ]
            },
            "Id": "cell4",
            "Text": "Data 2",
            "RowIndex": 2,
            "ColumnIndex": 2
        }
    ]
}
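Once the blocks are available, the RowIndex and ColumnIndex fields are enough to rebuild the table as a grid. The sketch below is one way to do that, assuming 1-based indices as in the sample. The helper name table_to_grid is hypothetical. It reads cell text from WORD children when they exist and falls back to a Text field on the cell, so it works with both the simplified shape above and a response where cells are direct children of the table.

```python
def table_to_grid(blocks, table_block):
    """Return the table's cell text as a list of rows (lists of strings)."""
    by_id = {b["Id"]: b for b in blocks}

    def children(block):
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    yield by_id[child_id]

    def cells_under(block):
        # Descend through any intermediate blocks until CELL blocks are reached.
        for child in children(block):
            if child["BlockType"] == "CELL":
                yield child
            else:
                yield from cells_under(child)

    def cell_text(cell):
        # Prefer WORD children; fall back to a Text field if the cell has one.
        words = [c["Text"] for c in children(cell) if c["BlockType"] == "WORD"]
        return " ".join(words) if words else cell.get("Text", "")

    cells = list(cells_under(table_block))
    if not cells:
        return []
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
    return grid
```

A grid in this form can be handed to csv.writer or a DataFrame constructor without further reshaping.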

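Extract and Print Table Data and Coordinates: The snippet below walks each TABLE block, looks up its child CELL blocks, and prints each cell's text and bounding box. In an actual response a CELL block carries no Text field; its text is held by the WORD blocks listed as its children, so the snippet joins those words.

# Index blocks by Id so child lookups do not rescan the whole list.
blocks_by_id = {b['Id']: b for b in response['Blocks']}

def child_ids(block):
    # Yield the Ids listed under a block's CHILD relationships.
    for relationship in block.get('Relationships', []):
        if relationship['Type'] == 'CHILD':
            yield from relationship['Ids']

for block in response['Blocks']:
    if block['BlockType'] != 'TABLE':
        continue
    print(f"Table detected: {block['Id']}")
    for cell_id in child_ids(block):
        cell = blocks_by_id[cell_id]
        if cell['BlockType'] != 'CELL':
            continue
        # Join the text of the cell's WORD children to get the cell text.
        words = [blocks_by_id[i] for i in child_ids(cell)]
        text = ' '.join(w['Text'] for w in words if w['BlockType'] == 'WORD')
        box = cell['Geometry']['BoundingBox']
        print(f"Cell {cell['RowIndex']}-{cell['ColumnIndex']}:")
        print(f"  Text: {text}")
        print("  Coordinates:")
        print(f"    Top: {box['Top']}")
        print(f"    Left: {box['Left']}")
        print(f"    Width: {box['Width']}")
        print(f"    Height: {box['Height']}")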
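The bounding box values are ratios of the page dimensions, not pixels. To draw a box on a rendered page image or crop a cell, they need to be scaled by the image size. A minimal sketch follows; the helper name to_pixel_box and the 1000 x 2000 page size are assumptions chosen for illustration.

```python
def to_pixel_box(bounding_box, page_width, page_height):
    """Convert a Textract BoundingBox (ratios of page size) to pixel corners."""
    left = bounding_box["Left"] * page_width
    top = bounding_box["Top"] * page_height
    right = left + bounding_box["Width"] * page_width
    bottom = top + bounding_box["Height"] * page_height
    # Round to whole pixels and return (left, top, right, bottom).
    return tuple(round(v) for v in (left, top, right, bottom))

# First cell of the sample response, on a 1000 x 2000 pixel page image.
box = {"Width": 0.25, "Height": 0.1, "Left": 0.25, "Top": 0.25}
print(to_pixel_box(box, 1000, 2000))  # (250, 500, 500, 700)
```

The (left, top, right, bottom) order matches what common imaging libraries expect for crop and rectangle operations.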

Step 5: Running Your Code

Run your script in your preferred Python environment.
Ensure your AWS credentials (access key ID and secret access key) are configured correctly in your environment.

Sample Output

Table detected: TableId
Cell 1-1:
  Text: Example Text
  Coordinates:
    Top: 0.123
    Left: 0.456
    Width: 0.789
    Height: 0.012
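To store these values rather than print them, each cell can be flattened into a record and written out. The sketch below uses the standard csv module; the column names and the helper names cell_record and write_records are assumptions for this example, not part of the Textract response.

```python
import csv
import io

FIELDS = ["table_id", "row", "column", "text", "top", "left", "width", "height"]

def cell_record(table_id, cell, text):
    """Flatten one CELL block into a dict suitable for CSV or JSON storage."""
    box = cell["Geometry"]["BoundingBox"]
    return {
        "table_id": table_id, "row": cell["RowIndex"], "column": cell["ColumnIndex"],
        "text": text, "top": box["Top"], "left": box["Left"],
        "width": box["Width"], "height": box["Height"],
    }

def write_records(records, stream):
    """Write cell records as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)

# Example: write one record to an in-memory buffer and show the CSV text.
sample_cell = {"RowIndex": 1, "ColumnIndex": 2, "Geometry": {
    "BoundingBox": {"Top": 0.25, "Left": 0.5, "Width": 0.25, "Height": 0.1}}}
buffer = io.StringIO()
write_records([cell_record("table1", sample_cell, "Header 2")], buffer)
print(buffer.getvalue())
```

Replacing the in-memory buffer with open('cells.csv', 'w', newline='') would write the same rows to a file.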

Conclusion

This short Python script extracts table data from documents using AWS Textract. It gives you both the text and the bounding box coordinates of each cell, which is useful for applications that need to tie extracted values back to their position in the document. Whether you’re aiming to automate data entry, enhance document management systems, or feed tables into further analysis, the same pattern of calling analyze_document and walking the returned blocks applies.