Extraction of Tables from PDFs using AWS Textract
Introduction
AWS Textract is a powerful service provided by Amazon Web Services, designed for optical character recognition (OCR) and information extraction from PDFs and scanned documents. Tables, often embedded within PDFs, are critical data structures used across a multitude of domains such as financial reports, bank statements, and academic research papers.
In this guide, we will explore how to harness AWS Textract to effectively extract tables from PDF documents. We will also delve into the details of capturing and storing the precise coordinates of the cells, rows, and individual text elements within these tables, enabling a deeper analysis and utilization of the extracted data.
Code Implementation
Step 1: Set Up Your AWS Environment
- Create an AWS Account: Sign up or log in to your AWS account at https://aws.amazon.com.
- Navigate to the IAM Management Console.
- Create a new role with access to the Amazon Textract service.
- Attach policies such as AmazonTextractFullAccess and AmazonS3ReadOnlyAccess for the necessary S3 bucket interaction.
Step 2: Prepare Your Document
- Upload Document:
- Upload the document from which you want to extract tables to an S3 bucket.
- Note the bucket name and document path.
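If the file is still on your local machine, the upload can also be scripted with boto3. The following is a minimal sketch rather than part of the original walkthrough: the helper name upload_document and the file name in the usage comment are hypothetical, and it assumes an S3 client whose credentials are allowed to write to the bucket (the AmazonS3ReadOnlyAccess policy from Step 1 does not cover uploads).

```python
def upload_document(s3_client, local_path, bucket_name, document_path):
    """Upload a local file to S3 and return the reference Textract expects."""
    # upload_file(Filename, Bucket, Key) is a standard boto3 S3 client method
    s3_client.upload_file(local_path, bucket_name, document_path)
    # Same shape as the Document argument passed to analyze_document
    return {'S3Object': {'Bucket': bucket_name, 'Name': document_path}}


# Usage (needs boto3 and valid credentials; all names are placeholders):
#   import boto3
#   s3 = boto3.client('s3', region_name='your-region')
#   document = upload_document(s3, 'statement.pdf',
#                              'your-bucket-name', 'your-document-path')
```

Returning the S3Object dictionary keeps the bucket name and document path together, so they can be handed to Textract without retyping them.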
Step 3: Install Required Libraries
- Install AWS SDK:
pip install boto3
Step 4: Writing the Code
- Import Libraries: boto3 is the AWS SDK for Python, used to interact with AWS services.
import boto3
- Initialize the Textract Client: The Textract client can be configured using the code below. Make sure to use the right region, as different regions have different rate limits for the analyze_document endpoint that is used to extract the tables.
textract = boto3.client('textract', region_name='your-region')
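Because of those rate limits, a call may occasionally be rejected as throttled. boto3 already retries some errors on its own; the sketch below shows one possible extra layer, a generic exponential backoff wrapper. The helper name call_with_backoff and its parameters are hypothetical, and it assumes the raised exception carries a botocore-style response dictionary with an error code.

```python
import random
import time

# Error codes that Textract uses to signal throttling
THROTTLE_CODES = {'ThrottlingException', 'ProvisionedThroughputExceededException'}


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff when it is throttled."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # botocore's ClientError exposes the service error in exc.response
            response = getattr(exc, 'response', None)
            code = response.get('Error', {}).get('Code') if isinstance(response, dict) else None
            # Re-raise anything that is not throttling, or the final failure
            if code not in THROTTLE_CODES or attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay, plus a little random jitter
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))


# Usage with the client created above (placeholders as in the rest of the guide):
#   response = call_with_backoff(lambda: textract.analyze_document(
#       Document={'S3Object': {'Bucket': 'your-bucket-name', 'Name': 'your-document-path'}},
#       FeatureTypes=['TABLES']))
```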
- Call Textract to Analyze Document: analyze_document is the endpoint used to extract tables. The code snippet below fetches the PDF from the previously noted bucket name and document path. We also specify FeatureTypes as TABLES.
response = textract.analyze_document(
    Document={'S3Object': {'Bucket': 'your-bucket-name', 'Name': 'your-document-path'}},
    FeatureTypes=['TABLES']
)
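The synchronous analyze_document call is, at the time of writing, documented as accepting only single-page PDFs. For multi-page documents Textract provides the asynchronous start_document_analysis and get_document_analysis operations. The sketch below shows one way to use them; the function name analyze_pdf_async, the polling interval, and the simple polling loop are assumptions made for illustration, not the only way to wait for a job.

```python
import time


def analyze_pdf_async(client, bucket, key, poll_seconds=5.0, sleep=time.sleep):
    """Start an asynchronous table analysis and return all resulting blocks."""
    job = client.start_document_analysis(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': key}},
        FeatureTypes=['TABLES'],
    )
    job_id = job['JobId']

    # Poll until the job leaves the IN_PROGRESS state
    while True:
        result = client.get_document_analysis(JobId=job_id)
        if result['JobStatus'] != 'IN_PROGRESS':
            break
        sleep(poll_seconds)

    if result['JobStatus'] not in ('SUCCEEDED', 'PARTIAL_SUCCESS'):
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Results are paginated; follow NextToken until it is no longer returned
    blocks = list(result.get('Blocks', []))
    while 'NextToken' in result:
        result = client.get_document_analysis(JobId=job_id, NextToken=result['NextToken'])
        blocks.extend(result.get('Blocks', []))
    return blocks


# Usage with the client created earlier (placeholders as above):
#   blocks = analyze_pdf_async(textract, 'your-bucket-name', 'your-document-path')
```

The returned list has the same block structure as response['Blocks'] from the synchronous call, so the same processing code can be applied to it.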
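Below is a simplified, illustrative JSON response for the analyze_document API call. Note that actual Textract responses contain no ROW block type: a TABLE block lists its CELL blocks directly as children, and the text of each cell is held in child WORD blocks rather than in a Text field on the cell itself.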
{
"DocumentMetadata": {
"Pages": 1
},
"Blocks": [
{
"BlockType": "PAGE",
"Geometry": {
"BoundingBox": {
"Width": 1.0,
"Height": 1.0,
"Left": 0,
"Top": 0
},
"Polygon": [
{"X": 0, "Y": 0},
{"X": 1, "Y": 0},
{"X": 1, "Y": 1},
{"X": 0, "Y": 1}
]
},
"Id": "page1",
"Relationships": [
{
"Type": "CHILD",
"Ids": ["table1"]
}
]
},
{
"BlockType": "TABLE",
"Geometry": {
"BoundingBox": {
"Width": 0.5,
"Height": 0.5,
"Left": 0.25,
"Top": 0.25
},
"Polygon": [
{"X": 0.25, "Y": 0.25},
{"X": 0.75, "Y": 0.25},
{"X": 0.75, "Y": 0.75},
{"X": 0.25, "Y": 0.75}
]
},
"Id": "table1",
"Relationships": [
{
"Type": "CHILD",
"Ids": ["row1", "row2"]
}
]
},
{
"BlockType": "ROW",
"Geometry": {
"BoundingBox": {
"Width": 0.5,
"Height": 0.1,
"Left": 0.25,
"Top": 0.25
},
"Polygon": [
{"X": 0.25, "Y": 0.25},
{"X": 0.75, "Y": 0.25},
{"X": 0.75, "Y": 0.35},
{"X": 0.25, "Y": 0.35}
]
},
"Id": "row1",
"Relationships": [
{
"Type": "CHILD",
"Ids": ["cell1", "cell2"]
}
]
},
{
"BlockType": "CELL",
"Geometry": {
"BoundingBox": {
"Width": 0.25,
"Height": 0.1,
"Left": 0.25,
"Top": 0.25
},
"Polygon": [
{"X": 0.25, "Y": 0.25},
{"X": 0.5, "Y": 0.25},
{"X": 0.5, "Y": 0.35},
{"X": 0.25, "Y": 0.35}
]
},
"Id": "cell1",
"Text": "Header 1",
"RowIndex": 1,
"ColumnIndex": 1
},
{
"BlockType": "CELL",
"Geometry": {
"BoundingBox": {
"Width": 0.25,
"Height": 0.1,
"Left": 0.5,
"Top": 0.25
},
"Polygon": [
{"X": 0.5, "Y": 0.25},
{"X": 0.75, "Y": 0.25},
{"X": 0.75, "Y": 0.35},
{"X": 0.5, "Y": 0.35}
]
},
"Id": "cell2",
"Text": "Header 2",
"RowIndex": 1,
"ColumnIndex": 2
},
{
"BlockType": "ROW",
"Geometry": {
"BoundingBox": {
"Width": 0.5,
"Height": 0.1,
"Left": 0.25,
"Top": 0.35
},
"Polygon": [
{"X": 0.25, "Y": 0.35},
{"X": 0.75, "Y": 0.35},
{"X": 0.75, "Y": 0.45},
{"X": 0.25, "Y": 0.45}
]
},
"Id": "row2",
"Relationships": [
{
"Type": "CHILD",
"Ids": ["cell3", "cell4"]
}
]
},
{
"BlockType": "CELL",
"Geometry": {
"BoundingBox": {
"Width": 0.25,
"Height": 0.1,
"Left": 0.25,
"Top": 0.35
},
"Polygon": [
{"X": 0.25, "Y": 0.35},
{"X": 0.5, "Y": 0.35},
{"X": 0.5, "Y": 0.45},
{"X": 0.25, "Y": 0.45}
]
},
"Id": "cell3",
"Text": "Data 1",
"RowIndex": 2,
"ColumnIndex": 1
},
{
"BlockType": "CELL",
"Geometry": {
"BoundingBox": {
"Width": 0.25,
"Height": 0.1,
"Left": 0.5,
"Top": 0.35
},
"Polygon": [
{"X": 0.5, "Y": 0.35},
{"X": 0.75, "Y": 0.35},
{"X": 0.75, "Y": 0.45},
{"X": 0.5, "Y": 0.45}
]
},
"Id": "cell4",
"Text": "Data 2",
"RowIndex": 2,
"ColumnIndex": 2
}
]
}
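The Geometry values in the response are ratios of the page width and height, not pixels. To draw a box on a rendered page or crop a cell image, they need to be scaled by the page dimensions. Below is a minimal sketch of that conversion; the function name bbox_to_pixels and the 1000 x 2000 pixel page size are assumptions chosen for illustration.

```python
def bbox_to_pixels(bounding_box, page_width, page_height):
    """Convert a ratio-based BoundingBox to (left, top, right, bottom) in pixels."""
    left = bounding_box['Left'] * page_width
    top = bounding_box['Top'] * page_height
    right = left + bounding_box['Width'] * page_width
    bottom = top + bounding_box['Height'] * page_height
    # Round to whole pixels for use with image libraries
    return tuple(round(value) for value in (left, top, right, bottom))


# Bounding box of the first cell in the sample response,
# placed on a hypothetical 1000 x 2000 pixel page image
cell1_box = {'Width': 0.25, 'Height': 0.1, 'Left': 0.25, 'Top': 0.25}
print(bbox_to_pixels(cell1_box, 1000, 2000))  # (250, 500, 500, 700)
```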
- Extract and Print Table Data and Coordinates: The code snippet below reads the text and bounding box of each table cell from the response and prints them.
# Index blocks by Id so child lookups do not rescan the whole list
blocks_by_id = {b['Id']: b for b in response['Blocks']}

def children(block):
    # Return the blocks referenced by a block's CHILD relationships
    return [blocks_by_id[i] for rel in block.get('Relationships', [])
            if rel['Type'] == 'CHILD' for i in rel['Ids']]

for block in response['Blocks']:
    if block['BlockType'] != 'TABLE':
        continue
    print(f"Table detected: {block['Id']}")
    for cell in children(block):
        if cell['BlockType'] != 'CELL':
            continue
        # Textract CELL blocks have no Text field; their text comes from child WORD blocks
        text = ' '.join(w['Text'] for w in children(cell) if w['BlockType'] == 'WORD')
        box = cell['Geometry']['BoundingBox']
        print(f"Cell {cell['RowIndex']}-{cell['ColumnIndex']}:")
        print(f"  Text: {text}")
        print("  Coordinates:")
        print(f"    Top: {box['Top']}")
        print(f"    Left: {box['Left']}")
        print(f"    Width: {box['Width']}")
        print(f"    Height: {box['Height']}")
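Printing cells one by one is useful for inspection, but downstream code usually wants the table as rows and columns. The sketch below rebuilds a table as a list of rows from the RowIndex and ColumnIndex of each cell. The function name table_to_rows is hypothetical, cell text is taken from child WORD blocks, and merged cells (row or column spans) are deliberately ignored to keep the example short.

```python
from collections import defaultdict


def table_to_rows(blocks, table_id):
    """Return the table with the given Id as a list of rows of cell text."""
    by_id = {b['Id']: b for b in blocks}

    def children(block):
        # Blocks referenced through CHILD relationships, skipping unknown Ids
        return [by_id[i] for rel in block.get('Relationships', [])
                if rel['Type'] == 'CHILD' for i in rel['Ids'] if i in by_id]

    grid = defaultdict(dict)  # grid[row_index][column_index] = text
    for cell in children(by_id[table_id]):
        if cell['BlockType'] != 'CELL':
            continue
        words = [w['Text'] for w in children(cell) if w['BlockType'] == 'WORD']
        grid[cell['RowIndex']][cell['ColumnIndex']] = ' '.join(words)

    if not grid:
        return []
    # Textract indices start at 1; missing cells become empty strings
    n_cols = max(col for row in grid.values() for col in row)
    return [[grid[r].get(c, '') for c in range(1, n_cols + 1)] for r in sorted(grid)]


# Usage with the response obtained earlier:
#   for block in response['Blocks']:
#       if block['BlockType'] == 'TABLE':
#           print(table_to_rows(response['Blocks'], block['Id']))
```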
Step 5: Running Your Code
- Run your script in your preferred Python environment.
- Ensure your AWS credentials (access key ID and secret access key) are configured correctly in your environment.
Sample Output
Table detected: TableId
Cell 1-1:
  Text: Example Text
  Coordinates:
    Top: 0.123
    Left: 0.456
    Width: 0.789
    Height: 0.012
Conclusion
This short Python script extracts table data from documents using AWS Textract. It returns both the text and the bounding box coordinates of each cell, which is useful for any application that depends on where content sits on the page. Whether you’re aiming to automate data entry, enhance document management systems, or support deeper data analysis, it provides the core extraction step to build on.