Serverless PDF Processing with AWS Lambda and Textract

Olga Shabalina


Learn how to use Amazon Textract, S3, and Lambda for event-driven, serverless document processing (scanned PDFs, images, etc.).

Amazon Textract receives a scanned PDF document and extracts text from it. Generated by DALL-E.

Serverless computing has transformed the way we build applications by eliminating the need to manage servers. In data engineering, this flexibility is especially useful for document processing, where workloads can be unpredictable because files can arrive at any time. While it’s relatively straightforward to process flat and structured files, this isn’t always the case with PDFs, particularly if they are created from scanned documents.

Overview

Amazon Textract is a powerful service that automates the extraction of text and data from documents like PDFs and images. You can read more about it in the official AWS documentation. What’s important for our use case, though, is that it’s serverless, fully managed, and does exactly what we need, when we need it. Plus, it’s far more cost-effective than training or hosting an AI model ourselves.

When combined with AWS Lambda and S3, Textract can be triggered automatically whenever a document is uploaded, enabling real-time processing without the hassle of managing infrastructure. In this blog, I’ll demonstrate configuration options using a CloudFormation template and Python code, allowing you to recreate a basic version and then customise it for your own project.

Synchronous Implementation

The first and easiest option to implement is to use a single Lambda for everything — reading the file, sending it to Textract, waiting for the response, and processing the results. Here’s an overview:

Architecture diagram: a file dropped into the S3 bucket sends an S3 notification that triggers a Lambda function. The Lambda invokes Textract via its API and, once text extraction is done, processes the payload and writes the text to a file in S3 (S3 to Lambda to Textract and back to S3).
  1. Upload: Users upload documents, such as PDFs or scanned images, to the incoming/ folder of an Amazon S3 bucket. S3 acts as a secure and scalable storage service that can handle large volumes of data and traffic.
  2. Lambda Trigger via S3 Notification: When a document is uploaded to the S3 bucket under the incoming/ prefix, it triggers an AWS Lambda function via an S3 notification.
  3. Textract Processing: The triggered Lambda function then calls the Amazon Textract API, which processes the document and returns the extracted data.
  4. Lambda: The same Lambda processes the JSON response and writes the extracted text to a flat file in S3 under the processed/ prefix.
  5. Notification/Logging: Optionally, we can log processing details to Amazon CloudWatch, which helps with monitoring the application’s performance and with debugging.

Configuration and setup

The setup is quite straightforward. You need to configure the Lambda function and its associated execution role. I recommend using parameters for resource names. Be sure to add a policy to your Lambda role that allows the textract:DetectDocumentText operation. Amazon Textract doesn’t require a resource to be provisioned, as it’s a fully managed API-based service. As long as your Lambda has the necessary permissions to call it, you are good to go.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'V1 - S3, Lambda and associated resources.'

Parameters:
  BucketName:
    Type: String
    Description: The name of the S3 bucket.
  LambdaFunctionName:
    Type: String
    Description: The name of the Lambda function.

Resources:
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: logs
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
        - PolicyName: s3
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - s3:Get*
                  - s3:PutObject
                Resource:
                  - !Sub arn:aws:s3:::${BucketName}
                  - !Sub arn:aws:s3:::${BucketName}/*
        - PolicyName: textract
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - textract:DetectDocumentText
                Resource: '*'

  LambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref LambdaFunctionName
      Handler: index.handler
      Runtime: python3.12
      Code: ../src
      Role: !GetAtt LambdaExecutionRole.Arn
      Timeout: 10

  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/lambda/${LambdaFunction}'
      RetentionInDays: 7

The Lambda function is triggered every time a file is uploaded to the S3 bucket through an S3 notification. However, you want to avoid an infinite loop where files are continuously read and dropped into the bucket, causing the Lambda to be invoked repeatedly. To prevent this, ensure that the event is limited to a specific prefix — in this case, incoming/. Additionally, the Lambda function needs permission to be invoked by the S3 event, so make sure the correct permissions are configured.
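As an extra safety net against that loop, the handler itself can also ignore any object that isn’t under the expected prefix. A minimal sketch, using a hypothetical `should_process` helper (not part of any AWS SDK):

import os

def should_process(key, prefix='incoming/'):
    """Return True only for real objects under the expected input prefix.

    Even with the notification filter in place, this guards against
    misconfiguration: writes to processed/ can never re-trigger processing.
    """
    return key.startswith(prefix) and not key.endswith('/')

Inside the handler loop, you would simply `continue` when `should_process(key)` is False.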

  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref BucketName
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: incoming/
            Function: !GetAtt LambdaFunction.Arn

  S3InvokeLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref LambdaFunction
      Principal: s3.amazonaws.com
      SourceArn: !Sub arn:aws:s3:::${BucketName}

When it comes to the code, here’s an example of how you can structure your Lambda function in Python.

import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    s3 = boto3.client('s3')
    textract = boto3.client('textract')

    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        try:
            # Call Textract
            response = textract.detect_document_text(
                Document={
                    'S3Object': {
                        'Bucket': bucket,
                        'Name': key
                    }
                }
            )

            # Extract detected text
            detected_text = []
            for item in response.get('Blocks', []):
                if item['BlockType'] == 'LINE':
                    detected_text.append(item['Text'])

            # Join detected text into a single string
            text_output = '\n'.join(detected_text)

            # Create a new key for the output text file
            file_name = key.split('/')[1].split('.')[0]
            output_key = f'processed/{file_name}.txt'

            # Write the detected text to the S3 bucket
            s3.put_object(
                Bucket=bucket,
                Key=output_key,
                Body=text_output
            )

            logger.info(f'Detected text is written to: {output_key}')

        except Exception as e:
            logger.error(f'Error processing file {key} '
                         f'from bucket {bucket}: {str(e)}')
            continue

    return {
        'statusCode': 200,
        'body': json.dumps('Textract processing complete!')
    }

This architecture is simple and well-suited for small workloads with single-page files. However, it’s important to keep in mind that a Lambda function has a 15-minute time limit. Additionally, Textract has limitations for synchronous operations.

For example:

  • JPEG, PNG, PDF, and TIFF files are limited to 10 MB.
  • PDF and TIFF files are restricted to a maximum of 1 page.

For more details, refer to the quotas in Amazon Textract. These limitations lead us to the second option, which extends both the Lambda execution time and Textract capabilities.
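Given these quotas, a front-door Lambda could decide up front which Textract API to call. Here is a minimal, hypothetical routing helper (the names `needs_async` and `SYNC_MAX_BYTES` are mine, not part of any AWS SDK). It conservatively routes every PDF/TIFF to the asynchronous API, since the page count isn’t known without opening the file:

# Assumption: the object's size is available from the S3 event record
# (record['s3']['object']['size']), so no extra API call is needed.
SYNC_MAX_BYTES = 10 * 1024 * 1024  # 10 MB quota for synchronous calls

def needs_async(key, size_bytes):
    """Route multi-page formats and oversized files to the async API."""
    multi_page = key.lower().endswith(('.pdf', '.tif', '.tiff'))
    return multi_page or size_bytes > SYNC_MAX_BYTES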

Asynchronous Implementation

For asynchronous operations, JPEG and PNG files still have a 10 MB limit, but PDF and TIFF files benefit from significantly higher quotas: up to 500 MB and a maximum of 3,000 pages, a huge improvement compared to synchronous operations.

Architecture diagram: a file dropped into the S3 bucket sends an S3 notification that triggers a Lambda function, which starts a Textract job. Once text extraction is done, Textract publishes a status update to SNS, which triggers a second Lambda that retrieves the payload and writes it to a flat file in S3.
  1. Upload: Users upload documents, such as PDFs or scanned images, to an Amazon S3 bucket in the incoming folder.
  2. Lambda Trigger via S3 Notification: When a document is uploaded to the S3 bucket, it triggers an AWS Lambda function via an S3 notification.
  3. Textract Processing: The triggered Lambda function calls AWS Textract, which processes the document.
  4. Lambda Trigger via SNS: Once Textract completes the document processing, it sends a message to AWS SNS, which triggers another Lambda function.
  5. Post-Processing: The second Lambda function can further process the extracted data by formatting it into a structured format (e.g., JSON, CSV) and storing it in an S3 bucket or a database like Amazon RDS or DynamoDB for easy retrieval and analysis.
  6. Notification/Logging: Optionally, processing details can be logged to Amazon CloudWatch to monitor the application’s performance and assist with debugging.

Configuration and setup

Let’s begin by defining parameters and setting up the S3 bucket with notifications, similar to the previous solution. This will include configuring the S3 bucket to trigger Lambda functions when files are uploaded, using S3 event notifications.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'V2 - S3, Lambdas, SNS and associated resources.'

Parameters:
  BucketName:
    Type: String
    Description: The name of the S3 bucket.
  DocProcessingLambdaFunctionName:
    Type: String
    Description: The name of the Lambda function.
  PostProcessingLambdaFunctionName:
    Type: String
    Description: The name of the Lambda function for post-processing.
  TextractSNSTriggerRoleName:
    Type: String
    Description: The name of the IAM role for Textract SNS trigger.

Resources:
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref BucketName
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: incoming/
            Function: !GetAtt DocProcessingLambdaFunction.Arn

Now, we define the first Lambda function and its associated resources. This Lambda will send the file to Textract for processing without waiting for a callback. The key difference here is that the Lambda execution role no longer requires the textract:DetectDocumentText permission. Instead, it will need the permission to perform textract:StartDocumentTextDetection.

  DocProcessingLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: logs
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
        - PolicyName: s3
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - s3:Get*
                Resource:
                  - !Sub arn:aws:s3:::${BucketName}
                  - !Sub arn:aws:s3:::${BucketName}/*
        - PolicyName: textract
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - textract:StartDocumentTextDetection
                Resource: '*'

  DocProcessingLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref DocProcessingLambdaFunctionName
      Handler: index.handler
      Runtime: python3.12
      Code: ../src
      Role: !GetAtt DocProcessingLambdaExecutionRole.Arn
      Timeout: 10
      Environment:
        Variables:
          TEXTRACT_NOTIFICATION_TOPIC: !Ref TextractNotificationTopic
          TEXTRACT_ROLE_ARN: !GetAtt TextractSNSTriggerRole.Arn

  S3InvokeDocLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref DocProcessingLambdaFunction
      Principal: s3.amazonaws.com
      SourceArn: !Sub arn:aws:s3:::${BucketName}

  DocProcessingLambdaLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/lambda/${DocProcessingLambdaFunctionName}'
      RetentionInDays: 7

Once the job is sent to Textract, the Lambda function is no longer responsible for managing it. This means Textract will need its own role to send a notification to SNS when the job is completed. The role and SNS topic ARN will be passed as environment variables to the Lambda function, allowing it to pass them to Textract along with the job.

  TextractNotificationTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: Textract Notification Topic

  TextractSNSTriggerRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Ref TextractSNSTriggerRoleName
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: textract.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: TextractSNSPublishPolicy
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - sns:Publish
                Resource: !Ref TextractNotificationTopic

In the code, the start_document_text_detection function initiates a Textract job to process the document stored in our S3 bucket. The DocumentLocation section specifies the S3 bucket and file to be analyzed, while the NotificationChannel defines the SNS topic ARN and the IAM role that Textract will use to send notifications. These values are coming from the environment variables that we passed through the CloudFormation template earlier.

import boto3
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

textract = boto3.client('textract')

def handler(event, context):
    topic_arn = os.environ['TEXTRACT_NOTIFICATION_TOPIC']
    textract_role = os.environ['TEXTRACT_ROLE_ARN']

    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        try:
            # Start Textract asynchronous processing, using env vars
            response = textract.start_document_text_detection(
                DocumentLocation={
                    'S3Object': {
                        'Bucket': bucket,
                        'Name': key
                    }
                },
                NotificationChannel={
                    'SNSTopicArn': topic_arn,
                    'RoleArn': textract_role
                }
            )

            logger.info(f"File {key} is sent to Textract.")

        except Exception as e:
            logger.error(f"Error processing file {key} "
                         f"from bucket {bucket}: {str(e)}")
            continue

    return {
        'statusCode': 200,
        'body': 'Textract processing initiation is complete!'
    }

The second Lambda function will need the textract:GetDocumentTextDetection permission to retrieve the results from Textract once it's invoked by the SNS topic. This allows the Lambda to access the output of the Textract job and process the extracted text accordingly.

  PostProcessingLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: logs
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
        - PolicyName: s3
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - s3:PutObject
                Resource:
                  - !Sub arn:aws:s3:::${BucketName}
                  - !Sub arn:aws:s3:::${BucketName}/*
        - PolicyName: textract
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - textract:GetDocumentTextDetection
                Resource: '*'
        - PolicyName: sns
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - sns:Subscribe
                Resource: !Ref TextractNotificationTopic

  PostProcessingLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref PostProcessingLambdaFunctionName
      Handler: index.handler
      Runtime: python3.12
      Code: ../src
      Role: !GetAtt PostProcessingLambdaExecutionRole.Arn
      Timeout: 10

  PostProcessingLambdaLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/lambda/${PostProcessingLambdaFunctionName}'
      RetentionInDays: 7

  SNSInvokePostLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref PostProcessingLambdaFunction
      Principal: sns.amazonaws.com
      SourceArn: !Ref TextractNotificationTopic

  PostProcessingLambdaSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      Protocol: lambda
      TopicArn: !Ref TextractNotificationTopic
      Endpoint: !GetAtt PostProcessingLambdaFunction.Arn

This Lambda function is also responsible for processing the results. You can implement more complex transformations depending on your specific use case, but here’s an example of appending the extracted text and saving it to a text file in an S3 bucket.

import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

textract = boto3.client('textract')
s3 = boto3.client('s3')

def handler(event, context):
    for record in event['Records']:
        job_id = None
        try:
            # The SNS message with job information
            sns_message = json.loads(record['Sns']['Message'])

            # Accessing the keys for getting Textract results
            job_id = sns_message['JobId']
            status = sns_message['Status']

            # Accessing the keys for destination
            bucket = sns_message['DocumentLocation']['S3Bucket']
            s3_object_key = sns_message['DocumentLocation']['S3ObjectName']
            file_name = s3_object_key.split('/')[1].split('.')[0]

            if status == 'SUCCEEDED':
                # Proceed to get the document text detection results
                response = textract.get_document_text_detection(JobId=job_id)

                # Collect extracted text
                detected_text = []
                for item in response.get('Blocks', []):
                    if item['BlockType'] == 'LINE':
                        detected_text.append(item['Text'])

                # Save collected text to S3
                output_key = f"processed/{file_name}.txt"
                s3.put_object(
                    Bucket=bucket,
                    Key=output_key,
                    Body="\n".join(detected_text)
                )
                logger.info(f"Detected text is written to s3://{bucket}/{output_key}")

            elif status == 'FAILED':
                logger.error(f"Job {job_id} failed.")

        except KeyError as e:
            logger.error(f"KeyError: missing expected key {str(e)} "
                         f"in the message: {sns_message}")
        except Exception as e:
            logger.error(f"Error processing job {job_id}: {str(e)}")

    return {
        'statusCode': 200,
        'body': 'Notification processed successfully!'
    }
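One caveat worth noting: get_document_text_detection returns results in pages of up to 1,000 blocks, so a single call can silently truncate the output for large documents. Textract signals additional pages via a NextToken field, which you follow in a loop. Here is a sketch of a paginating helper, assuming a boto3 Textract client (or any object with the same method) is passed in:

def get_all_lines(textract_client, job_id):
    """Collect LINE blocks across all result pages of an async
    Textract job, following NextToken until results are exhausted."""
    lines = []
    kwargs = {'JobId': job_id}
    while True:
        response = textract_client.get_document_text_detection(**kwargs)
        for block in response.get('Blocks', []):
            if block['BlockType'] == 'LINE':
                lines.append(block['Text'])
        next_token = response.get('NextToken')
        if not next_token:
            return lines
        kwargs['NextToken'] = next_token

The handler above could call this in place of the single get_document_text_detection call when documents may exceed one page of results.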

This asynchronous architecture is a robust solution for automating document processing tasks, offering greater flexibility in handling larger documents, particularly PDFs and TIFF files. It ensures scalability while overcoming the size and page limitations of synchronous processing.

Summary

In this article, we explored how to build a serverless document processing solution using AWS Lambda and Textract, offering two distinct approaches depending on your workload. The first approach uses a simple synchronous setup, ideal for small workloads with single-page documents. It’s easy to implement and manage, making it perfect for scenarios where the document size is small, and quick processing is needed.

However, for larger workloads — particularly when dealing with PDFs and TIFF files that may contain multiple pages or large file sizes — the second approach, an asynchronous architecture, is essential. This more advanced setup offers greater flexibility, allowing for the processing of documents up to 500 MB and 3,000 pages. It uses a two-step Lambda process along with S3 and SNS to ensure scalability without running into the limitations of synchronous execution.

By choosing the appropriate architecture for your needs, you can balance ease of setup with the ability to handle larger, more complex document processing tasks.
