How to extract text from a document using Amazon Textract, S3 and Lambda API

Vijayaraghavan Vashudevan
4 min read · Jul 9, 2023


Extraction of text from a document using Amazon Textract

In this use case, we will see how to extract text from a document using Amazon Textract, AWS Lambda and an S3 bucket.

🎯 Example demo of Analyze Document API

📍 Go to the AWS console, search for Textract and click Analyze Document. Use Choose document to upload an image file that contains a table. Under data output, select Forms, Tables and Queries, enter Full Name as the query and click Add query. Then click Apply Configuration and review the results shown.

Amazon Textract

📍 Review the Results tab, as shown below

Results

📍 Review the Forms tab, as shown below

Forms tab

📍 Review the Tables tab, as shown below

Tables tab
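The console steps above map onto a single analyze_document call. Here is a minimal sketch of that request, assuming the same three feature types and the Full Name query. The helper names (build_analyze_request, query_answers, analyze) and the bucket and key values are made up for illustration and are not part of the original demo.

```python
def build_analyze_request(bucket, key, query_text="Full Name"):
    """Build keyword arguments for Textract's analyze_document call."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Same options as ticked in the console: forms, tables and queries
        "FeatureTypes": ["FORMS", "TABLES", "QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": query_text}]},
    }


def query_answers(response):
    """Map each query text to the list of answer strings in a response."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        found = []
        for rel in block.get("Relationships", []):
            # QUERY blocks point at their QUERY_RESULT blocks via ANSWER links
            if rel["Type"] == "ANSWER":
                found += [blocks[i].get("Text", "") for i in rel["Ids"] if i in blocks]
        answers[block["Query"]["Text"]] = found
    return answers


def analyze(bucket, key):
    """Call Textract; needs boto3, AWS credentials and an existing S3 object."""
    import boto3  # imported lazily so the helpers above work without AWS access

    client = boto3.client("textract")
    response = client.analyze_document(**build_analyze_request(bucket, key))
    return query_answers(response)
```

Calling analyze("my-bucket", "input/form.png") (both names are placeholders) should return something like a dictionary keyed by "Full Name", provided Textract finds an answer in the image.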

🎯 Adding layers to the lambda function

🎄 Go to the AWS console, search for Lambda and select Layers. Choose Create layer and add the inputs as shown below

Adding textract dependencies
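The layer inputs are only shown in the screenshot, so here is a rough sketch of how such a layer zip could be prepared. It assumes the dependency is the trp module from the amazon-textract-response-parser package, installed locally with something like pip install amazon-textract-response-parser -t packages/. The build_layer_zip helper and the packages/ folder name are hypothetical.

```python
import os
import zipfile


def build_layer_zip(packages_dir, zip_path):
    """Zip packages_dir so every file sits under a top-level python/ folder."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(packages_dir):
            for name in files:
                full_path = os.path.join(root, name)
                relative = os.path.relpath(full_path, packages_dir)
                # Python runtimes on Lambda import layer code from /opt/python
                arcname = "python/" + relative.replace(os.sep, "/")
                archive.write(full_path, arcname)
    return zip_path
```

The resulting zip (for example build_layer_zip("packages", "layer.zip")) is what gets uploaded on the Create layer page.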

🎯 Adding Lambda code function and creating a trigger

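🌊 In the same Lambda console, click on Functions and create a function with the code below, choosing Python 3.8 as the runtime and creating an IAM role (the role needs permission to call Textract and to read and write the S3 bucket)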

Lambda function

🌊 Create a function named labFunction and add the Python code below

import json
import logging
from urllib.parse import unquote_plus

import boto3
from trp import Document  # expected to come from the Lambda layer added earlier

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create the clients once so warm invocations can reuse them
s3 = boto3.client('s3')
textract = boto3.client('textract')

output_key = "output/textract_response.json"


def lambda_handler(event, context):
    logger.info(event)
    for record in event['Records']:
        # Bucket and key of the uploaded file; S3 URL-encodes keys in events
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])

        try:
            response = textract.analyze_document(
                Document={
                    'S3Object': {
                        'Bucket': bucket,
                        'Name': key
                    }
                },
                FeatureTypes=['<Enter_Your_Feature_Type>']
            )

            # Parse the raw response and print every form field, page by page
            doc = Document(response)
            for page in doc.pages:
                print("Fields:")
                for field in page.form.fields:
                    print("Key: {}, Value: {}".format(field.key, field.value))

            # Save the full Textract response as JSON in the output folder
            s3.put_object(
                Bucket=bucket,
                Key=output_key,
                Body=json.dumps(response, indent=4)
            )
            return {"Status": "Success"}
        except Exception as error:
            logger.exception("Textract processing failed")
            return {"Status": "Failed", "Reason": json.dumps(error, default=str)}

🌊 In the above code, replace <Enter_Your_Feature_Type> with the feature type, which is FORMS as per the documentation
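To get a feel for what the FORMS feature type returns, here is a small sketch that pulls key-value pairs straight out of the raw response without the trp helper. It assumes the usual layout of a FORMS response, where KEY_VALUE_SET blocks are linked to each other and to their WORD blocks through Relationships. The form_fields and _text_of names are made up for this example.

```python
def _text_of(block, blocks):
    """Join the text of all WORD children of a block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks.get(child_id, {})
                if child.get("BlockType") == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def form_fields(response):
    """Return {key text: value text} for every KEY block in the response."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    fields = {}
    for block in blocks.values():
        is_key = (block.get("BlockType") == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            # A KEY block points at its VALUE block through a VALUE link
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _text_of(blocks[i], blocks) for i in rel["Ids"] if i in blocks
                )
        fields[_text_of(block, blocks)] = value_text
    return fields
```

This is roughly the same information the handler prints through page.form.fields, only as a plain dictionary.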

🌊 Deploy the code once the necessary changes are made.

labFunction Deployment

🎯 Configuring timeout and adding layer to your lambda function

📌 Go to Configuration, edit the General configuration and set the timeout to 1 min

Adding timeout

📌 Scroll down below the code and add the custom layer, as shown below

Custom layer
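The same two settings can also be applied from code. This is only a sketch, assuming a boto3 Lambda client is passed in and that you already know the layer version ARN. The configure_function helper and the ARN placeholder are hypothetical.

```python
def configure_function(lambda_client, function_name, layer_version_arn,
                       timeout_seconds=60):
    """Set the timeout (1 min by default) and attach one layer to a function."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_seconds,
        # As far as I know, this list replaces any layers already attached
        Layers=[layer_version_arn],
    )
```

A call could look like configure_function(boto3.client("lambda"), "labFunction", "<your-layer-version-arn>").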

🎯 Adding trigger to your Lambda Function

🌍 Click on Add trigger, then select the S3 bucket and the input folder that were created.

trigger configuration

🌍 We will get the below notification once the trigger is configured successfully

Adding trigger
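Once the trigger is in place, every upload reaches the function as an S3 event. The sketch below shows how the bucket and key can be read from such an event, and why limiting the trigger to the input folder matters: the function writes its result to the same bucket, so an unfiltered trigger could fire again on the output file. The objects_to_process helper and the "input/" prefix are assumptions for illustration.

```python
from urllib.parse import unquote_plus

INPUT_PREFIX = "input/"  # assumed name of the input folder


def objects_to_process(event, prefix=INPUT_PREFIX):
    """Return (bucket, key) pairs for uploaded objects under the input prefix."""
    targets = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded, e.g. spaces become '+'
        key = unquote_plus(record["s3"]["object"]["key"])
        # Skip the folder placeholder itself and anything outside the prefix
        if key.startswith(prefix) and not key.endswith("/"):
            targets.append((bucket, key))
    return targets
```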

📒 Hands-on Demo

🎯 With this, we will see the demo of how this textract happens, the input image which we are uploading in input s3 folder is converted to json format in the output folder of s3 bucket. Same can be viewed in cloudwatch logs as well

🌍 Instructions to clean up AWS resources to avoid billing

📌 Delete the S3 bucket created

📌 Delete the Lambda function once the trigger is removed
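A bucket has to be empty before it can be deleted, so the clean-up can be sketched as below. This assumes an unversioned bucket and a boto3 S3 client passed in by the caller. The empty_and_delete_bucket helper is hypothetical, and the action is destructive, so double-check the bucket name first.

```python
def empty_and_delete_bucket(s3_client, bucket):
    """Delete every object in the bucket, then the bucket; return object count."""
    paginator = s3_client.get_paginator("list_objects_v2")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each page holds at most 1000 keys, which fits one delete_objects call
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    s3_client.delete_bucket(Bucket=bucket)
    return deleted
```

A call could look like empty_and_delete_bucket(boto3.client("s3"), "<your-bucket-name>").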

Thanks for your patience and for following along. Keep supporting 🙏

Clap 👏 if you liked the blog

For more exercises, please follow me below ✅

https://www.linkedin.com/in/vijayaraghavanvashudevan/

#AWS #AWSCommunityBuilder #AWSreSkill #AWSLambda #AmazonExtract #S3Bucket
