How to Extract Text from a Document Using Amazon Textract, S3, and AWS Lambda
In this use case, we will see how to extract text from a document using Amazon Textract, AWS Lambda, and an S3 bucket.
Example demo of the Analyze Document API
Go to the AWS console, search for Textract, and click Analyze Document. Use Choose document to upload an image file that contains table information. For the data output, select Forms, Tables, and Queries, enter Full Name as the query, and click Add query. Click Apply Configuration and review the results shown.
Review the Results tab.
Review the Forms tab.
Review the Tables tab.
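The same console settings can also be expressed as an API request. The sketch below is a minimal illustration, assuming the image is sent as raw bytes; the helper names `build_analyze_request` and `analyze_local_image` are hypothetical and are not part of boto3.

```python
def build_analyze_request(image_bytes, query_text="Full Name"):
    """Build keyword arguments for Textract's analyze_document call.

    Mirrors the console demo: forms, tables, and one query.
    """
    return {
        "Document": {"Bytes": image_bytes},
        "FeatureTypes": ["FORMS", "TABLES", "QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": query_text}]},
    }


def analyze_local_image(path):
    """Send a local image to Textract (needs AWS credentials and a region)."""
    import boto3  # imported here so the builder above has no AWS dependency

    with open(path, "rb") as handle:
        request = build_analyze_request(handle.read())
    return boto3.client("textract").analyze_document(**request)
```

Calling `analyze_local_image` with the path of the uploaded image should return a response whose `Blocks` list holds the same information that the console shows in its tabs.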
Adding layers to the Lambda function
Go to the AWS console, search for Lambda, and select Layers. Choose Create layer and fill in the layer inputs.
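For readers who prefer scripting this step, here is a rough sketch of how a layer archive could be built and published. It assumes the layer carries the `trp` parser that the function code imports; the helper names and the file contents passed in are hypothetical. For Python runtimes, Lambda expects the packages to sit under a top-level `python/` folder inside the zip.

```python
import io
import zipfile


def build_layer_zip(files):
    """Return zip bytes with every file placed under the python/ prefix.

    `files` maps a relative path (for example "trp/__init__.py") to its bytes.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for relative_path, content in files.items():
            archive.writestr("python/" + relative_path, content)
    return buffer.getvalue()


def publish_layer(layer_name, zip_bytes, runtime="python3.8"):
    """Publish the archive as a new layer version and return its ARN."""
    import boto3  # imported here so build_layer_zip works without AWS access

    response = boto3.client("lambda").publish_layer_version(
        LayerName=layer_name,
        Content={"ZipFile": zip_bytes},
        CompatibleRuntimes=[runtime],
    )
    return response["LayerVersionArn"]
```

The returned ARN is what gets attached to the function later on.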
Adding the Lambda function code and creating a trigger
In the same Lambda console, click Functions and create a function, choosing Python 3.8 as the runtime and creating an IAM role. The role needs permission to call Textract and to read from and write to the S3 bucket.
Name the function labFunction and add the Python code below.
import json
import logging
from urllib.parse import unquote_plus

import boto3
from trp import Document  # not in the default Lambda runtime; expected from the custom layer

logger = logging.getLogger()
logger.setLevel(logging.INFO)

s3 = boto3.client('s3')
output_key = "output/textract_response.json"


def lambda_handler(event, context):
    logger.info(event)
    for record in event['Records']:
        # Object keys arrive URL-encoded in S3 events, so decode them first.
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])
        textract = boto3.client('textract')
        try:
            response = textract.analyze_document(
                Document={
                    'S3Object': {
                        'Bucket': bucket,
                        'Name': key
                    }
                },
                FeatureTypes=['<Enter_Your_Feature_Type>']
            )
            # Print every key-value pair that Textract found in the form.
            doc = Document(response)
            for page in doc.pages:
                print("Fields:")
                for field in page.form.fields:
                    print("Key: {}, Value: {}".format(field.key, field.value))
            return_result = {"Status": "Success"}
            # Save the raw Textract response as JSON in the same bucket.
            s3.put_object(
                Bucket=bucket,
                Key=output_key,
                Body=json.dumps(response, indent=4)
            )
            return return_result
        except Exception as error:
            return {"Status": "Failed", "Reason": json.dumps(error, default=str)}
In the above code, replace <Enter_Your_Feature_Type> with the feature type, which is FORMS as per the documentation.
Deploy the code once the necessary changes are made.
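To see what the handler reads from each incoming event, here is a small sketch with a trimmed-down sample event. The bucket name and object key are made up for illustration, and `objects_from_event` is a hypothetical helper. It also shows why the code calls `unquote_plus`: S3 URL-encodes object keys in its notifications.

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Return (bucket, decoded key) pairs for every record in an S3 event."""
    return [
        (record["s3"]["bucket"]["name"], unquote_plus(record["s3"]["object"]["key"]))
        for record in event.get("Records", [])
    ]


# Trimmed sample event; a real S3 event has many more fields per record.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-lab-bucket"},
                "object": {"key": "input/scanned+form%281%29.png"},
            }
        }
    ]
}

print(objects_from_event(sample_event))
# prints [('my-lab-bucket', 'input/scanned form(1).png')]
```

Without the decoding step, a file name containing spaces or brackets would not match the real object key, and Textract would fail to find the document.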
Configuring the timeout and adding the layer to your Lambda function
Go to Configuration, edit the General configuration, and set the timeout to 1 min.
Scroll down on the Code tab and add the custom layer.
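The timeout and layer settings can also be applied with boto3. This is only a sketch: the helper names are hypothetical, and the layer ARN has to be the one from your own account.

```python
def build_function_config(function_name, layer_arn, timeout_seconds=60):
    """Build keyword arguments for Lambda's update_function_configuration."""
    return {
        "FunctionName": function_name,
        "Timeout": timeout_seconds,  # 60 seconds = the 1 min set in the console
        "Layers": [layer_arn],       # replaces any layers already attached
    }


def apply_function_config(config):
    """Apply the settings (needs AWS credentials and a region)."""
    import boto3  # imported here so the builder above has no AWS dependency

    return boto3.client("lambda").update_function_configuration(**config)
```

Note that `Layers` is the full list for the function, so any layer left out of it gets detached.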
Adding a trigger to your Lambda function
Click Add trigger, then select the S3 bucket and the input folder that you created.
A notification confirms that the trigger was configured successfully.
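Behind the scenes, the trigger is a notification configuration on the bucket. A possible boto3 version is sketched below, assuming the input folder is named `input/`; the helper names are hypothetical.

```python
def build_notification(function_arn, prefix="input/"):
    """Build an S3 notification that invokes the function for new objects."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                # Only objects under the input folder fire the trigger, so the
                # JSON written to output/ does not invoke the function again.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ]
    }


def attach_trigger(bucket, function_arn, prefix="input/"):
    """Replace the bucket's notification configuration with this trigger."""
    import boto3  # imported here so the builder above has no AWS dependency

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification(function_arn, prefix),
    )
```

When scripting this instead of using the console, S3 must also be granted permission to invoke the function (the Lambda `add_permission` call), which the console sets up for us.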
Hands-on Demo
With this setup, we can see Textract in action: the input image uploaded to the input folder of the S3 bucket is converted to JSON format in the output folder of the same bucket. The same results can be viewed in CloudWatch Logs as well.
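Once the JSON lands in the output folder, we can read it back and pull out the detected text. The sketch below assumes the output key used in the Lambda code (`output/textract_response.json`); the helper names and the sample data are made up for illustration.

```python
import json


def extract_lines(response):
    """Return the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def load_saved_response(bucket, key="output/textract_response.json"):
    """Download the JSON that the Lambda function wrote to S3."""
    import boto3  # imported here so extract_lines works without AWS access

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)


# Tiny hand-made response for illustration; real responses hold many blocks.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Full Name: Jane Doe"},
        {"BlockType": "WORD", "Text": "Full"},
    ]
}

print(extract_lines(sample_response))
# prints ['Full Name: Jane Doe']
```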
Instructions to clean up AWS resources to avoid billing
Delete the S3 bucket that was created.
Delete the Lambda function that was created, once its trigger is removed.
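The clean-up can be scripted as well. A bucket has to be empty before it can be deleted, so this sketch empties it first. The helper names are hypothetical, and it assumes the bucket is not versioned (old object versions would need separate handling).

```python
def empty_bucket(s3_client, bucket):
    """Delete every object in the bucket, one listing page at a time.

    Returns the number of objects deleted.
    """
    deleted = 0
    kwargs = {"Bucket": bucket}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        keys = [{"Key": item["Key"]} for item in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
        if not page.get("IsTruncated"):
            return deleted
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def clean_up(bucket, function_name):
    """Empty and delete the bucket, then delete the Lambda function."""
    import boto3  # imported here so empty_bucket can be tried with a stub

    s3_client = boto3.client("s3")
    empty_bucket(s3_client, bucket)
    s3_client.delete_bucket(Bucket=bucket)
    boto3.client("lambda").delete_function(FunctionName=function_name)
```

Both deletions are permanent, so double-check the bucket and function names before running `clean_up`.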
Thanks for your patience and for following along. Keep supporting!
Clap if you liked the blog.
For more exercises, please follow me below!
https://www.linkedin.com/in/vijayaraghavanvashudevan/
#AWS #AWSCommunityBuilder #AWSreSkill #AWSLambda #AmazonExtract #S3Bucket