How to extract text from a document using Amazon Textract, S3 and AWS Lambda

Jul 9, 2023

Extraction of text from a document using Amazon Textract

💠 In this use case, we will see how to extract text from a document using Amazon Textract, AWS Lambda and an S3 bucket.

🎯 Example demo of the Analyze Document API

📍 Go to the AWS console, search for Textract and click on Analyze Document. Upload an image file that contains table information through Choose document. Under data output, select Forms, Tables and Queries, enter Full Name as the query and click Add query. Click on Apply Configuration and review the results shown.

📍 Review the Results tab as below

📍 Review the Forms tab as below

📍 Review the Tables tab as below
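The console configuration above (Forms, Tables and a Full Name query) can also be expressed as an API request. The sketch below only builds the request arguments and reads query answers from a response, so it runs without AWS access. The helper names, bucket name and object key are hypothetical and not part of the original walkthrough.

```python
def build_analyze_request(bucket, key, query_text="Full Name"):
    """Build keyword arguments for textract.analyze_document()."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Same outputs as selected in the console demo
        "FeatureTypes": ["FORMS", "TABLES", "QUERIES"],
        # QueriesConfig is required when QUERIES is requested
        "QueriesConfig": {"Queries": [{"Text": query_text}]},
    }


def query_answers(response):
    """Return the text of every QUERY_RESULT block in a Textract response."""
    return [
        block.get("Text", "")
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "QUERY_RESULT"
    ]


# Usage against the real service (assumes boto3 and valid credentials):
# import boto3
# textract = boto3.client("textract")
# response = textract.analyze_document(
#     **build_analyze_request("example-bucket", "input/form.png"))
# print(query_answers(response))
```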

🎯 Adding layers to the lambda function

🎄 Go to the AWS console, search for Lambda and select Layers. Choose Create layer and fill in the required inputs.
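The layer is presumably what supplies the trp module that the function imports (it ships in the amazon-textract-response-parser package). The original layer inputs are not shown, so the following is only a sketch. It assumes the package has already been installed into a local folder, and it zips that folder so everything sits under python/, which is where Lambda looks for Python layer content. The helper name and paths are hypothetical.

```python
import os
import zipfile


def build_layer_zip(packages_dir, zip_path):
    """Zip packages_dir so every file lands under python/ inside the archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(packages_dir):
            for name in files:
                full_path = os.path.join(root, name)
                relative = os.path.relpath(full_path, packages_dir)
                # Lambda expects Python layer content under the python/ prefix
                archive.write(full_path, os.path.join("python", relative))
    return zip_path


# Usage (hypothetical paths; zip_path must live outside packages_dir):
# pip install amazon-textract-response-parser -t build/packages
# build_layer_zip("build/packages", "build/textract-layer.zip")
```

The resulting zip file is what gets uploaded on the Create layer page.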

🎯 Adding Lambda code function and creating a trigger

🌊 In the same Lambda console, click on Functions and create a function, choosing Python 3.8 as the runtime and creating an IAM role for it. The role needs permission to call Textract and to read from and write to the S3 bucket

🌊 Name the function labFunction and add the below Python code

import json
import logging
import boto3

from trp import Document
from urllib.parse import unquote_plus

logger = logging.getLogger()
logger.setLevel(logging.INFO)

s3 = boto3.client('s3')

output_key = "output/textract_response.json"


def lambda_handler(event, context):
    
    logger.info(event)
    for record in event['Records']:
        
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])

        textract = boto3.client('textract')

        try:
            response = textract.analyze_document(   
                Document={                          
                    'S3Object': {
                        'Bucket': bucket,
                        'Name': key
                    }
                },
                FeatureTypes=['<Enter_Your_Feature_Type>']  # replace the placeholder, e.g. with 'FORMS'
            )

            doc = Document(response)  

            for page in doc.pages:
  
                print("Fields:")
                for field in page.form.fields:
                    print("Key: {}, Value: {}".format(field.key, field.value))

            return_result = {"Status": "Success"}
          
            s3.put_object(
                Bucket=bucket,
                Key=output_key,
                Body=json.dumps(response, indent=4)
            )

            return return_result
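        except Exception as error:
            # Log the full traceback to CloudWatch and return a readable reason
            logger.exception("Textract analysis failed for %s", key)
            return {"Status": "Failed", "Reason": str(error)}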

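🌊 In the above code, replace the <Enter_Your_Feature_Type> placeholder with FORMS, since the code reads the form fields (key-value pairs) of each page. As per the documentation, the other valid feature types are TABLES, QUERIES, SIGNATURES and LAYOUT

🌊 Deploy the code once the necessary changes are made.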
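One detail of the handler worth seeing in isolation is the unquote_plus call: S3 URL-encodes object keys in event notifications, so a file name with spaces or brackets arrives encoded. Here is a minimal sketch with a made-up event (the bucket and file names are hypothetical) showing how the bucket and key are recovered.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded: '+' is a space, '%28' is '(' and so on
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "input/my+scan%281%29.png"},
            }
        }
    ]
}

print(s3_objects_from_event(sample_event))
# prints [('example-bucket', 'input/my scan(1).png')]
```

Without the decoding step, Textract would be asked for a key that does not exist in the bucket.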

🎯 Configuring timeout and adding layer to your lambda function

📌 Go to Configuration, edit the General configuration and set the timeout to 1 min

📌 On the Code tab, scroll down to the Layers section and add the custom layer created earlier
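The same two settings can be applied in one call with boto3, if you prefer scripting over the console. This is a sketch only: the helper name is hypothetical, the layer ARN in the usage comment is a made-up example, and the client is passed in so the function can be tried without AWS access.

```python
def configure_function(lambda_client, function_name, layer_arn, timeout_seconds=60):
    """Set the timeout and attach one layer version to a Lambda function."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_seconds,  # seconds, so 60 equals the 1 min set above
        Layers=[layer_arn],       # note: this replaces any layers already attached
    )


# Usage against the real service (assumes boto3 and valid credentials):
# import boto3
# configure_function(
#     boto3.client("lambda"),
#     "labFunction",
#     "arn:aws:lambda:us-east-1:123456789012:layer:example-layer:1",
# )
```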

🎯 Adding trigger to your Lambda Function

🌍 Click on Add trigger, then select the S3 bucket and the input folder that were created.

🌍 We will get the below notification once the trigger is configured successfully
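For reference, the trigger corresponds to an S3 bucket notification configuration. The sketch below builds that configuration as a plain dictionary. The prefix value input/ is an assumption based on the input folder mentioned above, and the helper name and ARN are hypothetical.

```python
def build_s3_trigger_config(function_arn, prefix="input/"):
    """Build a NotificationConfiguration that invokes Lambda for new objects."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                # Only objects under the prefix fire the trigger
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ]
    }


# Usage against the real service (assumes boto3 and valid credentials):
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="example-bucket",
#     NotificationConfiguration=build_s3_trigger_config(
#         "arn:aws:lambda:us-east-1:123456789012:function:labFunction"),
# )
```

The prefix filter matters here because the function writes its JSON result back into the same bucket. Without the filter, that write would fire the trigger again. Also note that the console's Add trigger grants S3 permission to invoke the function, while a scripted setup has to add that permission separately.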

📢Hands-on Demo

🎯 With this, we can see how the extraction works: the input image uploaded to the input folder of the S3 bucket is analyzed by Textract, and the response is saved in JSON format in the output folder of the same bucket. The extracted fields can be viewed in CloudWatch Logs as well
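The saved JSON is the raw Textract response, which is a list of Blocks. As a small sketch of how to read it back, the helper below (a hypothetical name) collects the text of every LINE block. The sample response is a hand-written stand-in, not real Textract output.

```python
import json


def extract_lines(response):
    """Return the text of every LINE block in a Textract response."""
    return [
        block.get("Text", "")
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for output/textract_response.json
sample = json.loads(
    '{"Blocks": ['
    '{"BlockType": "PAGE"},'
    '{"BlockType": "LINE", "Text": "Full Name: Jane Doe"},'
    '{"BlockType": "WORD", "Text": "Full"}'
    ']}'
)

print(extract_lines(sample))
# prints ['Full Name: Jane Doe']
```

For the real file, download it from the output folder and pass the result of json.load to the same helper.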

🌍 Instructions to clean up AWS resources to avoid billing

📌 Delete the S3 bucket created

📌 Delete the Lambda function created, once the trigger is removed
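If you would rather script the bucket cleanup, here is a hedged sketch. A bucket must be empty before it can be deleted, so the helper removes the objects first. It assumes an unversioned bucket, the helper name is hypothetical, and the client is passed in so the logic can be checked without touching real resources.

```python
def empty_and_delete_bucket(s3_client, bucket):
    """Delete every object in an unversioned bucket, then the bucket itself."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent when a page has no objects
        for obj in page.get("Contents", []):
            s3_client.delete_object(Bucket=bucket, Key=obj["Key"])
            deleted += 1
    s3_client.delete_bucket(Bucket=bucket)
    return deleted


# Usage against the real service (assumes boto3 and valid credentials):
# import boto3
# empty_and_delete_bucket(boto3.client("s3"), "example-bucket")
# boto3.client("lambda").delete_function(FunctionName="labFunction")
```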

Thanks for being patient and following along. Keep supporting 🙏

Clap👏 if you liked the blog

For more exercises, please follow me below ✅!

https://www.linkedin.com/in/vijayaraghavanvashudevan/

#AWS #AWSCommunityBuilder #AWSreSkill #AWSLambda #AmazonExtract #S3Bucket