AWS Textract with Lambda Walkthrough

Suminda Niroshan
11 min read · Jun 28, 2019


AWS Textract is a document text extraction service.

“Amazon Textract is based on the same proven, highly scalable, deep-learning technology that was developed by Amazon’s computer vision scientists to analyze billions of images and videos daily. You don’t need any machine learning expertise to use it” — AWS Docs

This post walks through several use cases of the AWS Textract service using AWS Lambda with Python implementations, namely:

  • Extracting Text from a Base64 Image.
  • Extracting Text from an S3 Bucket Image.
  • Extracting Text from an S3 Bucket Document.

Prerequisites

You need an AWS account and some basic knowledge of working with AWS services. The following AWS services are used throughout this guide.

  • Lambda Service
  • Textract Service
  • Simple Notification Service
  • Simple Storage Service
  • Identity Access Management Service

You will learn

  • To use synchronous Textract methods on images supplied either as Base64 strings or as S3 objects.
  • To use asynchronous Textract methods for PDF text extraction.
  • How to add triggers to a Lambda function.
  • How to add an API Gateway and expose a Lambda function.
  • How to configure the Simple Notification Service and use it with AWS Textract.
  • How to configure the Identity Access Management service to grant access to only the necessary services.

Adding boto3

Since Lambda functions run in an AWS-hosted runtime, any newer packages that the Lambda code depends on need to be uploaded manually. Using AWS Textract from Python requires the latest “boto3” package, which as of this writing is not available in the AWS Lambda hosted environments, so it has to be downloaded and uploaded as an AWS Lambda “Layer”. Follow the steps below to do this.

1. The Python dependency manager “pip” is needed to download the “boto3” package.

2. Execute the following command in a command shell.
pip install --target ./python boto3

3. After the package is downloaded, zip the “python” folder. Alternatively, the zipped “boto3-layer” can be grabbed from here.

4. Go to AWS Lambda -> Layers and click “Create Layer”.

5. Give the layer a name, select the latest Python version and upload the zip file as below.

6. Click “Create”.

This will create a “boto3” Python package layer containing the AWS Textract SDK, which can then be attached to Lambda functions. Please note that “Compatible runtimes” should match the runtime of the Lambda function that will use this layer.
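Steps 2 and 3 above can also be scripted. The following is a minimal sketch, assuming pip is available for the current interpreter; the function names and the default archive name “boto3-layer” are illustrative. The detail that matters is that the “python” folder sits at the top level of the zip, since that is where Lambda looks for Python packages in a layer.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def install_boto3(work_dir):
    """Run `pip install --target <work_dir>/python boto3` with the current interpreter."""
    target = Path(work_dir) / "python"
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--target", str(target), "boto3"],
        check=True,
    )


def zip_layer(work_dir, archive_name="boto3-layer"):
    """Zip <work_dir>/python so that "python/" is the top-level folder of the archive.

    Returns the path of the created zip file.
    """
    work_dir = Path(work_dir).resolve()
    return shutil.make_archive(
        str(work_dir / archive_name),  # archive path without the .zip extension
        "zip",
        root_dir=str(work_dir),
        base_dir="python",
    )


# Example usage (downloads boto3, so it is left commented out):
# install_boto3(".")
# print(zip_layer("."))
```

The resulting zip file is what gets uploaded in step 5.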

1. Image Text Extraction

This section will focus on text extraction using images (JPEG/PNG).
In the first example, a Base64 encoded image will be passed directly to the AWS SDK to extract text.
In the second example, an S3 bucket triggered Lambda will automatically extract text when images are uploaded to the S3 bucket and write each result to a text file in the same bucket.

1.1 Extracting Text from a Base64 Image

1.1.1 Creating The Lambda Function

1. Go to AWS Lambda service and click “Create Function”.

2. Enter a “Function name” as below, set “Execution role” to “Create a new role from AWS policy templates” and enter a “Role name”. Note that “Runtime” is set to “Python 3.6”. This is compatible with the “boto3-layer” created previously, since both “Python 3.6” and “Python 3.7” were specified as its compatible runtimes.

3. Click “Create function”.

1.1.2 Attaching Permission Policies to Lambda

1. Once the Lambda is created, click “View the getTextFromImageRole role” in the “Execution role” section of the Lambda configuration as displayed below.

2. This will open the “getTextFromImageRole” configuration page as below. Click “Attach policy”, select the “AmazonTextractFullAccess” policy and click “Attach policy” as displayed below. This gives the Lambda function permission to access the AWS Textract service.

1.1.3 Adding Custom “boto3-layer” to Lambda

1. Click “Layers” from Lambda designer and click “Add a layer” as below to add the “boto3-layer” that was created earlier.

2. Select the “boto3-layer” in “Compatible layers” and select version 1 as below.

3. Click “Save” in the Lambda configuration.

1.1.4 Implementing Lambda Code

Go to the Lambda code editor and paste the code below.

This code expects a JSON body with an “Image” parameter whose value is a Base64 encoded image string. A sample payload which you can use is displayed below.
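A minimal sketch of such a handler is shown here, assuming the function is called through an API Gateway proxy integration (so the JSON payload arrives as a string under “body”). The helper name extract_lines and the response shape are illustrative choices rather than part of the Textract API; detect_document_text is the synchronous boto3 call for JPEG/PNG images.

```python
import base64
import json


def extract_lines(response):
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def lambda_handler(event, context):
    # boto3 comes from the "boto3-layer"; it is imported here so the helper
    # above can be exercised locally without AWS access.
    import boto3

    # Assumption: proxy integration, so the request JSON is a string in "body".
    body = json.loads(event["body"])
    image_bytes = base64.b64decode(body["Image"])

    textract = boto3.client("textract")
    # Synchronous call: takes the raw image bytes and returns the detected blocks.
    response = textract.detect_document_text(Document={"Bytes": image_bytes})

    return {
        "statusCode": 200,
        "body": json.dumps({"text": "\n".join(extract_lines(response))}),
    }
```

With this sketch, the request body has the shape {"Image": "<Base64 encoded image string>"}, where the placeholder is replaced by the actual encoded image.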

1.1.5 Exposing The Lambda Using API Gateway
Currently the Lambda is not exposed over the public internet. It needs to be exposed via an AWS API Gateway endpoint. Follow the steps below.

1. Click “Add trigger” on the Lambda configuration page.

2. Select “API Gateway” and fill out the details as below. Note that “Security” is set to “Open with API key” to protect the Lambda from anonymous access.

3. Click “Add”.

4. Go to the Lambda configuration page and expand the “API Gateway” section to get the API endpoint and API key as displayed below.

1.1.6 Invoking The Lambda Function
Use an API testing tool to invoke the endpoint. In this example “Postman” is used.

1. Add the API endpoint with the headers “x-api-key”, set to the value of “API Key”, and “Content-Type”, set to “application/json”, as below.

2. Copy the sample body provided in section “1.1.4” into the raw body section as below.

3. Execute the call and the response will be returned with the extracted text as below.
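The same call can be made from Python instead of Postman. The sketch below uses only the standard library; the function names are illustrative, and the endpoint URL and API key are whatever API Gateway showed in section “1.1.5”.

```python
import base64
import json
import urllib.request


def build_request(endpoint_url, api_key, image_bytes):
    """Build a POST request carrying the image as a Base64 string."""
    payload = json.dumps({"Image": base64.b64encode(image_bytes).decode("ascii")})
    return urllib.request.Request(
        endpoint_url,
        data=payload.encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def extract_text(endpoint_url, api_key, image_path):
    """Send the image at image_path to the endpoint and return the parsed JSON reply."""
    with open(image_path, "rb") as image_file:
        request = build_request(endpoint_url, api_key, image_file.read())
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (placeholders must be replaced with real values):
# print(extract_text("<API endpoint>", "<API Key>", "sample.png"))
```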

1.2 Extracting Text from an S3 Bucket Image

This example consists of a Lambda function that is triggered whenever an image is uploaded to an S3 bucket. Follow the steps below to create the S3 bucket.

1.2.1 Creating the S3 Bucket

  1. Go to AWS S3 page and click “Create bucket”.
  2. Enter a “Bucket name”, choose the same “Region” that will be used for the Lambda function and click “Next”.
  3. In “Set permissions” section, set the permissions as below.

4. Click “Create bucket”.

1.2.2 Creating The S3 Lambda Trigger
Follow the steps below to create a Lambda that will be executed upon new image uploads.

  1. Go to AWS Lambda service page and click “Create function”.
  2. Select “Use a blueprint” and search for “s3-get-object-python” template and click “Configure”.
  3. Enter a “Function name” and “Role name”, and select the S3 bucket created in the previous step as the “Bucket name”. Make sure to add a “Suffix” to restrict the trigger to PNG images only. Fill out the rest of the settings as below.

4. Click “Create function” and copy the code below. The code sends the uploaded image to AWS Textract and writes the response as a text file with the same name to the S3 bucket.

5. Please follow the steps in section “1.1.2” and add “AmazonTextractFullAccess” and “AmazonS3FullAccess” policies to the “getTextFromS3ImageRole” role that was created for this Lambda. This will provide access to both AWS Textract and S3 services.

6. Please follow the steps in section “1.1.3” and add the “boto3-layer” to the “getTextFromS3Image” Lambda.
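A minimal sketch of what the function body from step 4 could look like is given here. The helper names parse_s3_event and output_key are illustrative; the event layout is the standard S3 notification format, and the Textract call reads the image directly from the bucket.

```python
import posixpath
import urllib.parse


def parse_s3_event(event):
    """Return (bucket, key) for the first record of an S3 notification event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded (for example, spaces become "+").
    key = urllib.parse.unquote_plus(record["object"]["key"])
    return bucket, key


def output_key(key):
    """Same location and name as the source object, with a .txt extension."""
    return posixpath.splitext(key)[0] + ".txt"


def lambda_handler(event, context):
    # boto3 comes from the "boto3-layer"; imported here so the helpers above
    # can be exercised locally without AWS access.
    import boto3

    bucket, key = parse_s3_event(event)

    textract = boto3.client("textract")
    # Synchronous call: Textract reads the image straight from S3.
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]

    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=output_key(key),
        Body="\n".join(lines).encode("utf-8"),
    )
    return {"bucket": bucket, "key": output_key(key)}
```

Because the trigger is restricted to the “png” suffix, writing the “.txt” result into the same bucket does not invoke the function again.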

1.2.3 Testing The S3 Lambda Trigger

Go to the S3 bucket created in step “1.2.1” and upload a PNG image with some text. You can download a sample image from here.

Once the image is uploaded, after a few seconds a text file containing the extracted text should be created in the same location with the same name, as displayed below.

2. Extracting PDF Text from an S3 Bucket Document

This example implements a Lambda that is triggered whenever a PDF document is uploaded to the S3 bucket. The Lambda function starts a text extraction job. Once AWS Textract completes the job, it sends a notification to the AWS Simple Notification Service, which triggers another Lambda. The SNS triggered Lambda reads the job details from the notification payload, retrieves the text extraction result and writes it to a text file in the S3 bucket with the same name as the PDF.

2.1 Creating the S3 Triggered Lambda Function

1. Create another S3 triggered Lambda function following the steps in section “1.2.2”, the only difference being that the suffix “pdf” is used instead of “png” so that it triggers only for PDF documents, as below.

2. Please follow the steps in section “1.1.2” and add “AmazonTextractFullAccess” and “AmazonS3FullAccess” policies to the “getTextFromS3PDFRole” role that was created for this Lambda. This will provide access to both AWS Textract and S3 services.

3. Please follow the steps in section “1.1.3” and add the “boto3-layer” to the “getTextFromS3PDF” Lambda.

The API method “StartDocumentTextDetection” is asynchronous. This method starts a text extraction process and returns the “JobId”. Once the text extraction process completes, it triggers a notification to the AWS Simple Notification Service. Follow the steps below to create an AWS SNS topic.

2.2 Creating The AWS SNS Topic

A Simple Notification Service topic is needed for the Textract service to send a job completed notification along with the “JobId”. Follow the steps below to create a topic.

  1. Go to AWS SNS service -> Topics and click “Create topic”.
  2. Enter a Name, keep the rest of the settings as default and create the topic as below.

2.3 Creating an IAM Role for AWS SNS Access

An IAM role is needed to give AWS Textract access to the AWS SNS service so that it can send the notification.

  1. Go to IAM -> Roles and click “Create role”.
  2. Select Lambda as the “Service that will use this role” as below.

3. Go to “Permissions” section and add “AmazonSNSFullAccess” policy as below.

4. Set the role name to “AWSSNSFullAccessRole” and create the role as below. Please note that under “Trusted entities” the Lambda service is listed. This is because the AWS Textract service was not available to select on the initial “Create role” page. We need to change the “lambda” service to the “textract” service.

5. Go to the “AWSSNSFullAccessRole” role settings, go to “Trust relationships” and click “Edit trust relationship” as below.

6. Change lambda.amazonaws.com to textract.amazonaws.com as below and update the trust policy. The AWS Textract service will now have permission to send notifications to AWS SNS.
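For reference, the edited trust policy should end up looking roughly like the document built below. This is a sketch of the standard IAM trust policy layout with the service principal swapped to Textract; the function name is illustrative, and the printed JSON is what the “Edit trust relationship” editor would contain.

```python
import json


def textract_trust_policy():
    """Trust policy that lets the Textract service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Originally "lambda.amazonaws.com"; changed in step 6.
                "Principal": {"Service": "textract.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


print(json.dumps(textract_trust_policy(), indent=2))
```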

2.4 Implementing The PDF Triggered Lambda

  1. Go to the “getTextFromS3PDF” Lambda code editor and paste the code below.

2. Replace <RoleArn> with the Role ARN found in the IAM role summary section of “AWSSNSFullAccessRole”, which was created in step “2.3”, as below.

3. Replace <SNSTopicArn> with the ARN found in the details section of “PDF_TextProcess_Completed” topic that was created in step “2.2” as below.
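A minimal sketch of the PDF triggered function follows, assuming the two placeholders are kept as module-level constants (the names ROLE_ARN, SNS_TOPIC_ARN and build_job_request are illustrative). It starts the asynchronous job with start_document_text_detection and passes the topic and role through the NotificationChannel parameter.

```python
import urllib.parse

ROLE_ARN = "<RoleArn>"  # replace as described in step 2
SNS_TOPIC_ARN = "<SNSTopicArn>"  # replace as described in step 3


def build_job_request(bucket, key, role_arn=ROLE_ARN, sns_topic_arn=SNS_TOPIC_ARN):
    """Keyword arguments for textract.start_document_text_detection."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract assumes role_arn to publish the completion message to the topic.
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    }


def lambda_handler(event, context):
    # boto3 comes from the "boto3-layer"; imported here so the helper above
    # can be exercised locally without AWS access.
    import boto3

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    textract = boto3.client("textract")
    # Asynchronous call: returns immediately with the JobId of the started job.
    response = textract.start_document_text_detection(**build_job_request(bucket, key))
    return {"JobId": response["JobId"]}
```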

2.5 Creating the AWS SNS Triggered Lambda Function

This Lambda is triggered when the job completed notification arrives from the AWS Textract service.

  1. Create a new Lambda function from scratch with the name “writePDFResultToS3” as below.

2. Please follow the steps in section “1.1.2” and add “AmazonTextractFullAccess” and “AmazonS3FullAccess” policies to the “writePDFResultToS3Role” role that was created for this Lambda. This will provide access to both AWS Textract and S3 services.

3. Please follow the steps in section “1.1.3” and add the “boto3-layer” to the “writePDFResultToS3” Lambda.

4. Click “Add trigger” in the Lambda designer, configure a subscription to the “PDF_TextProcess_Completed” topic that was created in step “2.2” as below and add the trigger.

5. Go to the “writePDFResultToS3” Lambda code editor and paste the code below. This code checks whether the job status is “SUCCEEDED”, retrieves the job result using the “JobId”, processes the result into a text file with the same name as the PDF file and writes it to the S3 bucket.
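A minimal sketch of the SNS triggered function is given here. It assumes the Textract completion message carries “JobId”, “Status” and a “DocumentLocation” with “S3Bucket” and “S3ObjectName”, and it follows “NextToken” because results for longer documents come back in pages. The helper names parse_notification and fetch_lines are illustrative.

```python
import json
import posixpath


def parse_notification(event):
    """Pull the job details out of the SNS message sent by Textract."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    location = message["DocumentLocation"]
    return {
        "job_id": message["JobId"],
        "status": message["Status"],
        "bucket": location["S3Bucket"],
        "key": location["S3ObjectName"],
    }


def fetch_lines(textract, job_id):
    """Collect every LINE block of a finished job, following NextToken pages."""
    lines, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = textract.get_document_text_detection(**kwargs)
        lines += [
            b["Text"] for b in response.get("Blocks", []) if b["BlockType"] == "LINE"
        ]
        next_token = response.get("NextToken")
        if not next_token:
            return lines


def lambda_handler(event, context):
    # boto3 comes from the "boto3-layer"; imported here so the helpers above
    # can be exercised locally without AWS access.
    import boto3

    job = parse_notification(event)
    if job["status"] != "SUCCEEDED":
        return {"status": job["status"]}

    lines = fetch_lines(boto3.client("textract"), job["job_id"])
    out_key = posixpath.splitext(job["key"])[0] + ".txt"
    boto3.client("s3").put_object(
        Bucket=job["bucket"], Key=out_key, Body="\n".join(lines).encode("utf-8")
    )
    return {"status": job["status"], "key": out_key}
```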

2.6 Testing PDF Text Extraction

Go to the S3 bucket and upload a PDF file. You can get a sample PDF file from here. After about 1 minute a text file with the same name as the PDF will be generated as below. This text file contains the text result extracted from the PDF.
