Doc Analyzer

Published in ScaleCapacity, Oct 11, 2020

AWS Textract Based Document Segregation

Automating data extraction from documents using Amazon Textract in Python. This project segregates the documents and files submitted to colleges and other institutions by title and subject, without any manual interaction.

Amazon Textract is a fully managed machine learning service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

Many companies today extract data from scanned documents, such as PDFs, tables, and forms, through manual data entry (which is slow, expensive, and prone to errors) or through simple OCR software that requires manual configuration and must be updated each time the form changes.
To overcome these manual processes, Textract uses machine learning to read and process many types of documents, extracting text, forms, tables, and other data without manual effort or custom code.
With Textract you can quickly automate manual document activities, enabling you to process millions of document pages in hours. Once the information is captured, you can take action on it within your business applications to initiate next steps for a loan application, tax document, enrollment form, or medical claims processing. Additionally, you can create smart search indexes, or add in human reviews with Amazon Augmented AI to review nuanced or sensitive data.

Architecture

The flow of a document through the processing pipeline

Let’s walk through the steps to set up the AWS architecture.

Prerequisite: an AWS account.

Step 1: Create an IAM Role

Create a role with Lambda and Textract as trusted entities.
Attach these policies:
a. AmazonS3ReadOnlyAccess
b. AmazonTextractFullAccess
c. CloudWatchFullAccess
Name the role LambdaS3Textract.
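
As a rough illustration of Step 1, the same role could be described in Python with boto3. This is only a sketch under assumptions: the helper names `build_trust_policy` and `create_role` are hypothetical, and the managed policy ARNs are assumed to be the standard AWS-managed policies matching the three listed above.

```python
import json

# Assumption: the AWS-managed policies that correspond to the list above.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
    "arn:aws:iam::aws:policy/AmazonTextractFullAccess",
    "arn:aws:iam::aws:policy/CloudWatchFullAccess",
]


def build_trust_policy(services=("lambda.amazonaws.com", "textract.amazonaws.com")):
    """Return a trust policy that lets the given AWS services assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": list(services)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_role(role_name="LambdaS3Textract"):
    """Create the role and attach the policies (requires AWS credentials)."""
    import boto3  # imported lazily so build_trust_policy works without boto3

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy()),
    )
    for arn in MANAGED_POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
```

The console steps above achieve the same result; the script form is mainly useful if the setup has to be repeated in another account or region.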

Step 2: Create a Cognito User pool and Identity Pool

Create a User pool that collects name, email, and password.
Create a forgot-password policy.
Enable email for sending the OTP during new user registration.
Create an Identity pool and specify the User pool details.
Once the Identity pool is created, it creates two service roles:
a. Authenticated-User Role
b. Unauthenticated-User Role
Select the Authenticated-User Role and attach a policy that allows access to S3.
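
The article does not say how broad the S3 access for signed-in users should be. As one possible sketch, the policy below only allows uploads under a single prefix. The bucket name, the `Input/` prefix, the policy name, and both helper names are illustrative assumptions, not values from the project.

```python
import json


def build_upload_policy(bucket_name, prefix="Input/"):
    """Return a policy that lets signed-in users upload objects under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/{prefix}*",
            }
        ],
    }


def attach_upload_policy(role_name, bucket_name):
    """Attach the policy inline to the authenticated role (requires AWS credentials)."""
    import boto3  # imported lazily so build_upload_policy works without boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="AllowDocumentUpload",  # hypothetical policy name
        PolicyDocument=json.dumps(build_upload_policy(bucket_name)),
    )
```

Scoping the policy to one prefix keeps web users from reading or overwriting files elsewhere in the bucket; a broader policy works too if your use case needs it.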

Step 3: Create S3 Bucket for Web Hosting Application

Create the S3 bucket in the same region as your other services.
Once the bucket is created, go to Properties and enable static website hosting.
Get the source code from GitHub.
Update the Identity pool ID and User pool details in the code.
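
For reference, static website hosting can also be switched on from Python instead of the console. The sketch below assumes the site uses `index.html` and `error.html` as its index and error pages; those file names and the helper names are assumptions, so adjust them to match the source code you deploy.

```python
def build_website_config(index="index.html", error="error.html"):
    """Return the website configuration expected by S3 static hosting."""
    return {
        "IndexDocument": {"Suffix": index},
        "ErrorDocument": {"Key": error},
    }


def enable_static_hosting(bucket_name):
    """Enable static website hosting on the bucket (requires AWS credentials)."""
    import boto3  # imported lazily so build_website_config works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_website(
        Bucket=bucket_name,
        WebsiteConfiguration=build_website_config(),
    )
```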

Step 4: Create an S3 Bucket for Uploading Files and Storing Documents.

Create an Input folder in the bucket and update this key in the code.

Step 5: Create a Lambda Function

Create a Lambda function with the Python 3.8 runtime.
Select Author from scratch.
Attach the role created in Step 1.
Get the Lambda code from GitHub.
Add a trigger:
a. Select S3.
b. Select the bucket name.
c. Specify the key (prefix).
d. Select PUT as the event type.
e. Specify “.pdf” as the suffix.
Increase the timeout to 5 minutes.
Increase the allocated memory to 256 MB (adjust this to your use case and the complexity of your processing).
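
The trigger and resource settings from Step 5 can be written down as configuration, which makes them easier to review. The following is a sketch only: the `Input/` prefix, the ARN, and the helper names are placeholders, and the function and bucket names must be replaced with your own.

```python
def build_s3_trigger(lambda_arn, prefix="Input/", suffix=".pdf"):
    """Return an S3 notification config that invokes the Lambda on PUT uploads."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:Put"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": prefix},
                            {"Name": "suffix", "Value": suffix},
                        ]
                    }
                },
            }
        ]
    }


def apply_settings(bucket_name, function_name, lambda_arn):
    """Apply the trigger, timeout, and memory settings (requires AWS credentials)."""
    import boto3  # imported lazily so build_s3_trigger works without boto3

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=build_s3_trigger(lambda_arn),
    )
    boto3.client("lambda").update_function_configuration(
        FunctionName=function_name,
        Timeout=300,      # 5 minutes, in seconds
        MemorySize=256,   # in MB
    )
```

When the trigger is added through the console, permission for S3 to invoke the function is granted automatically; when scripting it, that permission has to be added separately on the Lambda function.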

Textract Call and Algorithm to segregate documents

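# Dict is a nested dictionary: each top-level id maps to a dict of known values.
# Value_item is one piece of text extracted from the document.
for id, info in Dict.items():
    for k in info:
        if Value_item == info[k]:
            key = id            # top-level entry that contains the match
            sub_key = info[k]   # the matched value itself
            break               # leaves the inner loop only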

When a document is successfully uploaded to the specified bucket, it triggers the Lambda function. The Lambda function calls the Textract API, which returns the text extracted from the document in JSON format.
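
A minimal sketch of such a handler is shown below. It is not the project's actual Lambda code: the helper names are hypothetical, and it assumes single-page documents, since the synchronous `detect_document_text` call is used here and multi-page PDFs go through Textract's asynchronous `start_document_text_detection` API instead.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Yield (bucket, key) pairs from an S3 notification event."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event payloads (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def lines_from_textract(response):
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def lambda_handler(event, context):
    """Run Textract on each uploaded object and return its lines of text."""
    import boto3  # imported lazily so the helpers above work without boto3

    textract = boto3.client("textract")
    results = {}
    for bucket, key in s3_objects_from_event(event):
        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        results[key] = lines_from_textract(response)
    return results
```

Keeping the event parsing and response parsing in small functions means both can be tested locally with sample JSON, without calling AWS.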

Because this was built for institutional documents that need to be organized by year and subject, the logic compares the text extracted from each document against a list of known keys. A match gives the directory structure under which the document is organized.
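
To make the matching idea concrete, here is a small self-contained sketch. The `FOLDERS` table, the `Unsorted` fallback folder, and the function names are invented for illustration; the real project may structure its keys differently.

```python
# Hypothetical lookup table: year -> {label: subject text to look for}.
FOLDERS = {
    "2019": {"sub1": "Physics", "sub2": "Chemistry"},
    "2020": {"sub1": "Mathematics", "sub2": "Biology"},
}


def find_folder(lines, folders):
    """Return (key, sub_key) for the first extracted line that matches a known value."""
    for line in lines:
        text = line.strip()
        for key, info in folders.items():
            for sub_key in info.values():
                if text == sub_key:
                    return key, sub_key
    return None


def destination_key(filename, lines, folders, fallback="Unsorted"):
    """Build the object key under which the document should be filed."""
    match = find_folder(lines, folders)
    if match is None:
        # No known value found in the document text.
        return f"{fallback}/{filename}"
    key, sub_key = match
    return f"{key}/{sub_key}/{filename}"
```

For example, `destination_key("report.pdf", ["Lab Report", "Physics"], FOLDERS)` yields `2019/Physics/report.pdf`. Exact string comparison is the simplest choice; scanned documents with inconsistent casing or spacing may need a more forgiving comparison.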