Amazon Textract with Boto3

ELAKIA VM
Published in featurepreneur
Aug 7, 2022

Automatically extract printed text, handwriting, and data from any document

What is AWS Textract?

AWS Textract is a deep learning-based service that extracts text and data from different types of documents and returns them in an editable, structured format. Suppose we have hard copies of invoices from different companies and need to store the vital information from them in Excel or another spreadsheet. Usually we rely on data entry operators to type it in manually, which is tedious, time-consuming, and error-prone. With Textract, we only need to upload the invoices, and it returns the text, forms, key-value pairs, and tables in the documents in a structured form. AWS Textract also recognizes handwritten text in the documents.

Why AWS Textract?

This service from Amazon makes it much easier to extract data from many kinds of document formats. Conventional OCR (optical character recognition) software offers a similar facility, but Textract is easier to use and smarter than plain OCR, so its launch will likely affect the OCR market. The biggest and most common problem with plain OCR is that it converts the document to text but does not preserve its structure, such as which value belongs to which form field or table cell.

Some of the benefits of using Amazon Textract include:

  • Integration of document text detection into your apps: Amazon Textract removes the complexity of building text detection capabilities into your applications by making powerful and accurate analysis available through a simple API. You don't need computer vision or deep learning expertise to use Amazon Textract to detect document text. With the Amazon Textract Text APIs, you can build text detection into any web, mobile, or connected device application.
  • Scalable document analysis: Amazon Textract enables you to analyse and extract data quickly from millions of documents, which can accelerate decision-making.
  • Low cost: With Amazon Textract, you only pay for the documents you analyse. There are no minimum fees or upfront commitments. You can get started for free, and save more as you grow with the AWS tiered pricing model.

Starting with boto3:

Prerequisites: the AWS CLI installed and configured with credentials that can access S3 and Textract

Install boto3 Module:

pip install boto3

Install trp Module:

pip install trp

Start with importing the necessary modules.

import boto3
from trp import Document

Here we store the images in an S3 bucket and let Textract read them from there. To follow along, create a bucket and upload the image files to it. In the code, we then set the bucket name and the file names to process.

s3BucketName = "<bucket-name>"
plaintextimage = "<image-name>"
formimage = "<image-name>"
tableimage = "<image-name>"

Then using the boto3 module we will connect to the AWS Textract service.

textractmodule = boto3.client('textract')

For plain text detection:

response = textractmodule.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': plaintextimage
        }
    }
)

print('------------- Print plaintextimage detected text ------------------------------')
for item in response["Blocks"]:
    # Keep only whole lines of text
    if item["BlockType"] == "LINE":
        print(item["Text"])

The detect_document_text() method of the boto3 client detects text in the input document. Amazon Textract can detect lines of text and the words that make up each line. The input document must be in JPEG, PNG, PDF, or TIFF format. DetectDocumentText returns the detected text in an array of Block objects.

The for loop iterates over the blocks in the response. The if condition keeps only the blocks whose BlockType is LINE, skipping the PAGE and WORD blocks, and the text of each line is then printed.
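The filtering step can also be wrapped in a small helper. The following is a minimal sketch rather than part of the original code: the function name extract_lines, the min_confidence parameter, and the hand-written sample_response are assumptions made for illustration. The sample only mimics the Blocks, BlockType, Text, and Confidence fields of a Textract response, so the sketch runs without calling AWS.

```python
def extract_lines(response, min_confidence=0.0):
    """Return the text of every LINE block at or above min_confidence.

    extract_lines and min_confidence are hypothetical names for this sketch.
    """
    lines = []
    for block in response.get("Blocks", []):
        # PAGE and WORD blocks are skipped; only whole lines are kept.
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) >= min_confidence:
            lines.append(block["Text"])
    return lines


# Hand-written stand-in for a detect_document_text() response (assumed data).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 1001", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Invoice", "Confidence": 99.3},
        {"BlockType": "WORD", "Id": "w2", "Text": "1001", "Confidence": 98.9},
        {"BlockType": "LINE", "Id": "l2", "Text": "Tota1 due", "Confidence": 62.4},
    ]
}

print(extract_lines(sample_response))
print(extract_lines(sample_response, min_confidence=90))
```

Passing the real response from detect_document_text() in place of sample_response should work the same way, and a confidence threshold is one possible way to drop lines that Textract was unsure about.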

For form image detection:

response = textractmodule.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': formimage
        }
    },
    FeatureTypes=["FORMS"]
)

# Parse the raw response with the trp helper
doc = Document(response)

print('------------- Print Form detected text ------------------------------')
for page in doc.pages:
    for field in page.form.fields:
        print("Key: {}, Value: {}".format(field.key, field.value))

The analyze_document() method of the boto3 client analyzes the input image and, depending on the requested FeatureTypes, can return the following information:

  • Form data (key-value pairs): Each pair is returned as two Block objects, a key and a value.
  • Table and table cell data: A TABLE block contains information about the table as a whole, and a CELL block is returned for each cell in the table.
  • Lines and words of text: A LINE block contains one or more WORD blocks. All lines and words that are detected in the document are returned.
  • Queries: Each query result is returned as a Block object containing the answer to the query, the associated alias, and an ID that links it to the query asked. This Block also contains a location and a confidence score.

The nested for loops walk through each page and each form field, printing the field's key and value.
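In the raw response, keys and values are linked through block relationships, which is the part the trp Document class handles for us. As a rough sketch of that linkage, assuming the KEY_VALUE_SET, EntityTypes, and Relationships fields of the Textract response format, the helper below rebuilds the pairs by hand. The names block_text and form_fields and the sample data are hypothetical and only stand in for a real analyze_document() response.

```python
def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks listed as CHILD of `block`."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def form_fields(response):
    """Map the text of each KEY block to the text of its VALUE block(s)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            # A VALUE relationship points from the key to its value block.
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(
                        block_text(blocks_by_id[value_id], blocks_by_id))
        fields[block_text(block, blocks_by_id)] = " ".join(value_parts)
    return fields


# Hand-written stand-in for an analyze_document() FORMS response (assumed data).
sample_form_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    ]
}

print(form_fields(sample_form_response))
```

For everyday use the trp version is shorter. The manual version is mainly useful for seeing how the blocks reference each other.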

For Table image detection:

response = textractmodule.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': tableimage
        }
    },
    FeatureTypes=["TABLES"]
)

# Parse the raw response with the trp helper
doc = Document(response)

print('------------- Print Table detected text ------------------------------')
for page in doc.pages:
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))

Here we use the same analyze_document() call, this time with FeatureTypes set to TABLES. The extra for loops over the rows and cells of each table print every cell's text together with its row and column index.
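To work with the table as data instead of printing it, the cells can be collected into a list of rows. The sketch below does this directly on the raw blocks, assuming the TABLE, CELL, RowIndex, and ColumnIndex fields of the Textract response format (the indexes are 1-based). The function name table_rows and the sample data are hypothetical, and only the first table in the response is read.

```python
def table_rows(response):
    """Return the first TABLE block as a list of rows (lists of cell text)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    table = next(
        (b for b in response["Blocks"] if b["BlockType"] == "TABLE"), None)
    if table is None:
        return []

    cells = {}
    for rel in table.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            # Collect the words that belong to this cell.
            words = [
                blocks_by_id[word_id]["Text"]
                for cell_rel in cell.get("Relationships", [])
                if cell_rel["Type"] == "CHILD"
                for word_id in cell_rel["Ids"]
                if blocks_by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    n_rows = max((row for row, _ in cells), default=0)
    n_cols = max((col for _, col in cells), default=0)
    # Cells with no detected words come back as empty strings.
    return [[cells.get((row, col), "") for col in range(1, n_cols + 1)]
            for row in range(1, n_rows + 1)]


# Hand-written stand-in for an analyze_document() TABLES response (assumed data).
sample_table_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Pen"},
    ]
}

for row in table_rows(sample_table_response):
    print(row)
```

A list of rows like this can be handed to the csv module or a spreadsheet library, which fits the invoice-to-spreadsheet use case described at the start.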

Reference: For code

GitHub link: https://github.com/elakiavm/aws-textract-boto3
