Exploring ML Tools — Amazon Textract

Published in Explore ML, May 16, 2020

Documents have been part of our lives for a very long time; however, it is not always easy to find something in a document. With recent advances in machine learning, we can digitize our documents and put them to good use.

Introduction

In this post, we will explore Amazon Textract, which can extract text and data from documents. It uses Optical Character Recognition (OCR) and augmented AI to detect the document layout and the key elements in it.

Amazon Textract has been trained on tens of millions of documents including invoices, contracts, receipts, insurance claims, policy documents, etc.

Features

Key-value pair extraction
Table extraction
Bounding box for every extracted result
Confidence score for every extracted result
Multi-page document support

Use case

The following is one example of how Textract can be used to process documents at large scale with complementary AWS services.

Large scale document processing with Amazon Textract

Usage

The service can be used directly from the AWS console or through the API in your preferred language, such as Java, Python, or JavaScript. We will use boto3, the AWS SDK for Python, to interact with the service.

Input

It accepts images (JPEG or PNG) and PDFs.

For images, we can make a synchronous call by passing the file as bytes or its S3 location, or an asynchronous call by passing its S3 location.

To process a PDF, only asynchronous calls are supported, via its S3 location.
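
The two input styles correspond to two shapes of the request argument. The following is a minimal sketch, assuming two hypothetical helpers (`build_document` and `needs_async`, neither of which is part of boto3) that build the dictionary for either raw bytes or an S3 object and decide which kind of call a file needs:

```python
def build_document(image_bytes=None, bucket=None, key=None):
    """Return a Textract-style document argument.

    Hypothetical helper for illustration: pass either raw image bytes,
    or an S3 bucket name together with an object key.
    """
    if image_bytes is not None:
        return {'Bytes': image_bytes}
    if bucket and key:
        return {'S3Object': {'Bucket': bucket, 'Name': key}}
    raise ValueError("Provide image_bytes, or both bucket and key")


def needs_async(filename):
    """PDFs go through the asynchronous API, as described above."""
    return filename.lower().endswith('.pdf')
```

The returned dictionary can be passed as `Document=` to a synchronous call, or, in its S3 form, as `DocumentLocation=` to an asynchronous one.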

Initialize the Textract client in Python. If you plan to run extraction jobs on files in an S3 bucket, region_name should match the region where the data is located.

Input image for detecting and analyzing text. This is a receipt from a shop in Switzerland.

Output

Detecting text in Image

import boto3

textract_client = boto3.client('textract', region_name='eu-west-1')

# Reading image file
with open('assets/IMG_2093.JPG', 'rb') as f:
    image = f.read()

# Detecting text in Image
img_response_1 = textract_client.detect_document_text(
    Document={
        'Bytes': image
    })

The response contains a lot of information about the analysis. First, we get DocumentMetadata, which holds information such as the number of pages analyzed. We also get many Blocks, each with a BlockType that is one of the following values: 'KEY_VALUE_SET' | 'PAGE' | 'LINE' | 'WORD' | 'TABLE' | 'CELL' | 'SELECTION_ELEMENT'.
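
To get a feel for what a response holds, it can help to count the blocks by type. This is a small sketch that runs on a hand-made dictionary shaped like a Textract response (the sample is an assumption for illustration, not real output), with a hypothetical helper named `count_block_types`:

```python
from collections import Counter


def count_block_types(response):
    """Count how many blocks of each BlockType a response holds."""
    return Counter(block['BlockType'] for block in response.get('Blocks', []))


# Hand-made sample in the shape of a Textract response (not real output)
sample_response = {
    'DocumentMetadata': {'Pages': 1},
    'Blocks': [
        {'BlockType': 'PAGE'},
        {'BlockType': 'LINE', 'Text': 'CUSTOMER COPY'},
        {'BlockType': 'WORD', 'Text': 'CUSTOMER'},
        {'BlockType': 'WORD', 'Text': 'COPY'},
    ],
}

counts = count_block_types(sample_response)
print(sample_response['DocumentMetadata']['Pages'], dict(counts))
```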

To extract text from the output:

for block in img_response_1['Blocks']:
    if (block['BlockType'] == 'LINE'):
        print(block['Text'])

Gate Gourmet Switzerland GmbH
(d/b/a Gate Retail Onboard)
Flight Flight Date 2072 24/05/2017 Sector MUC-LTN 19:25:00
Bar Set 7454
Transaction Type SALE
Transaction Date e:24/05/2017
Transaction Time 10:25 AM
Transaction 5b42b840b1c
Item
Price Oty
Water Still
1.80
3
Coffee Reg
2.50
I
SubTotal
4.30
Net Subtotal (GBP)
4.30
Cash (EUR)
5.50
Terinal 00-02-78-b0-e6-ef 00-00
CUSTOMER COPY
Please retain for Vour records
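
The output above contains a few misreads, such as "Oty" and "Terinal". Since every extracted result carries a confidence score, one option is to flag low-scoring lines for manual review. The sketch below assumes a hypothetical helper `low_confidence_lines` and an arbitrary threshold of 90; the sample confidence values are invented for illustration and are not taken from the real response:

```python
def low_confidence_lines(response, threshold=90.0):
    """Return (text, confidence) for LINE blocks scoring below threshold."""
    return [
        (block['Text'], block['Confidence'])
        for block in response.get('Blocks', [])
        if block['BlockType'] == 'LINE' and block['Confidence'] < threshold
    ]


# Hand-made sample; the confidence values are invented for illustration
sample_response = {
    'Blocks': [
        {'BlockType': 'PAGE'},
        {'BlockType': 'LINE', 'Text': 'CUSTOMER COPY', 'Confidence': 99.1},
        {'BlockType': 'LINE', 'Text': 'Price Oty', 'Confidence': 71.4},
    ]
}

flagged = low_confidence_lines(sample_response)
print(flagged)
```

A suitable threshold depends on the documents and on how costly a misread is, so it is worth tuning on a sample of your own data.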

Along with the text, we also get its location on the page as bounding box information. This can be used to build an indexed search engine over our documents and make them searchable at the word level.

The following examples analyze the image for tables and run Textract asynchronously on a PDF stored in S3.
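
As a rough sketch of the word-level index idea, the snippet below maps each word to the pixel boxes where it occurs. It relies on the bounding box values being ratios of the page width and height; the helper name `build_word_index`, the page size, and the sample block are assumptions made up for illustration:

```python
from collections import defaultdict


def build_word_index(response, page_width, page_height):
    """Map each lowercased word to the pixel boxes where it appears."""
    index = defaultdict(list)
    for block in response.get('Blocks', []):
        if block['BlockType'] != 'WORD':
            continue
        box = block['Geometry']['BoundingBox']
        # Bounding box values are ratios of the page size; scale to pixels
        index[block['Text'].lower()].append({
            'left': round(box['Left'] * page_width),
            'top': round(box['Top'] * page_height),
            'width': round(box['Width'] * page_width),
            'height': round(box['Height'] * page_height),
        })
    return dict(index)


# Hand-made sample block (not real output)
sample_response = {'Blocks': [
    {'BlockType': 'WORD', 'Text': 'Coffee',
     'Geometry': {'BoundingBox': {
         'Left': 0.25, 'Top': 0.5, 'Width': 0.125, 'Height': 0.0625}}},
]}

index = build_word_index(sample_response, page_width=800, page_height=1600)
print(index['coffee'])
```

Looking up a word then returns every place it occurs, which is enough to highlight search hits on the page image.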

# Analyzing text in Image
img_response_2 = textract_client.analyze_document(
    Document={
        'Bytes': image
    },
    FeatureTypes=[
        'TABLES',
    ])

# Asynchronous call to extract text from pdf
asset_bucket = "<bucket_name>"
pdf_prefix = "amazon-textract/assets/pdf/receipt_uber.pdf"

pdf_response_1 = textract_client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': asset_bucket,
            'Name': pdf_prefix
        }
    })
job_id_1 = pdf_response_1['JobId']

# Check the progress of the job started above
response = textract_client.get_document_text_detection(
    JobId=job_id_1
)
if response['JobStatus'] == 'IN_PROGRESS':
    print("Job is still in progress")
else:
    # The job may have succeeded or failed, so report the actual status
    print("Extraction job finished with status:", response['JobStatus'])
    print(response)
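
The status check above runs only once. In practice the job is usually polled until it leaves the IN_PROGRESS state, and large results are read page by page via NextToken. The following is a minimal sketch, assuming a hypothetical helper `wait_for_text_detection` that takes any object behaving like the boto3 Textract client; the polling interval and the limit on attempts are arbitrary choices:

```python
import time


def wait_for_text_detection(client, job_id, poll_seconds=5, max_polls=60):
    """Poll a text detection job, then collect Blocks from every result page."""
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        if response['JobStatus'] != 'IN_PROGRESS':
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(
            f"Job {job_id} still in progress after {max_polls} polls")

    # This sketch treats anything other than SUCCEEDED as an error
    if response['JobStatus'] != 'SUCCEEDED':
        raise RuntimeError(
            f"Job {job_id} ended with status {response['JobStatus']}")

    blocks = list(response.get('Blocks', []))
    # Large results are paginated; follow NextToken until it is absent
    while 'NextToken' in response:
        response = client.get_document_text_detection(
            JobId=job_id, NextToken=response['NextToken'])
        blocks.extend(response.get('Blocks', []))
    return blocks
```

With a real client this would be called as `wait_for_text_detection(textract_client, job_id_1)`. For production workloads, a completion notification is generally a better fit than polling; the loop is only the simplest thing that works.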

Findings

Some key findings are:

It returns the bounding box from which each piece of text was extracted.
It is not angle invariant; that is, the document or image to be analyzed must be aligned vertically.

Pricing

No upfront cost
Free tier: Detecting Text, 1,000 pages/month for the first 3 months; Analyzing Document, 100 pages/month for the first 3 months
Detecting Text: first 1 million pages at $0.0015/page ($1.50 per 1,000 pages); over 1 million pages at $0.0006/page ($0.60 per 1,000 pages)
Pricing details for the Analyze Document API can be found at: https://aws.amazon.com/textract/pricing/
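
The tiered rates for detecting text can be turned into a quick estimate. This sketch uses only the two rates quoted above and ignores the free tier; the function name `detect_text_cost` is made up for illustration, and current prices should be checked on the pricing page:

```python
def detect_text_cost(pages, tier_limit=1_000_000,
                     first_rate=0.0015, over_rate=0.0006):
    """Estimate the text detection cost in USD from the rates quoted above."""
    first_tier_pages = min(pages, tier_limit)
    over_pages = max(pages - tier_limit, 0)
    return first_tier_pages * first_rate + over_pages * over_rate


print(round(detect_text_cost(1_500_000), 2))
```

For 1.5 million pages, this works out to 1,000,000 × $0.0015 + 500,000 × $0.0006, or about $1,800.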