Amazon Textract

AMIT JAIN
Oct 20, 2021


Amazon Textract uses artificial intelligence (AI) to automatically extract printed text, handwriting, and dense text data from scanned documents. It requires no machine learning (ML) experience, configuration, training, or custom code.

Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

This allows you to use Amazon Textract to instantly “read” virtually any type of document and accurately extract text and data without the need for any manual effort or custom code.

The following images show an example document and corresponding extracted text, form, and table data using Amazon Textract in the AWS Management Console.

The following image shows the lines extracted as raw text from the document.

The following image shows the extracted form fields and their corresponding values.

The following image shows the extracted table, cells, and the text in those cells.

To quickly download a zip file containing the output, choose Download results.

You can choose various formats, including raw JSON, text, and CSV files for forms and tables.

In addition to the detected content, Amazon Textract provides extra information, such as confidence scores and bounding boxes for detected elements. This gives you control over how you consume the extracted content and integrate it into various business applications.
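As a rough illustration of how confidence scores and bounding boxes might be consumed, here is a minimal sketch that works on the `Blocks` list of a Textract JSON response. The function names `confident_lines` and `to_pixels`, and the threshold of 90, are assumptions made for this example and are not part of the service.

```python
from typing import Any, Dict, List, Tuple

Block = Dict[str, Any]


def confident_lines(blocks: List[Block], min_confidence: float = 90.0) -> List[Tuple[str, float]]:
    """Return (text, confidence) for LINE blocks at or above the threshold."""
    return [
        (b["Text"], b["Confidence"])
        for b in blocks
        if b["BlockType"] == "LINE" and b["Confidence"] >= min_confidence
    ]


def to_pixels(bounding_box: Dict[str, float], page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Convert a BoundingBox (ratios of page size) to (left, top, right, bottom) pixels."""
    left = round(bounding_box["Left"] * page_width)
    top = round(bounding_box["Top"] * page_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * page_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * page_height)
    return left, top, right, bottom
```

Bounding box values are ratios of the page width and height, so they need the size of the rendered page before they can be drawn on an image.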

Amazon Textract provides both synchronous and asynchronous API actions to extract document text and analyze the document text data. Synchronous APIs can be used for single-page documents and low-latency use cases such as mobile capture.

Asynchronous APIs can be used for multi-page documents such as PDF documents with thousands of pages.
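The difference between the two styles can be sketched as follows. This is a simplified example, not a production pattern: it assumes `client` is a boto3 Textract client (for instance `boto3.client("textract")`), the helper names are made up for illustration, and the asynchronous version polls in a loop and treats any status other than `SUCCEEDED` as an error.

```python
import time
from typing import Any, List


def detect_text_sync(client: Any, image_bytes: bytes) -> List[str]:
    """Synchronous call for a single-page image; returns the detected lines."""
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def detect_text_async(client: Any, bucket: str, key: str, poll_seconds: float = 5.0) -> List[str]:
    """Asynchronous job for a multi-page document stored in Amazon S3."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        page = client.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {page['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    lines: List[str] = []
    while True:
        lines += [b["Text"] for b in page["Blocks"] if b["BlockType"] == "LINE"]
        token = page.get("NextToken")
        if not token:
            return lines
        page = client.get_document_text_detection(JobId=job_id, NextToken=token)
```

Passing the client in as a parameter keeps the functions easy to exercise with a stub. The start call also accepts an Amazon SNS notification channel, which avoids polling altogether.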

Use cases

Text detection from documents

Multi-column detection and reading order

Traditional OCR solutions read left to right, do not detect multiple columns, and end up generating incorrect reading order for multi-column documents. In addition to detecting text, Amazon Textract provides additional geometry information that can be used to detect multiple columns and print the text in reading order.
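One simple way to use that geometry is sketched below. It assumes a single page with exactly two columns separated at the horizontal midpoint, which is a deliberate simplification. The function name `lines_in_reading_order` and the `split_at` parameter are hypothetical.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def lines_in_reading_order(blocks: List[Block], split_at: float = 0.5) -> List[str]:
    """Order LINE blocks column by column, assuming two columns split at `split_at`."""
    lines = [b for b in blocks if b["BlockType"] == "LINE"]

    def center_x(block: Block) -> float:
        box = block["Geometry"]["BoundingBox"]
        return box["Left"] + box["Width"] / 2

    def top(block: Block) -> float:
        return block["Geometry"]["BoundingBox"]["Top"]

    # Left column first, then right column, each read top to bottom.
    left = sorted((b for b in lines if center_x(b) < split_at), key=top)
    right = sorted((b for b in lines if center_x(b) >= split_at), key=top)
    return [b["Text"] for b in left + right]
```

Real layouts with headers that span both columns, or with more than two columns, would need a more careful grouping step than this fixed split.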

Form extraction and processing

Amazon Textract can provide the inputs required to automatically process forms without human intervention.
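For form data, the response links each key to its value through block relationships. The sketch below shows one way to turn those blocks into a plain dictionary. The helper names are illustrative assumptions, and it expects the `Blocks` list from a form analysis response (the `FORMS` feature type).

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def _child_text(block: Block, by_id: Dict[str, Block]) -> str:
    """Join the text of a block's CHILD words and selection elements."""
    parts: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])  # SELECTED or NOT_SELECTED
    return " ".join(parts)


def form_fields(blocks: List[Block]) -> Dict[str, str]:
    """Map each form key to the text of its value."""
    by_id = {b["Id"]: b for b in blocks}
    fields: Dict[str, str] = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET" or "KEY" not in block.get("EntityTypes", []):
            continue
        value = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value = " ".join(_child_text(by_id[i], by_id) for i in rel["Ids"])
        fields[_child_text(block, by_id)] = value
    return fields
```

If a form repeats the same label, later values overwrite earlier ones in this sketch, so a list of pairs may be a better return type for such documents.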

Compliance control with document redaction

AWS helps secure infrastructure so that you can maintain compliance with information controls. For example, you can use Amazon Textract to feed a workflow that automatically redacts personally identifiable information (PII) for review before archiving claim forms. Amazon Textract recognizes the important fields that require protection.
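A redaction workflow could start from something like the sketch below, which collects the bounding boxes of form values whose labels appear in a caller-supplied list. The list of sensitive labels and the function name `redaction_boxes` are assumptions for this example, and drawing the actual redaction rectangles is left to a separate imaging step.

```python
from typing import Any, Dict, List, Set

Block = Dict[str, Any]


def redaction_boxes(blocks: List[Block], sensitive_keys: Set[str]) -> List[Dict[str, float]]:
    """Return bounding boxes of form values whose key label is in `sensitive_keys`."""
    by_id = {b["Id"]: b for b in blocks}

    def related_ids(block: Block, rel_type: str) -> List[str]:
        return [i for r in block.get("Relationships", [])
                if r["Type"] == rel_type for i in r["Ids"]]

    boxes: List[Dict[str, float]] = []
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET" or "KEY" not in block.get("EntityTypes", []):
            continue
        label = " ".join(by_id[i]["Text"] for i in related_ids(block, "CHILD")
                         if by_id[i]["BlockType"] == "WORD")
        # Normalize "SSN:" to "ssn" before comparing with the caller's list.
        if label.strip(" :").lower() in sensitive_keys:
            boxes += [by_id[i]["Geometry"]["BoundingBox"] for i in related_ids(block, "VALUE")]
    return boxes
```

The returned boxes use page-relative ratios, so they have to be scaled to the image size before anything is painted over.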

Table extraction and processing

Amazon Textract can detect tables and their content.
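Tables come back as `TABLE` blocks whose children are `CELL` blocks carrying row and column indexes. The following sketch rebuilds each table as a grid of strings and writes it as CSV. The function names are hypothetical, and merged cells are ignored for simplicity.

```python
import csv
import io
from typing import Any, Dict, List

Block = Dict[str, Any]


def tables_as_rows(blocks: List[Block]) -> List[List[List[str]]]:
    """Rebuild every TABLE block as a list of rows, each a list of cell strings."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block: Block) -> List[str]:
        return [i for r in block.get("Relationships", [])
                if r["Type"] == "CHILD" for i in r["Ids"]]

    tables: List[List[List[str]]] = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells: Dict[tuple, str] = {}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            text = " ".join(by_id[i]["Text"] for i in child_ids(cell)
                            if by_id[i]["BlockType"] == "WORD")
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = text
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        # RowIndex and ColumnIndex are 1-based in the response.
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def to_csv(rows: List[List[str]]) -> str:
    """Serialize one table to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```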

Handwriting Recognition

Many documents, such as medical intake forms or employment applications, contain both handwritten and printed text. Amazon Textract can extract printed text and handwriting from documents written in English with high confidence scores, whether it is free-form text or text embedded in tables and forms. Documents can also contain a mix of typed and handwritten text.
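Each `WORD` block in the response carries a `TextType` of `PRINTED` or `HANDWRITING`, which makes it possible to separate the two. A minimal sketch, with the function name `split_by_text_type` chosen only for illustration:

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def split_by_text_type(blocks: List[Block]) -> Dict[str, List[str]]:
    """Group WORD blocks into printed and handwritten words."""
    result: Dict[str, List[str]] = {"PRINTED": [], "HANDWRITING": []}
    for block in blocks:
        if block["BlockType"] == "WORD" and block.get("TextType") in result:
            result[block["TextType"]].append(block["Text"])
    return result
```

This could be used, for example, to route documents with a lot of handwriting to a human review step.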

Invoices and Receipts

Amazon Textract can extract relevant data, such as contact information, items purchased, and vendor name, from almost any invoice or receipt without the need for any templates or configuration. Invoices and receipts come in various layouts, which makes it difficult and time-consuming to manually extract data at scale. Amazon Textract uses ML to understand the context of invoices and receipts and automatically extracts data such as vendor name, invoice number, item prices, total amount, and payment terms to suit your business needs.
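Invoice and receipt data is returned by the AnalyzeExpense action (`analyze_expense` in boto3) as summary fields and line items. The sketch below flattens such a response into simple dictionaries. The function name `expense_summary` and the output shape are assumptions for this example, and fields without a detected value are skipped.

```python
from typing import Any, Dict, List


def _fields_to_dict(fields: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each field's normalized type (for example VENDOR_NAME) to its value text."""
    return {
        f.get("Type", {}).get("Text", "OTHER"): f["ValueDetection"]["Text"]
        for f in fields
        if "ValueDetection" in f
    }


def expense_summary(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten an AnalyzeExpense response into summary fields and line items."""
    documents: List[Dict[str, Any]] = []
    for doc in response.get("ExpenseDocuments", []):
        line_items = [
            _fields_to_dict(item.get("LineItemExpenseFields", []))
            for group in doc.get("LineItemGroups", [])
            for item in group.get("LineItems", [])
        ]
        documents.append({
            "summary": _fields_to_dict(doc.get("SummaryFields", [])),
            "line_items": line_items,
        })
    return documents
```

Several fields can share the same type, in which case the last one wins in this sketch. A fuller version would keep all of them along with their confidence scores.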

PDF document processing
