Unveiling Amazon Textract: An In-Depth Exploration

Rumana Shaikh
Ankercloud Engineering
Oct 9, 2023

In today’s data-driven world, extracting information from documents, whether they’re printed or handwritten, is a critical task. Amazon Textract, a part of Amazon Web Services (AWS), has emerged as a powerful solution for automating this process. In this blog, we’ll delve into Amazon Textract, exploring its features, use cases, benefits, and how it can revolutionise document analysis for businesses and organisations.

What is Amazon Textract?

Amazon Textract is a fully managed machine learning service offered by AWS. Its primary purpose is to extract text and data from documents in various formats, including PDFs, images, and scanned documents. Textract employs state-of-the-art machine learning techniques to automatically process documents, making it a versatile tool for businesses and organisations dealing with large volumes of unstructured data.

Key Features of Amazon Textract

Let’s take a closer look at some of the key features that make Amazon Textract a game-changer:

1. Document Text Extraction:

Amazon Textract can accurately extract text from a wide range of document types, including contracts, invoices, forms, and reports. This text extraction includes both printed and handwritten content.
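
As a rough illustration of what text extraction returns, the sketch below pulls the detected lines out of a response shaped like the output of the DetectDocumentText operation. The sample response is hand-written for this example, not real Textract output, and the helper name extract_lines is hypothetical.

```python
from typing import Any, Dict, List


def extract_lines(response: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block, in the order it appears in the response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written sample that mimics the shape of a DetectDocumentText response.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 1042", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Invoice", "Confidence": 99.3},
        {"BlockType": "WORD", "Id": "w2", "Text": "1042", "Confidence": 98.9},
        {"BlockType": "LINE", "Id": "l2", "Text": "Total due: 250.00", "Confidence": 97.4},
    ]
}

lines = extract_lines(sample_response)
print(lines)
```

WORD blocks repeat the same text at a finer granularity, so the sketch reads only LINE blocks to avoid duplicates.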

2. Form and Table Data Extraction:

Amazon Textract automatically recognizes tables, forms, and structured data in documents, including key-value pairs. For example, ‘First Name’ (key) and ‘Jane’ (value) are identified as linked data items. This lets you import the data into a database or bind it to variables in an application without the manual intervention or hand-written rules that traditional OCR solutions require.


Amazon Textract extracts data from tables, useful for structured documents like financial reports or medical records. It preserves table composition, enabling easy database integration with predefined schemas. For instance, item numbers and quantities in an inventory report remain associated for streamlined inventory management.
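
To make the idea of preserved table composition concrete, here is a minimal sketch that rebuilds rows from TABLE and CELL blocks. It assumes a response from AnalyzeDocument with the TABLES feature; the sample blocks are hand-written, the function names are hypothetical, and merged cells are ignored for brevity.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def child_ids(block: Block) -> List[str]:
    """Collect the Ids listed under a block's CHILD relationships."""
    ids: List[str] = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def table_to_rows(table: Block, blocks_by_id: Dict[str, Block]) -> List[List[str]]:
    """Rebuild a TABLE block as a list of rows, keeping each cell in its column."""
    cells = [blocks_by_id[i] for i in child_ids(table)]
    cells = [c for c in cells if c["BlockType"] == "CELL"]
    if not cells:
        return []
    row_count = max(c["RowIndex"] for c in cells)
    column_count = max(c["ColumnIndex"] for c in cells)
    rows = [[""] * column_count for _ in range(row_count)]
    for cell in cells:
        words = [
            blocks_by_id[i]["Text"]
            for i in child_ids(cell)
            if blocks_by_id[i]["BlockType"] == "WORD"
        ]
        # RowIndex and ColumnIndex are 1-based in the response.
        rows[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return rows


# Hand-written sample: a 2x2 inventory table.
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Quantity"},
    {"Id": "w4", "BlockType": "WORD", "Text": "A-100"},
    {"Id": "w5", "BlockType": "WORD", "Text": "12"},
]

blocks_by_id = {block["Id"]: block for block in sample_blocks}
rows = table_to_rows(blocks_by_id["t1"], blocks_by_id)
print(rows)
```

Because each cell keeps its row and column index, the item number and its quantity stay on the same row, which is what makes loading into a predefined schema straightforward.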


3. Automatic Content Classification:

This feature automatically classifies document pages into predefined categories so that each page can be routed to the right processing workflow. In Textract this capability is offered through Analyze Lending (feature 9 below), which identifies the document types in a mortgage package.

4. Page-Level Relationships:

Amazon Textract maintains element relationships, including text, tables, and images, ensuring contextually accurate data extraction. Extracted data includes bounding box coordinates, aiding in source document reference and guiding users in document searches, such as medical record queries for patient history.
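
Bounding boxes in a Textract response are expressed as ratios of the page width and height, so they need to be scaled before you can highlight the source region on an image. The sketch below shows that conversion; the function name and the page size used in the example are assumptions chosen for illustration.

```python
from typing import Dict, Tuple


def to_pixel_box(
    bounding_box: Dict[str, float], image_width: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert a ratio-based BoundingBox into (left, top, right, bottom) pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return left, top, right, bottom


# Hand-written sample geometry, as found under block["Geometry"]["BoundingBox"].
sample_box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}

pixel_box = to_pixel_box(sample_box, image_width=1000, image_height=800)
print(pixel_box)
```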


5. Optical character recognition

Amazon Textract uses optical character recognition (OCR) to automatically detect printed text, handwriting, and numbers in a scan or rendering of a document, such as a legal document or a scan of a book.


6. Analyze ID

This API reads and extracts data from identity documents such as U.S. driver’s licenses and U.S. passports, making it easy for customers to automate and expedite their document processing. You can immediately start extracting implied fields like name and address, as well as explicit fields like date of birth, date of issue, date of expiry, ID number, ID type, and much more, in the form of key-value pairs.
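
The key-value output of Analyze ID can be flattened into a plain dictionary. The sketch below assumes the response shape of the AnalyzeID operation (IdentityDocuments containing IdentityDocumentFields); the sample response is hand-written and the helper name id_fields is hypothetical.

```python
from typing import Any, Dict


def id_fields(response: Dict[str, Any], document_index: int = 0) -> Dict[str, str]:
    """Map each detected field type (e.g. FIRST_NAME) to its text, skipping empty values."""
    document = response["IdentityDocuments"][document_index]
    fields: Dict[str, str] = {}
    for field in document.get("IdentityDocumentFields", []):
        value = field.get("ValueDetection", {}).get("Text", "")
        if value:
            fields[field["Type"]["Text"]] = value
    return fields


# Hand-written sample that mimics the shape of an AnalyzeID response.
sample_response = {
    "IdentityDocuments": [
        {
            "DocumentIndex": 1,
            "IdentityDocumentFields": [
                {"Type": {"Text": "FIRST_NAME"},
                 "ValueDetection": {"Text": "JANE", "Confidence": 99.0}},
                {"Type": {"Text": "LAST_NAME"},
                 "ValueDetection": {"Text": "DOE", "Confidence": 98.7}},
                {"Type": {"Text": "MIDDLE_NAME"},
                 "ValueDetection": {"Text": "", "Confidence": 95.0}},
                {"Type": {"Text": "DATE_OF_BIRTH"},
                 "ValueDetection": {"Text": "01/02/1990", "Confidence": 97.5}},
            ],
        }
    ]
}

fields = id_fields(sample_response)
print(fields)
```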


7. Signature Detection

Amazon Textract provides the ability to detect signatures on any document or image. This makes it easy to automatically detect signatures on documents such as checks, loan application forms, and claims forms. The location of the signatures and associated confidence scores are included in the API response.
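
A short sketch of how the signature locations and confidence scores might be read from a response follows. It assumes AnalyzeDocument was called with the SIGNATURES feature, which returns blocks of type SIGNATURE; the sample blocks and the 80 percent cut-off are made up for illustration.

```python
from typing import Any, Dict, List


def find_signatures(blocks: List[Dict[str, Any]], min_confidence: float = 0.0) -> List[Dict[str, Any]]:
    """Return page, confidence, and bounding box for each SIGNATURE block above the cut-off."""
    return [
        {
            "page": block.get("Page", 1),
            "confidence": block["Confidence"],
            "box": block["Geometry"]["BoundingBox"],
        }
        for block in blocks
        if block["BlockType"] == "SIGNATURE" and block["Confidence"] >= min_confidence
    ]


# Hand-written sample blocks for a two-page loan application.
sample_blocks = [
    {"BlockType": "LINE", "Text": "Applicant signature", "Confidence": 99.0, "Page": 2},
    {"BlockType": "SIGNATURE", "Confidence": 91.2, "Page": 2,
     "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.8, "Width": 0.3, "Height": 0.05}}},
    {"BlockType": "SIGNATURE", "Confidence": 48.0, "Page": 1,
     "Geometry": {"BoundingBox": {"Left": 0.6, "Top": 0.9, "Width": 0.2, "Height": 0.04}}},
]

signatures = find_signatures(sample_blocks, min_confidence=80.0)
print(signatures)
```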


8. Analyze Expense

Amazon Textract automates data extraction from invoices and receipts, regardless of layouts. It extracts key details like date, number, prices, and payment terms. It can also identify vendor names, even within logos. Textract standardises data across document types and categorizes non-standard fields as “OTHER.”
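
The following sketch shows one way to flatten an Analyze Expense result, including the fields Textract labels as OTHER. It assumes the AnalyzeExpense response shape (ExpenseDocuments with SummaryFields and LineItemGroups); the sample document and the function name summarize_expense are hypothetical.

```python
from typing import Any, Dict


def summarize_expense(expense_document: Dict[str, Any]) -> Dict[str, Any]:
    """Split summary fields into standard fields and OTHER fields, and list line items."""
    summary: Dict[str, str] = {}
    other = []
    for field in expense_document.get("SummaryFields", []):
        field_type = field.get("Type", {}).get("Text", "OTHER")
        value = field.get("ValueDetection", {}).get("Text", "")
        if field_type == "OTHER":
            # Non-standard fields keep the label printed on the document.
            label = field.get("LabelDetection", {}).get("Text", "")
            other.append((label, value))
        else:
            summary[field_type] = value
    line_items = []
    for group in expense_document.get("LineItemGroups", []):
        for item in group.get("LineItems", []):
            line_items.append({
                f["Type"]["Text"]: f["ValueDetection"]["Text"]
                for f in item.get("LineItemExpenseFields", [])
            })
    return {"summary": summary, "other": other, "line_items": line_items}


# Hand-written sample that mimics one entry of response["ExpenseDocuments"].
sample_document = {
    "SummaryFields": [
        {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Acme Supplies"}},
        {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "250.00"}},
        {"Type": {"Text": "OTHER"}, "LabelDetection": {"Text": "Cashier"},
         "ValueDetection": {"Text": "Sam"}},
    ],
    "LineItemGroups": [
        {"LineItems": [
            {"LineItemExpenseFields": [
                {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Paper"}},
                {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "250.00"}},
            ]}
        ]}
    ],
}

result = summarize_expense(sample_document)
print(result)
```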

9. Analyze Lending

Analyze Lending is an API for rapid mortgage document analysis. It classifies the pages in a document package, routes each one to the appropriate Amazon Textract analysis, and returns categorized results. To try it in the console demo, upload your document package (up to 10 pages, under 5 MB) in JPEG, PNG, TIFF, or PDF format. The demo lists the detected document types under ‘Select a document’ and shows the analysis details under ‘Document results’, including confidence scores for page classification and data extraction.


10. Query based extraction

Amazon Textract simplifies data extraction from documents using natural language queries. Specify what you need (e.g., “customer name”) and receive precise responses (e.g., “John Doe”). No need to understand document structures or worry about variations. Textract Queries are trained on diverse documents, reducing post-processing, manual reviews, and ML model training needs.
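
As a sketch of how queries fit into a request and how answers come back, the code below builds a QueriesConfig for AnalyzeDocument (used with the QUERIES feature) and then matches QUERY blocks to their QUERY_RESULT blocks. The aliases, questions, and sample blocks are invented for this example.

```python
from typing import Any, Dict, List, Optional


def build_queries_config(questions: Dict[str, str]) -> Dict[str, Any]:
    """Turn {alias: question} into the QueriesConfig structure of AnalyzeDocument."""
    return {"Queries": [{"Text": text, "Alias": alias} for alias, text in questions.items()]}


def extract_answers(blocks: List[Dict[str, Any]]) -> Dict[str, Optional[str]]:
    """Map each query alias (or its text) to the highest-confidence answer, or None."""
    by_id = {block["Id"]: block for block in blocks}
    answers: Dict[str, Optional[str]] = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        name = block["Query"].get("Alias") or block["Query"]["Text"]
        best = None
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for result_id in relationship["Ids"]:
                result = by_id[result_id]
                if best is None or result["Confidence"] > best["Confidence"]:
                    best = result
        answers[name] = best["Text"] if best else None
    return answers


config = build_queries_config({
    "CUSTOMER_NAME": "What is the customer name?",
    "DUE_DATE": "What is the due date?",
})

# Hand-written sample blocks: one query answered, one left unanswered.
sample_blocks = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the customer name?", "Alias": "CUSTOMER_NAME"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["r1"]}]},
    {"Id": "r1", "BlockType": "QUERY_RESULT", "Text": "John Doe", "Confidence": 96.0},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "What is the due date?", "Alias": "DUE_DATE"}},
]

answers = extract_answers(sample_blocks)
print(config)
print(answers)
```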


11. Handwriting recognition

Many documents, such as medical intake forms and employment applications, mix handwritten and printed text. Amazon Textract can extract both from documents written in English with high confidence scores, whether the text is free-form or embedded in tables.


12. Invoices and receipts

Invoices and receipts can have a wide variety of layouts, which makes it difficult and time-consuming to manually extract data at scale. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts and automatically extracts relevant data such as vendor name, invoice number, item prices, total amount, and payment terms.


13. Identity documents

Amazon Textract employs ML to understand identity documents (e.g., passports, licenses) without templates. It extracts data like expiry and birthdate and identifies name and address. This aids ID verification, automating tasks like account setup, appointment booking, and applications, using customer-submitted identity document scans or photos.


14. Adjustable confidence thresholds

Amazon Textract provides confidence scores for the information it identifies in documents. You can set thresholds based on these scores for different document types. For critical documents like tax records, consider sending anything below a high threshold (e.g., 95%) to human review. For less critical documents like resumes or archives, you can accept results at lower thresholds.
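
Per-document-type thresholds can be applied in a few lines of post-processing. In the sketch below, the threshold table, the default value, and the document type names are all hypothetical; only the Confidence field (a 0 to 100 score on each block) comes from the Textract response.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical thresholds; tune these for your own documents.
REVIEW_THRESHOLDS = {"tax_record": 95.0, "resume": 80.0}
DEFAULT_THRESHOLD = 90.0


def split_by_confidence(
    blocks: List[Dict[str, Any]], document_type: str
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split blocks into (accepted, needs_review) using the threshold for the document type."""
    threshold = REVIEW_THRESHOLDS.get(document_type, DEFAULT_THRESHOLD)
    accepted, needs_review = [], []
    for block in blocks:
        if "Confidence" not in block:
            continue  # some block types carry no score
        if block["Confidence"] >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review


# Hand-written sample lines with different confidence scores.
sample_blocks = [
    {"BlockType": "LINE", "Text": "Taxable income", "Confidence": 99.2},
    {"BlockType": "LINE", "Text": "48,200", "Confidence": 91.0},
    {"BlockType": "LINE", "Text": "Signature date", "Confidence": 72.5},
]

tax_accepted, tax_review = split_by_confidence(sample_blocks, "tax_record")
resume_accepted, resume_review = split_by_confidence(sample_blocks, "resume")
print(len(tax_accepted), len(tax_review))
print(len(resume_accepted), len(resume_review))
```

The same three lines produce more review work under the stricter tax-record threshold than under the resume threshold, which is the trade-off the thresholds are meant to control.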


15. Built-in human review workflow

Amazon Textract is directly integrated with Amazon Augmented AI (A2I) so you can easily implement human review of printed text and handwriting extracted from documents. Many text-extraction applications require humans to review low-confidence predictions to ensure the results are correct, but building human review systems can be time-consuming and expensive. Amazon A2I provides built-in human review workflows so you can review predictions easily. Choose a confidence threshold for your application, and all predictions with a confidence below the threshold are automatically sent to human reviewers for validation. You can also specify which key-value pairs should be sent for human review and configure A2I to send randomly selected documents for review as well. Use a pool of reviewers within your organization or access the workforce of over 500,000 independent contractors who are already performing ML tasks through Amazon Mechanical Turk. You can also use workforce vendors that are pre-screened by AWS for quality and adherence to security procedures. To learn more about implementing human review workflows, see the Amazon A2I website and Amazon A2I Integration with Amazon Textract in the developer guide.
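
To show where the A2I integration plugs in, the sketch below assembles the arguments for an AnalyzeDocument call that carries a HumanLoopConfig. The bucket, object key, loop name, and flow definition ARN are placeholders, and the sketch assumes the confidence conditions themselves are configured in the A2I flow definition rather than in this call.

```python
from typing import Any, Dict


def build_human_review_request(
    bucket: str, key: str, flow_definition_arn: str, human_loop_name: str
) -> Dict[str, Any]:
    """Build keyword arguments for analyze_document with an A2I human loop attached."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["FORMS"],
        "HumanLoopConfig": {
            "HumanLoopName": human_loop_name,
            "FlowDefinitionArn": flow_definition_arn,
        },
    }


# Placeholder values for illustration only.
request = build_human_review_request(
    bucket="example-documents",
    key="forms/intake-001.png",
    flow_definition_arn="arn:aws:sagemaker:us-east-1:111122223333:flow-definition/example-flow",
    human_loop_name="intake-001-review",
)
print(request)
```

With boto3, the resulting dictionary could be passed as boto3.client("textract").analyze_document(**request).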

16. Amazon Textract pricing

Amazon Textract is a machine learning (ML) service for text, handwriting, and data extraction from scanned documents, including forms and tables. It uses a pay-as-you-go pricing model with no minimum fees or upfront commitments. Charges are based on the number of pages processed and on the features used, such as text, tables, forms, or queries. Refer to the FAQ for more details on page limits and usage policies.

17. Scalability and Integration:

Being a part of AWS, Textract seamlessly integrates with other AWS services like S3, Lambda, and Step Functions. This means you can easily scale your document processing workflows to handle large volumes of documents efficiently.

18. Security and Compliance:

Textract adheres to AWS’s robust security standards and compliance certifications, ensuring that sensitive data remains protected throughout the document processing pipeline.

Use Cases for Amazon Textract

Amazon Textract finds applications across various industries and domains. Here are some notable use cases:

1. Document Digitisation:

Businesses can use Textract to convert paper-based documents into digital formats. This is particularly valuable for transitioning to paperless operations and enabling efficient document retrieval.

2. Content Search and Analysis:

Textract enables organisations to search and analyse large volumes of documents quickly. This is crucial for compliance, legal, and research purposes.

3. Invoice Processing:

Amazon Textract streamlines invoice processing by extracting key invoice details like invoice numbers, due dates, and line items. It returns dates as detected, supporting various date formats.

For form data, Textract returns Block objects of type KEY_VALUE_SET that contain linked text information, with the EntityTypes field distinguishing a KEY block from a VALUE block. A KEY block references its associated VALUE blocks and the child WORD blocks that compose the key’s text.

VALUE blocks contain text associated with a key and reference child WORD blocks representing individual words. Textract assigns a confidence value to KEY_VALUE_SET pairs but provides a separate confidence value for WORD blocks.

KEY_VALUE_SET Block objects are child elements of PAGE Block objects, each corresponding to a page in the document.
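
The block relationships described above can be walked as follows. This is a minimal sketch that assumes a response from AnalyzeDocument with the FORMS feature; the sample blocks are hand-written (reusing the ‘First Name’ and ‘Jane’ example from earlier) and the function names are hypothetical.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def block_text(block: Block, blocks_by_id: Dict[str, Block]) -> str:
    """Join the text of the WORD blocks listed under a block's CHILD relationships."""
    words: List[str] = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks: List[Block]) -> Dict[str, str]:
    """Follow each KEY block's VALUE relationship and return {key text: value text}."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs: Dict[str, str] = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_parts = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                for value_id in relationship["Ids"]:
                    value_parts.append(block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[block_text(block, blocks_by_id)] = " ".join(p for p in value_parts if p)
    return pairs


# Hand-written sample blocks for a single form field.
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"], "Confidence": 95.0,
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"], "Confidence": 95.0,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "First", "Confidence": 99.0},
    {"Id": "w2", "BlockType": "WORD", "Text": "Name", "Confidence": 99.1},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane", "Confidence": 98.4},
]

pairs = extract_key_values(sample_blocks)
print(pairs)
```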

4. Forms and Surveys:

Textract can automatically extract data from surveys, questionnaires, and application forms. This is valuable for market research and customer feedback analysis.

5. Healthcare Records:

In the healthcare sector, Textract can be used to extract patient information from medical records, improving accuracy and efficiency in healthcare operations.

Benefits of Amazon Textract

Embracing Amazon Textract brings several benefits to organisations:

Efficiency:

Automated document processing reduces manual data entry and minimizes human errors, resulting in increased efficiency and productivity.

Cost Savings:

By automating document extraction and analysis, organisations can save on labor costs associated with manual data entry and processing.

Accuracy:

Textract’s machine learning capabilities ensure accurate data extraction, even from handwritten documents, leading to more reliable data.

Scalability:

Textract can handle massive document processing workloads, making it suitable for businesses of all sizes.

Enhanced Insights:

By extracting structured data from documents, Textract enables better data analysis and decision-making.

Getting Started with Amazon Textract

To start using Amazon Textract, you can follow these general steps:

1. Sign Up for AWS: If you’re not already an AWS customer, sign up for an AWS account.

2. Access Amazon Textract: Navigate to the AWS Management Console, locate the Textract service, and configure it to suit your needs.

3. Upload Documents: Upload your documents to an S3 bucket or use other available integration options.

4. Process Documents: Use Textract’s APIs or SDKs to process and extract data from your documents.

5. Integrate and Analyze: Integrate the extracted data into your applications or analyze it using other AWS services.
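
Steps 3 and 4 can be sketched in Python as follows. The function takes the Textract client as a parameter so the example runs without AWS credentials; the recording stand-in class, the bucket name, and the object key are all hypothetical. The analyze_document call and its Document and FeatureTypes parameters follow the boto3 Textract client.

```python
from typing import Any, Dict, Sequence


def analyze_s3_document(
    textract_client: Any,
    bucket: str,
    key: str,
    feature_types: Sequence[str] = ("FORMS", "TABLES"),
) -> Dict[str, Any]:
    """Run synchronous analysis on a document that is already stored in S3."""
    return textract_client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=list(feature_types),
    )


class RecordingClient:
    """Stand-in for a Textract client that records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls = []

    def analyze_document(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {"Blocks": []}


client = RecordingClient()
response = analyze_s3_document(client, "example-documents", "invoices/invoice-1.png")
print(client.calls[0])
```

In a real environment you would pass boto3.client("textract") instead of the stand-in and then feed response["Blocks"] to your own parsing code.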

Sample Textract Use-case

Problem:
Let us say we need to build a pipeline that takes in documents automatically and stores them in digitised form. The documents are multiple invoices from vendors.

Solution:

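For this use case we will call Amazon Textract from AWS Lambda functions written in Python, together with the following services:

  1. Simple Storage Service (S3)
  2. Identity and Access Management (IAM)
  3. Lambda
  4. Textract

The pipeline described below also relies on Simple Queue Service (SQS) and Simple Notification Service (SNS) for messaging.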

The document processing pipeline operates as follows:

Initiation of Document Analysis: The process commences when a message is dispatched to an SQS queue, signalling the initiation of a document analysis.

Job Scheduler Lambda Function: A dedicated Lambda function, acting as a job scheduler, executes at predefined intervals (e.g., every 5 minutes). This Lambda function polls the designated SQS queue for pending messages.

Submission of Textract Jobs: For each message retrieved from the SQS queue, the job scheduler Lambda function initiates an Amazon Textract job to process the associated document. This process is repeated until the maximum limit of concurrent jobs for the AWS account is reached.

Document Processing Completion Notification: Upon completion of processing, Amazon Textract sends a notification to an SNS topic.

SNS Topic Triggers Job Scheduler Lambda: The SNS topic triggers the job scheduler Lambda function, prompting it to initiate the next set of Amazon Textract jobs.

SQS Message for Result Retrieval: Simultaneously, SNS dispatches a message to a designated SQS queue.

Lambda Function for Result Retrieval: Another Lambda function is tasked with processing the message in the SQS queue. This Lambda function interacts with Amazon Textract to retrieve the results and subsequently organizes them into relevant folders within the specified S3 bucket.
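
The core calls in this pipeline can be sketched as three small helpers: one that starts an asynchronous job with an SNS notification channel, one that unwraps the completion message from an SQS record, and one that pages through the results. This is an illustration under stated assumptions, not the referenced guide's code: the ARNs, bucket, and key are placeholders, the FakeTextract class stands in for a real client, and the SQS record is assumed to carry the standard SNS envelope (raw message delivery disabled).

```python
import json
from typing import Any, Dict, List


def start_analysis_job(client: Any, bucket: str, key: str,
                       sns_topic_arn: str, role_arn: str) -> str:
    """Start an asynchronous Textract job that notifies an SNS topic on completion."""
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def job_from_sqs_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Unwrap the Textract completion message from an SNS envelope inside an SQS record."""
    envelope = json.loads(record["body"])
    return json.loads(envelope["Message"])


def fetch_all_blocks(client: Any, job_id: str) -> List[Dict[str, Any]]:
    """Collect every block of a finished job, following NextToken across result pages."""
    blocks: List[Dict[str, Any]] = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks


class FakeTextract:
    """In-memory stand-in for a Textract client, returning two pages of results."""

    def __init__(self) -> None:
        self.last_start: Dict[str, Any] = {}
        self._pages = {
            None: {"JobStatus": "SUCCEEDED", "Blocks": [{"Id": "1"}], "NextToken": "t2"},
            "t2": {"JobStatus": "SUCCEEDED", "Blocks": [{"Id": "2"}]},
        }

    def start_document_analysis(self, **kwargs: Any) -> Dict[str, Any]:
        self.last_start = kwargs
        return {"JobId": "job-123"}

    def get_document_analysis(self, **kwargs: Any) -> Dict[str, Any]:
        return self._pages[kwargs.get("NextToken")]


client = FakeTextract()
job_id = start_analysis_job(
    client, "example-invoices", "incoming/invoice-1.pdf",
    "arn:aws:sns:us-east-1:111122223333:textract-done",   # placeholder ARN
    "arn:aws:iam::111122223333:role/textract-publish",     # placeholder ARN
)

# Simulated SQS record as the result-retrieval Lambda function would receive it.
record = {"body": json.dumps({"Message": json.dumps({"JobId": job_id, "Status": "SUCCEEDED"})})}
message = job_from_sqs_record(record)
blocks = fetch_all_blocks(client, message["JobId"]) if message["Status"] == "SUCCEEDED" else []
print(job_id, [block["Id"] for block in blocks])
```

Passing the client in as a parameter keeps the helpers testable; inside a Lambda function you would create the client once with boto3.client("textract") and write the collected blocks to S3.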

Workflow:

A complete step-by-step guide is available in the References section under “StepbyStep Pipeline Generation”.

Conclusion

Amazon Textract is a groundbreaking AWS service that simplifies document extraction, analysis, and digitization. By harnessing the power of machine learning, Textract offers businesses a way to automate document processing workflows, increase efficiency, and gain valuable insights from their data. Whether you’re dealing with invoices, contracts, healthcare records, or any other type of document, Amazon Textract can help unlock the hidden information within. Start exploring this powerful tool today and transform the way you handle documents in your organization.

References

Textract Demos
Amazon Textract Developer Guide: Detailed Instructions
Getting Started with Amazon Textract Documentation
How Amazon Textract Works
Tutorials
StepbyStep Pipeline Generation

Thank you for reading this blog post. I will be publishing new posts on Gen AI soon. Feel free to connect with me on LinkedIn with any questions, and check out my recent paper on Gen AI.
