Extracting Expense Document Information Using AWS Textract

Using AWS machine learning services to extract expense information

Alejandro Castañeda Ocampo
Globant
5 min read · Mar 1, 2024

Image source: Unsplash

In the ever-evolving landscape of cloud computing, extracting meaningful information from documents has become critical for applications ranging from automation to data analytics. In this article, we’ll explore how to extract information from images of expense documents, such as receipts and invoices, using AWS Textract.

Overview Of AWS Textract

AWS Textract is a fully managed machine learning service by Amazon Web Services, designed to extract text and data from documents. Using advanced machine learning models, Textract converts unstructured data from scanned documents, PDFs, and images into structured, actionable information. Textract has the following features:

  • Uses Optical Character Recognition (OCR) to extract printed and handwritten text from images and scanned documents.
  • Understands document layouts and recognizes forms, tables, and structured data.
  • Extracts key information like names and dates associated with specific fields, making it ideal for processing invoices and receipts.
  • Integrates with other AWS services through APIs, enabling developers to automate document analysis in their applications.
  • Follows a pay-as-you-go pricing model, making it cost-effective for businesses of all sizes.

One of the most attractive features of AWS Textract is the ability to run queries, which lets users retrieve specific information from a document. The user asks a natural language question, typically starting with “What is”, “Where is”, or “Who is”, and Textract returns the answer it finds in the document. This makes it possible to analyze many types of documents in whatever context the user needs.

The following example shows the interaction in the AWS console using the Textract service to analyze documents with queries:

Example using Textract on the AWS console with queries capability.
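The same kind of query can also be sent from code. The following is a minimal sketch, assuming the boto3 `analyze_document` call with the `QUERIES` feature type; the helper names (`build_query_request`, `extract_answers`, `ask_textract`) are hypothetical and not part of the example repository. It builds the request and then pairs each question with its answer by following the `ANSWER` relationship between `QUERY` and `QUERY_RESULT` blocks in the response.

```python
from __future__ import annotations

from typing import Any


def build_query_request(image_bytes: bytes, questions: list[str]) -> dict[str, Any]:
    """Build the keyword arguments for analyze_document with the QUERIES feature."""
    return {
        "Document": {"Bytes": image_bytes},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": question} for question in questions]},
    }


def extract_answers(response: dict[str, Any]) -> dict[str, str]:
    """Map each question to its answer text using QUERY -> QUERY_RESULT links."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}
    answers: dict[str, str] = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        question = block["Query"]["Text"]
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for answer_id in relationship["Ids"]:
                result = blocks.get(answer_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    answers[question] = result.get("Text", "")
    return answers


def ask_textract(image_bytes: bytes, questions: list[str]) -> dict[str, str]:
    """Send the questions to Textract (requires AWS credentials and boto3)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    client = boto3.client("textract")
    response = client.analyze_document(**build_query_request(image_bytes, questions))
    return extract_answers(response)
```

Calling `ask_textract(image_bytes, ["What is the total?"])` would return a dictionary keyed by question. Questions that Textract cannot answer are simply absent from the result.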

AWS Textract supports two ways to run document analysis: a sync process and an async process. Each has its own benefits and use cases, so it is important to understand both before starting an implementation that involves Textract.

In the sync process, Textract detects and analyzes text in single-page documents provided in JPEG, PNG, PDF, or TIFF format. The operations are synchronous and return results in near real time.
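As a rough illustration of a synchronous call, the sketch below uses the boto3 `detect_document_text` method and keeps only the `LINE` blocks from the response. The function names and the local file path argument are assumptions made for this example.

```python
from __future__ import annotations

from typing import Any


def extract_lines(response: dict[str, Any]) -> list[str]:
    """Return the text of every LINE block, in the order Textract reports them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def detect_text_sync(image_path: str) -> list[str]:
    """Read a local single-page image and return its detected lines of text."""
    import boto3  # imported lazily so extract_lines can be used without AWS access

    client = boto3.client("textract")
    with open(image_path, "rb") as image_file:
        # The call blocks until Textract returns the result.
        response = client.detect_document_text(Document={"Bytes": image_file.read()})
    return extract_lines(response)
```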

In the async process, on the other hand, Textract detects and analyzes text in multipage documents in PDF or TIFF format, including invoices and receipts. Because the job runs in the background, this mode suits large, multipage documents. The input document is read from an S3 bucket, and when the job finishes Textract publishes the completion status to an SNS topic. The results are then retrieved with the corresponding Get operation and can optionally be written to an S3 bucket.
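For the asynchronous flow, a job is started against a document stored in S3 and the results are fetched later, page by page. The following sketch assumes the boto3 `start_expense_analysis` and `get_expense_analysis` methods; the bucket, key, and ARN values are placeholders you would supply, and the helper names are hypothetical.

```python
from __future__ import annotations

from typing import Any


def build_start_request(bucket: str, key: str, sns_topic_arn: str, role_arn: str) -> dict[str, Any]:
    """Build the keyword arguments for start_expense_analysis."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract publishes the job completion status to this SNS topic.
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    }


def collect_expense_documents(client: Any, job_id: str) -> list[dict[str, Any]]:
    """Gather ExpenseDocuments from every result page of a finished job."""
    documents: list[dict[str, Any]] = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_expense_analysis(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"Job {job_id} is not ready: {page['JobStatus']}")
        documents.extend(page.get("ExpenseDocuments", []))
        next_token = page.get("NextToken")
        if not next_token:
            return documents


def start_job(bucket: str, key: str, sns_topic_arn: str, role_arn: str) -> str:
    """Start the asynchronous job and return its JobId (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    client = boto3.client("textract")
    response = client.start_expense_analysis(
        **build_start_request(bucket, key, sns_topic_arn, role_arn)
    )
    return response["JobId"]
```

In practice, `collect_expense_documents` would be called after the SNS notification arrives, for example from a Lambda function subscribed to the topic.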

Regarding costs, Amazon Textract has five different APIs: Detect Document Text API, Analyze Document API, Analyze Expense API, Analyze ID API, and Analyze Lending API. Each of them has its own pay-as-you-go pricing.

AWS Textract simplifies document analysis by leveraging OCR and machine learning, providing a versatile and cost-effective solution for automating data extraction from various types of documents.

Extracting Expense Text Information From An Image

We will start by explaining how to use AWS Textract to obtain information from expense images. On the front end, we have a single-page application (SPA) that allows the user to upload the expense image. The web application sends the uploaded image to the backend via an HTTP request. Then, a Lambda function interacts with the AWS Textract service to obtain and return the information. Let’s see the diagram below:

AWS services diagram.

Solution Overview

The diagram above describes a serverless solution for obtaining the text data from expense documents. Users interact with the presentation layer of the solution through a basic web application, and each request goes through these steps:

  1. The user uploads the expense document through the web application’s user interface.
  2. The web application sends an HTTP request to the API Gateway endpoint with the base64-encoded data of the expense image.
  3. API Gateway forwards the request to the Lambda function that contains all the logic to interact with the AWS Textract service.
  4. The Lambda function, using the boto3 SDK, calls the analyze_expense method with the required parameters.
  5. Textract identifies the expense fields in the document and returns them to the Lambda function.
  6. The Lambda function builds the response and sends it to the frontend layer, which displays the fields identified by Textract.
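Steps 3 to 6 can be sketched as a small Lambda handler. This is not the code from the example repository; it is a minimal sketch that assumes an API Gateway proxy integration and a JSON request body with a hypothetical `image` field holding the base64-encoded file. The `analyze_expense` call and the `ExpenseDocuments` / `SummaryFields` response keys come from boto3, while `summarize_expense` and the optional `client` parameter are assumptions added to keep the logic easy to test.

```python
from __future__ import annotations

import base64
import json
from typing import Any


def summarize_expense(response: dict[str, Any]) -> list[dict[str, str]]:
    """Flatten the summary fields of every expense document into simple records."""
    fields = []
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            fields.append({
                "type": field.get("Type", {}).get("Text", ""),
                "label": field.get("LabelDetection", {}).get("Text", ""),
                "value": field.get("ValueDetection", {}).get("Text", ""),
            })
    return fields


def lambda_handler(event: dict[str, Any], context: Any, client: Any = None) -> dict[str, Any]:
    """Decode the uploaded image, call Textract, and return the detected fields."""
    if client is None:
        import boto3  # available by default in the AWS Lambda Python runtime

        client = boto3.client("textract")

    body = json.loads(event.get("body") or "{}")
    image_b64 = body.get("image")  # hypothetical field name for this sketch
    if not image_b64:
        return {"statusCode": 400, "body": json.dumps({"error": "missing 'image' field"})}

    response = client.analyze_expense(Document={"Bytes": base64.b64decode(image_b64)})
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"fields": summarize_expense(response)}),
    }
```

Passing the client as an optional argument is a design choice for this sketch: Lambda itself only supplies `event` and `context`, so the default path creates the real Textract client, while tests can inject a stub.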

Setting Up The Environment

I used the AWS SAM CLI with a SAM template to deploy the services shown in the solution diagram above. Let’s follow these steps:

  1. Download the example repository we will be working with.
  2. From the root directory of the repository, execute the command that deploys the AWS services:

  3. After a few seconds, we’ll see output like this:

Output after applying the stack.

The command above deploys the AWS infrastructure in the AWS account associated with your local AWS CLI profile. If you need information about how to configure CLI profiles, please follow the credentials file configuration documentation.

Test The AWS Textract Solution

Once the infrastructure is deployed, copy the value of ApiGatewayEndpoint from the stack output. Use that URL on the following web page, awstextract.awslearn.cloud:

Web page to test the solution.

The image above shows how to obtain the label information from a specific expense image. To do this, paste the API Gateway URL into the “API Gateway URL” input, then upload the image (only PNG and JPEG formats are supported). Finally, click the “Extract information” button. After a few seconds, the information detected by the AWS Textract service appears on the right side of the screen.
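If you prefer to test the endpoint without the web page, a small script can send the same kind of request. The sketch below is an assumption about the request shape: it posts a JSON body with a hypothetical `image` field containing the base64-encoded file, so adjust the field name to whatever the deployed Lambda function expects. Only the Python standard library is used.

```python
import base64
import json
import urllib.request


def build_request(api_url: str, image_bytes: bytes) -> urllib.request.Request:
    """Build a POST request carrying the image as base64 inside a JSON body."""
    payload = json.dumps(
        {"image": base64.b64encode(image_bytes).decode("ascii")}  # hypothetical field name
    ).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_information(api_url: str, image_path: str) -> dict:
    """Upload a local PNG or JPEG to the API Gateway endpoint and return the JSON reply."""
    with open(image_path, "rb") as image_file:
        request = build_request(api_url, image_file.read())
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Here `api_url` is the ApiGatewayEndpoint value from the stack output, and `image_path` points to a local expense image.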

Conclusions

In summary, AWS Textract is a practical tool for automating the extraction of information from documents. Its integration with the AWS ecosystem and its pay-as-you-go pricing make it a good fit for businesses of all sizes, and features such as OCR, form and table recognition, queries, and expense analysis cover a wide range of use cases. As we saw, a small serverless solution built with API Gateway, Lambda, and the analyze_expense method is enough to turn an expense image into structured fields that an application can use. As organizations embrace digital transformation, Textract offers a scalable way to streamline document processing and unlock actionable insights from unstructured data.

Thanks for reading; I hope this post has provided valuable insights and resources for extracting expense information with AWS Textract.
