Automating Document Analysis: A Deep Dive into AWS Textract and Data Pipeline

Imran Sayyed
Ankercloud Engineering
7 min read · Dec 15, 2023

Welcome aboard! Today, I’m excited to take you on a journey exploring the incredible capabilities of AWS Textract. Imagine a scenario where handling documents and images goes from a chore to a breeze. In this post, we’ll walk through a real-world use case, sharing not just the technicalities, but the ‘aha’ moments that make AWS Textract truly special. So, grab your virtual explorer hat, and let’s uncover the wonders of turning unstructured data into meaningful stories. Ready for an adventure? Let’s dive into the magic of Textract!

Behind the Curtain: Decoding Each Act

As we embark on this technical expedition, consider this your backstage pass. We’ll dissect the Lambda functions, explore the nuances of Glue jobs, and witness the magic of Redshift and QuickSight. But it’s not just about the code; it’s about understanding the why and how behind each decision made on this journey.

Let’s rewind to the initial phases of our project — where the exploration of AWS Textract APIs for receipt processing set the stage.

1. Exploring AWS Textract APIs for Receipt Processing

1.1 API Evaluation:

In the project’s initiation, we meticulously evaluated various AWS Textract APIs to determine their suitability for extracting data from receipts.

1.2 AnalyzeDocument API:

Our initial choice was the “AnalyzeDocument” API, providing results in distinct sections, including raw data, forms, and tables.

1.3 Challenges Encountered:

However, challenges surfaced during this exploration:

  • Extraction hurdles with Key-Value Pairs.
  • Complexity in mapping extracted data due to a lack of predefined order (the sketch below shows why).
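To make that mapping challenge concrete, here is a minimal sketch (not our production code) of what pulling key-value pairs out of an AnalyzeDocument response involves: every KEY_VALUE_SET block references its text only through chains of relationship IDs, which you have to resolve by hand.

import boto3

textract = boto3.client("textract")

def get_text(block, blocks_by_id):
    # Assemble a block's text from its CHILD word blocks.
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

def extract_key_values(image_bytes):
    response = textract.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["FORMS"],
    )
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        # Keys and values are both KEY_VALUE_SET blocks, told apart by EntityTypes.
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for value_id in rel["Ids"]:
                        pairs[get_text(block, blocks_by_id)] = get_text(
                            blocks_by_id[value_id], blocks_by_id)
    return pairs

Even after all that traversal, nothing tells you whether a given key is the receipt total or, say, a phone number. That is the mapping problem described above.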

2. Exploring AnalyzeExpense API

2.1 AnalyzeExpense API:

In response to challenges faced, we pivoted towards exploring the “AnalyzeExpense” API provided by AWS Textract.

2.2 Improved Results:

This alternative proved advantageous, showcasing:

  • Automatic identification of key information, with each field tagged by a standardized type.
  • Enhanced data extraction, simplifying the process with more structured and meaningful results (a trimmed response is sketched below).
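For contrast, here is a hand-trimmed illustration of the AnalyzeExpense response shape. The field names follow the Textract documentation; the receipt values are invented for the example.

response_excerpt = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Fresh Mart"}},
            {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "42.50"}},
        ],
        "LineItemGroups": [{
            "LineItems": [{
                "LineItemExpenseFields": [
                    {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Milk 1L"}},
                    {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "3.50"}},
                ],
            }],
        }],
    }],
}

Those standardized Type values (TOTAL, VENDOR_NAME, and so on) are exactly what removes the mapping guesswork we hit with AnalyzeDocument.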

3. Collaborative Decision-Making

3.1 API Selection:

After careful evaluation, a joint decision emerged — to proceed with the “AnalyzeExpense” API for our receipt processing solution.

By opting for the “AnalyzeExpense” API, we ensured heightened accuracy and efficiency in extracting vital information. This aligns seamlessly with our application’s specific requirements, yielding more precise and meaningful data extraction from receipts. Our journey with AWS Textract takes a deliberate turn here, enhancing the overall effectiveness of our receipt processing workflow. Ready to see this decision’s impact as we delve further into the magic of Textract? Let’s continue our exploration!

Lambda Functions: The Magicians of Automation

Our Lambda functions are not just lines of code; they are the conductors orchestrating the symphony of data. Join us as we peer into their logic, understanding how they seamlessly move data between S3 and Textract and later refine the treasure that Textract unveils.

Glue Jobs: Crafting Order from Chaos

Meet the Glue jobs, the transformers of our story. Each job plays a pivotal role — cleaning, extracting, and shaping the data. We’ll unfold the scripts and configurations, sharing insights that go beyond the syntax, explaining why each step matters in the grand scheme of data transformation.

Redshift: Where Tables Tell Tales

Explore the inner workings of Redshift as we delve into the tables born from the Glue jobs. What makes Redshift the ideal host for our structured data, and how does it pave the way for a QuickSight spectacle? The answers lie within the walls of Redshift’s clustered columns.

QuickSight: Crafting Narratives with Data

In the final act, QuickSight emerges as the storyteller, turning raw data into actionable insights. We’ll reveal the steps in creating those ten KPIs, the design choices behind our dashboard, and how QuickSight empowers you to explore the narrative painted by your data.

Your Interactive Guide

This isn’t a monologue; it’s a conversation. Throughout this journey, feel free to pause, ask questions, and experiment with your own data. The code snippets provided are not just lines to copy; they’re invitations to understand and tweak according to your needs.

Ready to make AWS Textract a companion in your data adventures? Let’s roll up our sleeves and dive into the details, unraveling the layers of this captivating data odyssey.

1: S3 → Lambda1

When a photo or PDF is uploaded to an Amazon S3 bucket, a Lambda function is automatically activated. This function’s job is to handle the input: if the upload is a PDF with multiple pages, it converts each page into a separate image.
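Here is a minimal sketch of that trigger, assuming the pdf2image library (which needs poppler available, for example through a Lambda layer). The bucket name and key layout are placeholders, not our actual configuration.

import io
import boto3
from pdf2image import convert_from_bytes

s3 = boto3.client("s3")
OUTPUT_BUCKET = "receipts-images"  # hypothetical bucket name

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    if key.lower().endswith(".pdf"):
        # One image per PDF page.
        for i, page in enumerate(convert_from_bytes(body)):
            buf = io.BytesIO()
            page.save(buf, format="PNG")
            s3.put_object(Bucket=OUTPUT_BUCKET,
                          Key=f"{key}/page-{i + 1}.png",
                          Body=buf.getvalue())
    else:
        # Images pass through unchanged.
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=key, Body=body)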

2–3: Lambda1 → Textract → S3

The Lambda function handles various inputs like images, single-page PDFs, and multi-page PDFs by converting them into images. These images are then sent to Amazon Textract for expense analysis. The Lambda function, in turn, processes Textract’s response, separating key-value pairs and line items. The extracted information is then stored in JSON files, which are subsequently stored in another Amazon S3 bucket.

You can find the code here → Lambda1
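If you just want the gist without opening the repo, here is a condensed, hypothetical sketch of this step; the real logic lives in the linked Lambda1. The results bucket name is a placeholder, and the JSON layout simply mirrors the key-value/line-item split described above.

import json
import boto3

textract = boto3.client("textract")
s3 = boto3.client("s3")
RESULTS_BUCKET = "receipts-json"  # hypothetical bucket name

def analyze_image(bucket, key):
    resp = textract.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    key_values, line_items = [], []
    for doc in resp["ExpenseDocuments"]:
        # Summary fields become (type, key, value) records.
        for field in doc.get("SummaryFields", []):
            key_values.append({
                "type": field["Type"]["Text"],
                "key": field.get("LabelDetection", {}).get("Text", ""),
                "value": field.get("ValueDetection", {}).get("Text", ""),
            })
        # Line items become one dict per purchased item.
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                line_items.append({
                    f["Type"]["Text"]: f.get("ValueDetection", {}).get("Text", "")
                    for f in item.get("LineItemExpenseFields", [])
                })
    s3.put_object(Bucket=RESULTS_BUCKET, Key=f"{key}.json",
                  Body=json.dumps({"key_values": key_values,
                                   "line_items": line_items}))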

4–5: S3 → Lambda2 → S3

The data stored in the S3 bucket triggers another Lambda function. This second function is crucial for preprocessing the data before sending it to AWS Glue jobs. The Glue jobs, executed towards the end of the Lambda function, work with the preprocessed data. The final outcome of these Glue jobs is then sent back to another S3 bucket, ready for further processing.

You can find the code here → Lambda2
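The handoff at the end of Lambda2 can be as simple as the sketch below. “new_redshift_gluejob” is the line-item job named later in this post; the other job name and the argument are invented for the example.

import boto3

glue = boto3.client("glue")

def start_glue_jobs(processed_prefix):
    # "keyvalue_gluejob" is hypothetical; "new_redshift_gluejob" appears later in this post.
    for job_name in ("keyvalue_gluejob", "new_redshift_gluejob"):
        glue.start_job_run(
            JobName=job_name,
            Arguments={"--input_prefix": processed_prefix},  # hypothetical job argument
        )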

6–7: S3 → Glue Crawler → Glue Catalog

After the data is stored in the S3 bucket, a Glue Crawler is employed to create a Glue Catalog table from it. This cataloged table is then processed by the Glue jobs, and the final output is sent to another S3 bucket for further usage.
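If you run the crawler on demand rather than on a schedule, the call is a one-liner (the crawler name here is hypothetical):

import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="receipts-json-crawler")  # hypothetical crawler name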

8–9–10: Glue Job 1 (Key-Value) → Redshift & S3

Using this Glue job, we extract the key-value pairs and line items separately from the data. The key-value pairs are loaded into a Redshift table, and the line-item data is sent to another bucket for further processing.

First, we load the data from the single monolithic JSON file into a Glue DynamicFrame.

Once the key-value pairs and line items are separated, we explode the key-value array and pull the type, key, and value fields out of it.

The key and type are then compared, and only the data required for the KPIs is moved into the next DynamicFrame, with the values passed along as records.

These records are then cleaned and reshaped to fit our needs.

The key-value data is pushed into a Redshift table directly from the Glue job.

The line-item data, meanwhile, is separated into another DynamicFrame with the session ID as its primary key and pushed to another S3 bucket for further processing before it is eventually sent to Redshift as a table.

You can find the code here → GlueJob1
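As a companion to the linked script, here is a condensed sketch of the job. The catalog database, table, connection, and S3 paths are placeholders, and the KPI field list is illustrative.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the cataloged JSON produced by the crawler (hypothetical names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="receipts_db", table_name="receipts_json")
df = dyf.toDF()

# One row per (type, key, value) triple from the key_values array.
kv = (df.select("session_id", F.explode("key_values").alias("kv"))
        .select("session_id",
                F.col("kv.type").alias("type"),
                F.col("kv.key").alias("key"),
                F.col("kv.value").alias("value"))
        # Keep only the fields the KPIs need (illustrative list).
        .filter(F.col("type").isin("TOTAL", "TAX", "VENDOR_NAME",
                                   "INVOICE_RECEIPT_DATE")))

# Key-value rows go straight to Redshift through a Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(kv, glue_context, "kv"),
    catalog_connection="redshift-connection",  # hypothetical connection name
    connection_options={"dbtable": "key_values", "database": "dev"},
    redshift_tmp_dir="s3://glue-temp/kv/",     # hypothetical staging path
)

# Line items go to S3 as Parquet, keyed by session_id, for Glue Job 2.
df.select("session_id", "line_items").write.mode("overwrite") \
    .parquet("s3://receipts-lineitems/")       # hypothetical bucket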

11–12: S3 → Glue Job 2 (Lineitem) → Redshift

The script begins by importing necessary libraries and initializing AWS Glue and Spark contexts, along with an AWS Glue job.

It utilizes Pandas to load the Parquet data from an S3 bucket into a DataFrame, reshaping and exploding it to handle the lists nested within cells.

Data cleaning is then performed: commas are stripped from the ‘PRICE’ column, and NaN and non-numeric values in the ‘PRICE’ and ‘QUANTITY’ columns are replaced.

The processed data is then converted into a Spark DataFrame, which is sent to the Redshift cluster as a table.

You can find the code here → GlueJob2
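Again as a companion to the linked script, a condensed sketch. Paths, the connection name, and column names are placeholders, and reading Parquet from S3 with Pandas assumes s3fs/pyarrow are available in the job environment.

import pandas as pd
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Load the Parquet output of Glue Job 1 (hypothetical path).
df = pd.read_parquet("s3://receipts-lineitems/")

# Each line_items cell holds a list of item dicts; explode to one row per item.
df = df.explode("line_items").dropna(subset=["line_items"])
items = pd.json_normalize(df["line_items"])
items["session_id"] = df["session_id"].values

# Clean numeric columns: strip commas, coerce non-numeric values, fill NaN.
for col in ("PRICE", "QUANTITY"):
    items[col] = pd.to_numeric(
        items[col].astype(str).str.replace(",", "", regex=False),
        errors="coerce").fillna(0)

# Hand the cleaned frame to Spark and write it to Redshift.
spark_df = glue_context.spark_session.createDataFrame(items)
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(spark_df, glue_context, "lineitems"),
    catalog_connection="redshift-connection",  # hypothetical connection name
    connection_options={"dbtable": "lineitem", "database": "dev"},
    redshift_tmp_dir="s3://glue-temp/lineitem/",
)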

13: Redshift

The data from Glue Job 1 and Glue Job 2 is stored as tables in the Redshift cluster. Using the relevant columns needed for the KPIs, a VIEW is created, which serves as the DataSet for the dashboard.
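A minimal sketch of what such a view might look like, assuming the two tables share a session_id column; the names are illustrative, not our actual schema.

-- Hypothetical view joining the two tables for the dashboard.
CREATE VIEW receipt_kpis AS
SELECT kv.session_id,
       kv.type,
       kv.value,
       li.item,
       li.price,
       li.quantity
FROM key_values kv
JOIN lineitem li ON li.session_id = kv.session_id;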

Key value table

From the Glue job’s key and line-item extraction, we create our first Redshift table and load the data into it.

[Screenshots: the query to create the table in Redshift, and the table created in Redshift]
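Since the query itself is shown as an image, here is an illustrative stand-in; the schema is guessed, not copied from the screenshot.

-- Illustrative only; see the screenshot above for the real DDL.
CREATE TABLE key_values (
    session_id VARCHAR(64),
    type       VARCHAR(64),   -- standardized Textract type, e.g. TOTAL
    key        VARCHAR(256),  -- label text as printed on the receipt
    value      VARCHAR(256)
);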

Lineitem table

From the Glue job “new_redshift_gluejob”, we create a new table in Redshift, “lineitem”.

[Screenshots: the query to create the table in Redshift, and the table created in Redshift]
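As above, an illustrative stand-in for the query in the screenshot, with a guessed schema:

-- Illustrative only; see the screenshot above for the real DDL.
CREATE TABLE lineitem (
    session_id VARCHAR(64),
    item       VARCHAR(256),
    price      DECIMAL(12, 2),
    quantity   DECIMAL(12, 2)
);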

14: Quicksight

The “VIEW” from Redshift is used as the DataSet, and all 10 KPIs are built from it in QuickSight.
