Write PDF data to a relational database using Amazon Textract

Published in

The Startup

6 min readDec 23, 2019

Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract also identifies the contents of fields in forms and information stored in tables. In this demo project I’ll use Textract to read data and identify tables in PDF documents. These PDF documents represent invoices. The data will be written to an RDS Postgres database. It will make it possible to create a relational database with invoice data from PDF’s. This will give us the opportunity to query the data and collect useful information about the invoices.

Below you can find a high level representation of our setup. We will create an RDS Postgres DB instance. We will use a CloudFormation template and an SQL script for this. Next we will deploy the rest of the stack using a SAM template.

Prerequisites:

AWS account
SAM CLI + AWS CLI + PSQL CLI
S3 Bucket (SAM will deploy the package to this bucket)
Git
Docker (optional)

All templates and lambda code is stored in this Git repository. Clone this repository on your local machine.

The first step is to create an RDS database. In this example I will create a Postgres datbase. I can deploy it by creating a CloudFormation stack using the db/postgres.yml template from the repository. It will create a small RDS instance and open port 5432.

$ aws cloudformation create-stack --stack-name db --template-body file://db/postgres.yml --parameters ParameterKey=Username,ParameterValue=lvthillo ParameterKey=Password,ParameterValue=notsupersecret

This takes a few minutes. Check the outputs section in the AWS CloudFormation console to get the connection URL. Now we have created the InvoiceDB database but we still have to create the invoices table.

We will create this table and also create an additional demouser which uses IAM as authentication method. This user will be used by our ‘’Write”- lambda to make a secure connection to RDS.

CREATE USER  demouser WITH LOGIN; 
GRANT rds_iam TO demouser;
CREATE TABLE invoices (
    DATE date,
    DESCRIPTION varchar(255),
    RATE decimal(4,2),
    HOURS decimal(3,2),
    AMOUNT decimal(5,2)
);
GRANT ALL PRIVILEGES ON TABLE invoices TO demouser;

We can use the db/setup.sql script from the Git repository. Use the connection URL which was outputted by the CloudFormation stack.

$ psql -h dddhg3vj3xdeo8.ccayipyfaeny.eu-west-1.rds.amazonaws.com -p 5432 -U lvthillo -d InvoiceDB  -W -a -f db/setup.sql

We have created an invoices table in our RDS InvoiceDB. The table is new and empty. We can use pgAdmin to visualize our changes. You can skip this step if you have a DB visualizer installed locally. I don’t have it so prefer to start a Docker container of pgAdmin.

$ docker run -d -e PGADMIN_DEFAULT_EMAIL=my@mail.com -e PGADMIN_DEFAULT_PASSWORD=mypasswd -p 80:80 dpage/pgadmin4:4

Open localhost and connect to our PostgresDB.

We will use SAM To deploy our Textract stack but let’s first go into a bit more detail about what resources and what relationships will be created.

We will create an S3 bucket in which we will upload our PDF invoices. Every upload will trigger the Read lambda. This lambda starts the asynchronous analysis of an input document, in our case a PDF document from S3. The async operation allows us to process a multipage document.

The PDF data is available here. The invoice contains data about what work is done and when it’s done. Also the rate and total amount is added in the PDF.

It’s also important to note we use a multipage PDF

When text analysis is finished, Amazon Textract publishes a completion status to the Amazon Simple Notification Service (Amazon SNS) topic that you specify in NotificationChannel.

Here is some relevant code snippet from the Read lambda where we start the document analysis based on the document location. The result is published to SNS.

The Write lambda is subscribed to the SNS topic. The Write lambda checks if the pdfTextExtractionStatus is SUCCEEDED. If so, then the lambda will get the job results from Textract. This Id was published to SNS. In the end the lambda will write the data to the RDS Postgres instance. The lambda will use IAM role-based authentication to connect to RDS. We don’t have to store and maintain credentials for database authentication.

To deploy the stack described above (except the RDS instance which is already deployed):

$ sam build# Change the bucket to your personal bucket
$ sam deploy --template-file .aws-sam/build/template.yaml --parameter-overrides DemoUser=demouser TableName=invoices --stack-name textract --capabilities CAPABILITY_IAM --s3-bucket demo-lvthillo-bucket

Check if the stack deploys well. After a succesful deployment we can perform an initial test. Go to the S3 console and upload the invoice_january.pdf.

After the upload the Read lambda will be triggered and will publish the results to SNS. The Write lambda will pick up these results and write them to RDS. We can check if our lambdas ran correctly:

The results are published to SNS. Let’s verify if our Write lambda subscribes to the SNS topic.

The Write lambda subscribes to the SNS topic

The logs from the Write lambda confirm that the Textract analysis ran successfully and that our data was written to the RDS database.

At last we can retry to query our data in our pgAdmin console. I just query the first 10 rows from the invoices table.

We can write more useful queries now. We can for example query how many hours we worked on CI/CD this month.

Or we can query how much money we made by working on Python Bug Fixing or Java Development.

Optionally you can restrict it for January too (although there is no invoices PDF for February in this demo)

That seems to work! Now complete these steps to clean up the environment:

Empty the textract-stack-upload-bucket (and optionally your SAM bucket).
Delete the CloudFormation stack ‘textract’ first.
Delete the CloudFormation stack ‘db’.

We started this demo by creating an RDS database and after that we created a table using an SQL script. We also created a user and used IAM as authentication method. Next we used SAM to deploy a bucket and two lambdas and an SNS topic. The first lambda functions as reader (reads PDF data from S3) and the second lambda functions as writer (writes PDF data to RDS). It’s also important to remember our data was written from a multipage PDF. We used the asynchronous Textract API to process the documents.

I hope you enjoyed this article! Feel free to ask questions in the comments.

Write PDF data to a relational database using Amazon Textract

Written by Lorenz Vanthillo