Extract text and data from any document using Amazon Textract in Node.js

Jan 1, 2020

Amazon Textract is a service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

A common use case is extracting data from documents and forms. Thanks to machine learning, this service can eliminate the need for manual data entry, or for the error-prone approach of hard-coded rules for reading a form.

A real world scenario

Our goal is to read an image of a form and extract all of its text in a meaningful way. Here’s an example of a form:

Input form image:

An ideal output would be:

{
   "Name": "Mark",
   "Surname": "Saulman",
   "Gender": "MALE",
   "Birth City": "Varese",
   "Birth Country": "Italy",
   "Birth Date": "01-10-1991",
   "Passport Date": "30-12-2019",
   "Passport Expiry Date": "01-10-2022",
   "Passport Issued by": "Rocket Moon",
   "Mobile": "39 346 0000000",
   "Mail": "mark@example.com",
   "Signature": "",
   "Airport": "MXP",
   "Arriva Date": "01-01-2020",
   "Insurance Expiry Date": "02-02-2020",
   "Hotel": "Central Hotel"
}

With traditional OCR solutions, keys and values are extracted as plain text, which makes them tricky to map with hard-coded rules, and those rules have to be maintained every time the form changes.

Luckily, Amazon Textract provides a Form Extraction feature: it automatically detects key-value pairs in an image document (even PDFs), so you retain the inherent context of the document without any manual intervention.
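To make this a bit more concrete, here is a minimal sketch of what a form extraction request looks like. The parameter names (Document, Bytes, FeatureTypes) follow Textract's AnalyzeDocument operation as exposed by the AWS SDK for JavaScript (v2); the helper name buildAnalyzeParams is made up for this illustration.

```javascript
// Hypothetical helper: builds the parameter object for Textract's
// AnalyzeDocument operation (AWS SDK for JavaScript v2 naming).
function buildAnalyzeParams(buffer, features = ["FORMS"]) {
  if (!Buffer.isBuffer(buffer) || buffer.length === 0) {
    throw new TypeError("Expected a non-empty Buffer with the document bytes");
  }
  return {
    Document: { Bytes: buffer }, // raw content of the image or PDF file
    FeatureTypes: features,      // "FORMS" asks for key-value pairs, "TABLES" for tables
  };
}

// Placeholder bytes, only to show the resulting shape.
const params = buildAnalyzeParams(Buffer.from("fake image bytes"));
console.log(params.FeatureTypes); // [ 'FORMS' ]
```

Asking only for "FORMS" keeps the response focused on key-value pairs, which is all we need for this article.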

An Example

1- Command line app

Let’s create a small command line app that scans a file and outputs a JSON result.

$ mkdir textract-lab
$ cd textract-lab && yarn init
$ touch index.js
$ yarn add commander aws-sdk lodash

We have created a directory, initialized our Node.js app, created an entry file index.js, and added our command line helper commander, the Amazon Web Services SDK (aws-sdk), and lodash.

Next open up index.js and paste the following code:

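// On commander v5 and later, the shared program object is a named export.
const { program } = require("commander");

program.version("0.0.1").description("Textract Lab");

program
  .command("scan <filePath>")
  .alias("s")
  .description("scans a file")
  .action(filePath => {
    // For now, just echo the path we received.
    console.log(filePath);
  });

program.parse(process.argv);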

We have added a command that accepts one argument: the filePath of the file we want to scan. For now it just logs the filePath we passed. Try executing it:

$ node index.js scan helloworld

It should log helloworld.

2- Setting up your Amazon account

You’ll need to set up an AWS account and create an IAM user; follow the official AWS Textract guide. Make sure the IAM user is configured to use Textract.

3- AWS Config

Create a config file that contains your AWS configuration:

$ touch config.js

Open config.js and paste the following (insert your keys and the region):

module.exports = {
  awsAccesskeyID: "",
  awsSecretAccessKey: "",
  awsRegion: ""
};
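How the scanner consumes this file isn't shown in this article, so treat the following as one possible way to wire it up rather than what the example project does. The sketch maps the config.js fields to the option names used by the AWS SDK for JavaScript (v2), falls back to the standard AWS environment variables when a field is left empty, and fails early if something is still missing. The helper name toAwsOptions is hypothetical.

```javascript
// Hypothetical helper: converts the config.js shape from this article into
// AWS SDK v2 option names, with environment variables as a fallback.
function toAwsOptions(config, env = process.env) {
  const options = {
    accessKeyId: config.awsAccesskeyID || env.AWS_ACCESS_KEY_ID,
    secretAccessKey: config.awsSecretAccessKey || env.AWS_SECRET_ACCESS_KEY,
    region: config.awsRegion || env.AWS_REGION,
  };
  // Report every missing setting at once instead of failing on the first one.
  const missing = Object.keys(options).filter((key) => !options[key]);
  if (missing.length > 0) {
    throw new Error(`Missing AWS settings: ${missing.join(", ")}`);
  }
  return options;
}

// Example values only; the second argument stands in for process.env.
const options = toAwsOptions(
  { awsAccesskeyID: "", awsSecretAccessKey: "", awsRegion: "eu-west-1" },
  { AWS_ACCESS_KEY_ID: "example-id", AWS_SECRET_ACCESS_KEY: "example-secret" }
);
console.log(options.region); // eu-west-1
```

The resulting object could then be passed to AWS.config.update(...) before creating the Textract client. Falling back to environment variables also makes it easier to keep real keys out of the repository.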

4- Textract Scanner

I have created a helpful scanner utility that accepts a buffer and returns formatted JSON with FORM EXTRACTION capability.

textractUtils.js

As you can see at the end of the file, we export our main function, which accepts a buffer. The rest of the functions are key-value pair extractors and formatters; for more details, see Amazon's guidelines.

Copy textractUtils.js into the root of the project.
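To give an idea of what the key-value extractors do, here is a minimal sketch. It is not the textractUtils.js from the example project, and the function names are placeholders. It assumes the documented response shape for form extraction: KEY_VALUE_SET blocks whose EntityTypes mark them as KEY or VALUE, a VALUE relationship linking a key to its value, and CHILD relationships pointing at the WORD (or SELECTION_ELEMENT) blocks that hold the text.

```javascript
// Sketch only: pairs KEY and VALUE blocks from the Blocks array of an
// AnalyzeDocument response requested with FeatureTypes: ["FORMS"].

// Collect the ids a block references for one relationship type.
function relatedIds(block, type) {
  return (block.Relationships || [])
    .filter((rel) => rel.Type === type)
    .flatMap((rel) => rel.Ids);
}

// Join the text of a block's CHILD blocks (words and checkboxes).
function textOf(block, blocksById) {
  return relatedIds(block, "CHILD")
    .map((id) => blocksById.get(id))
    .filter(Boolean)
    .map((child) => {
      if (child.BlockType === "WORD") return child.Text;
      if (child.BlockType === "SELECTION_ELEMENT") {
        return child.SelectionStatus === "SELECTED" ? "X" : "";
      }
      return "";
    })
    .filter((text) => text !== "")
    .join(" ");
}

// Build a { key: value } object from the response blocks.
function keyValuePairs(blocks) {
  const blocksById = new Map(blocks.map((block) => [block.Id, block]));
  const result = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes || []).includes("KEY");
    if (!isKey) continue;
    const value = relatedIds(block, "VALUE")
      .map((id) => blocksById.get(id))
      .filter(Boolean)
      .map((valueBlock) => textOf(valueBlock, blocksById))
      .join(" ");
    result[textOf(block, blocksById)] = value;
  }
  return result;
}

// Hand-written blocks in the response shape (not real service output).
const sampleBlocks = [
  { Id: "k1", BlockType: "KEY_VALUE_SET", EntityTypes: ["KEY"],
    Relationships: [{ Type: "VALUE", Ids: ["v1"] }, { Type: "CHILD", Ids: ["w1", "w2"] }] },
  { Id: "v1", BlockType: "KEY_VALUE_SET", EntityTypes: ["VALUE"],
    Relationships: [{ Type: "CHILD", Ids: ["w3"] }] },
  { Id: "w1", BlockType: "WORD", Text: "Birth" },
  { Id: "w2", BlockType: "WORD", Text: "City" },
  { Id: "w3", BlockType: "WORD", Text: "Varese" },
];

console.log(keyValuePairs(sampleBlocks)); // { 'Birth City': 'Varese' }
```

In a real run, the blocks array would be the Blocks field of the response returned by the SDK's analyzeDocument call. An empty value block (such as an unsigned Signature field) simply yields an empty string.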

Next edit index.js as follows:

// On commander v5 and later, the shared program object is a named export.
const { program } = require("commander");
const fs = require("fs");
const textractScan = require("./textractUtils");

program.version("0.0.1").description("Textract Lab");

program
  .command("scan <filePath>")
  .alias("s")
  .description("scans a file")
  .action(async filePath => {
    // Read the whole file into a buffer and hand it to the scanner.
    const data = fs.readFileSync(filePath);
    const results = await textractScan(data);
    console.log(results);
  });

program.parse(process.argv);

5- Testing

Execute the following, passing an image of a form or any other document (a PDF or an image):

$ node index.js scan /path-to-your-file/form-example.png
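Since the command accepts any path, it can help to check what kind of file was passed before sending it off. The following is a small optional sketch that guesses the type from the file's leading bytes (the well-known PNG, JPEG and PDF signatures). The helper name detectFileType is hypothetical, and which formats Textract accepts for a given call is something to verify against the current service documentation.

```javascript
// Hypothetical pre-flight check: guess the file type from its leading bytes
// so an unexpected file fails fast, before any request is sent.
function detectFileType(buffer) {
  const startsWith = (bytes) =>
    buffer.length >= bytes.length &&
    bytes.every((byte, index) => buffer[index] === byte);

  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (startsWith([0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "pdf"; // "%PDF"
  return "unknown";
}

console.log(detectFileType(Buffer.from("%PDF-1.7"))); // pdf
console.log(detectFileType(Buffer.from("hello")));    // unknown
```

In the scan command, this could run right after fs.readFileSync(filePath), printing a friendly message instead of waiting for the service to reject the file.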

You should see something like this in the console (if you passed the example image I shared at the beginning of the article):

{
  'Passport Issued by': 'Rocket Moon',
  'Birth City': 'Varese',
  Gender: 'MALE',
  'Passport Expiry Date': '01-10-2022',
  'Birth Country': 'Italy',
  'Passport Date': '30-12-2019',
  'Arriva Date': '01-01-2020',
  'Birth Date': '01-10-1991',
  Surname: 'Saulman',
  Airport: 'MXP',
  'Insurance Expiry Date': '02-02-2020',
  Mobile: '39 346 0000000',
  Signature: '',
  Mail: 'mark@example.com',
  Name: 'Mark',
  Hotel: 'Central Hotel'
}
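Note that console.log prints the object in Node's inspection format (single quotes, unquoted keys), which is not strict JSON. If you want output closer to the ideal JSON shown at the beginning, a small formatter like the sketch below could be used instead. The name toJson is made up, and sorting the keys is just a choice to keep the output stable between runs.

```javascript
// Serialize the scan result as strict JSON with keys in alphabetical order.
function toJson(results) {
  const sorted = {};
  for (const key of Object.keys(results).sort()) {
    sorted[key] = results[key];
  }
  return JSON.stringify(sorted, null, 2);
}

// A few fields taken from the example output above.
const results = { Surname: "Saulman", Name: "Mark", Airport: "MXP" };
console.log(toJson(results));
```

In index.js, this would mean replacing console.log(results) with console.log(toJson(results)), which also makes it easy to redirect the output into a .json file.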

Conclusion

Amazon Textract is based on the same proven, highly scalable, deep-learning technology that was developed by Amazon’s computer vision scientists to analyze billions of images and videos daily. You don’t need any machine learning expertise to use it. Amazon Textract includes simple, easy-to-use APIs that can analyze image files and PDF files. Amazon Textract is always learning from new data, and makes it much easier to extract form information out of an image.

You can find the example project on GitHub.