Smart Data Extraction [DEM]

Data Science — Part 2

Jeremy Sapienza
Data Reply IT | DataTech
8 min read · Feb 22, 2023


With the growing amount of data contained in documents of different natures (financial, medical, compliance and other types of documents), there is a need to effectively store, digitize and organize the information contained therein. Indeed, the goal of this project is precisely to save hours of manual data entry and to reduce the human factor in digitizing bulk documents.

With a team of data engineers and data scientists, Data Reply IT has created a workflow that automates all the processes involved in digitization, returning a usable, cloud-stored and portable digital version of the document: DEM helps transform structured, semi-structured and unstructured data from a variety of document formats into usable information.

Architecture

The cloud platform used for this purpose is Amazon Web Services (AWS) offered by Amazon. AWS provides a mixture of infrastructure as a service (IaaS), platform as a service (PaaS), and packaged software as a service (SaaS).

The solution to our problem is shown below:

Cloud Workflow AWS — Figure 1

As seen in the architecture, different services are used to cover our scope, grouped by their usage in a particular context:

  • Archives like S3 Buckets and DynamoDB Table
  • Event-Driven services like AWS Lambda, AWS EventBridge, AWS SNS, and AWS SQS
  • Machine Learning services like AWS Textract and AWS Comprehend
  • Data Engineering services like AWS Step Function and AWS Glue
  • Query service tools like AWS Athena

The usage of this workflow follows a straightforward path that can be subdivided into a few ordered steps:

  1. The Ingestion of documents
  2. The Asynchronous calls to AWS Textract to extract document texts
  3. The Initialization of the AWS Step Function
  4. The Asynchronous call to AWS Comprehend to get the classification of a batch of documents
  5. The Asynchronous call to AWS Comprehend to get the Key entities or phrases of a batch of documents
  6. The ETL phases made up of specific ETL Jobs
  7. The Crawling step and the possibility to retrieve these data by AWS Athena

Each step uses a collection of AWS services. As already said, in this article we focus on describing and discussing the usage of the machine learning services introduced in these steps.

1. Data Extraction on un-labeled documents

Having ingested a pool of un-labeled documents, we want to extract their texts so that they can be classified in the next steps.

To extract these texts AWS Textract is used: a powerful tool that extracts texts, handwriting, and data from documents.

We can get the texts in about 30 seconds by linking the proper S3 Bucket, to which the ML service delivers the final packaged result. The result of this service is a series of JSON responses. In this case, we created different files where we saved the information, in particular:

  • Two files that concatenate the extracted texts line by line, in an ordered and in an unordered fashion
  • A file for having a representation of a table
  • A file for storing all of these JSON responses (Textract outputs)
  • A file for having a representation of a form
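
As a rough illustration, the sketch below shows how the two line-by-line text files could be assembled from the saved JSON responses; the function name, file names and sorting criterion are assumptions for the example, not the project's actual code.

```python
import json

# Minimal sketch (not the production lambda): given the saved Textract JSON
# responses, build the two line-by-line text files described above.
def build_text_files(textract_pages):
    lines = []
    for page in textract_pages:                        # one JSON response per result page
        for block in page.get("Blocks", []):
            if block["BlockType"] == "LINE":           # LINE blocks carry the extracted text
                box = block["Geometry"]["BoundingBox"]
                lines.append((block.get("Page", 1), box["Top"], box["Left"], block["Text"]))

    # "unordered": texts as returned by Textract
    with open("texts_unordered.txt", "w") as f:
        f.write("\n".join(t for _, _, _, t in lines))

    # "ordered": texts sorted by page and by position on the page
    with open("texts_ordered.txt", "w") as f:
        f.write("\n".join(t for *_, t in sorted(lines)))

# Example usage with the raw responses saved in the previous step:
# build_text_files(json.load(open("textract_output.json")))
```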

2. Data Classification

Starting the execution of the AWS Step Function, we have a pool of documents without any label attached to them. This is a critical issue, because it does not allow us to extract the proper information from each “unsupervised” document.

To solve this problem, an ML service called AWS Comprehend is involved: a natural language processing service that uses ML to discover insights from text. AWS Comprehend allows us to provide a custom dataset that is used to automatically train the model behind the service.

With particular settings behind AWS Comprehend, we can get the inferences in 5 to 6 minutes by linking the proper S3 Bucket, to which the ML service delivers the final packaged result. The result of this service is a tar.gz file containing, for each “unsupervised” document, the predicted label and its confidence score.
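
A minimal sketch of how that archive could be read is shown below; the predictions file layout (a JSON Lines file with a "Classes" list per document) follows the usual Comprehend output format, and the function and file names are assumptions, not the project's actual lambda.

```python
import json
import tarfile

# Minimal sketch: read the tar.gz produced by the classification job and keep,
# for each document, the most confident label. Multi-label classifiers use a
# "Labels" field instead of "Classes".
def read_predictions(archive_path="output.tar.gz"):
    labels = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile() or "predictions" not in member.name:
                continue
            for raw in tar.extractfile(member):
                record = json.loads(raw)
                best = max(record["Classes"], key=lambda c: c["Score"])
                labels[record["File"]] = (best["Name"], best["Score"])
    return labels  # e.g. {"doc_001.txt": ("CONTRACT", 0.97)}
```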

3. Data Extraction driven by data classification

When the documents get their predictions, it is possible to use AWS Comprehend again to extract key phrases or key entities from each document.

In our case:

  • if the document is not predicted as a “CONTRACT”, the key entities are extracted
  • if it is of “CONTRACT” type, then the key phrases are extracted

Each piece of extracted extra information is saved in the stageData bucket; a minimal sketch of this routing is shown below.
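
The sketch assumes boto3 and the asynchronous Comprehend detection jobs; the bucket paths, role ARN and job names are placeholder assumptions, not the real project values.

```python
import boto3

comprehend = boto3.client("comprehend")

# Minimal sketch of the routing logic described above.
def start_extraction_job(doc_prefix, predicted_label, role_arn):
    config = {
        "InputDataConfig": {"S3Uri": f"s3://stageData/{doc_prefix}/",
                            "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": f"s3://stageData/{doc_prefix}/comprehend-output/"},
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
    }
    if predicted_label == "CONTRACT":
        # contract documents: extract key phrases
        return comprehend.start_key_phrases_detection_job(
            JobName=f"{doc_prefix}-key-phrases", **config)
    # any other class: extract key entities
    return comprehend.start_entities_detection_job(
        JobName=f"{doc_prefix}-entities", **config)
```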

4. Data Modelling (ETL)

The amount of information gathered so far allows us to model the data pretty well, creating, for each kind of document treated, a dataset with the relevant information.

For this step, parquet files are created and, at the end, a crawler set up inside AWS Glue scans all the parquet files of each category/class of documents to build tables with the fields of interest.

To create these tables we need many parquet files. Each file contains a record with a fixed number of fields corresponding to the information retrieved inside a document. To make this possible, the files created in the previous stages are considered and weighted distance similarities are computed using the Levenshtein and Jaro distances.

With the matching information, we obtain distance scores between 0 and 1, which are filtered using a threshold. We set that threshold to 0.85 as a fine-tuned parameter. At the end of the workflow, the documents are considered MODELED.
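
As a rough illustration of this matching step, the sketch below combines a normalized Levenshtein similarity with the Jaro similarity and keeps only matches above the 0.85 threshold, then writes the matched record to a parquet file. The jellyfish library, the 0.5/0.5 weights and the field and file names are assumptions for the example, not the project's actual implementation.

```python
import jellyfish  # assumption: the actual jobs may use a different string-distance library
import pandas as pd

THRESHOLD = 0.85  # fine-tuned threshold mentioned above

def weighted_similarity(a, b, w_lev=0.5, w_jaro=0.5):
    """Weighted combination of normalized Levenshtein and Jaro similarities (0..1)."""
    lev = 1 - jellyfish.levenshtein_distance(a, b) / max(len(a), len(b), 1)
    jaro = jellyfish.jaro_similarity(a, b)
    return w_lev * lev + w_jaro * jaro

def match_fields(extracted_values, expected_fields):
    """Keep, for each expected field, the best-scoring extracted value above the threshold."""
    record = {}
    for field in expected_fields:
        scored = [(weighted_similarity(field.lower(), key.lower()), value)
                  for key, value in extracted_values.items()]
        score, value = max(scored, default=(0.0, None))
        record[field] = value if score >= THRESHOLD else None
    return record

# Illustrative usage: one matched record per document, appended to a parquet file.
record = match_fields({"Contract Number": "C-1234", "Customer Nme": "ACME"},
                      ["contract number", "customer name"])
pd.DataFrame([record]).to_parquet("contract_records.parquet", index=False)
```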

An overview of these ML services

In the following subchapters, we want to discuss in more depth the ML services involved in this project:

1. AWS Textract

AWS Textract Icon — Figure 2

Amazon Textract uses a model that applies OCR techniques to identify and extract data, including from forms and tables. In a few seconds it is possible to have the results, which are given in JSON format.

In this project, the recursive lambda “extractionInitializer” executes different calls to Textract to get, for each document, the data of interest. An example of an asynchronous call is represented below:
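
The following is a minimal sketch of what such a call could look like through boto3; the bucket, key, topic and role ARNs are placeholder assumptions, not the real project values.

```python
import boto3

textract = boto3.client("textract")

# Minimal sketch of an asynchronous Textract call as described below.
response = textract.start_document_analysis(
    DocumentLocation={
        "S3Object": {"Bucket": "dem-ingestion-bucket", "Name": "documents/contract_001.pdf"}
    },
    FeatureTypes=["TABLES", "FORMS"],          # also extract tables and forms, not only raw text
    NotificationChannel={                      # completion notice is published to SNS (then SQS)
        "SNSTopicArn": "arn:aws:sns:eu-west-1:123456789012:textract-completed",
        "RoleArn": "arn:aws:iam::123456789012:role/TextractSNSPublishRole",
    },
)

job_id = response["JobId"]                     # needed later to fetch the results
```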

As seen in this little script, we aim to start an async call to AWS Textract to get its responses over the texts inside each document, specified inside the “DocumentLocation” property, and in particular to extract texts from tables and forms. When AWS Textract is ready to give us the responses, a notification is published to an AWS service called Simple Notification Service (AWS SNS), which delivers it to a queue (AWS SQS).

The message refers to the response of AWS Textract, which corresponds to an output like the following:
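
A minimal sketch of how that response could be retrieved once the queued message signals completion is shown below; the function and variable names are illustrative assumptions.

```python
import boto3

textract = boto3.client("textract")

# Minimal sketch: page out the full JSON response (with its "Blocks" list)
# using the JobId carried by the SNS/SQS notification.
def fetch_textract_blocks(job_id):
    blocks, next_token = [], None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": 1000}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract.get_document_analysis(**kwargs)
        blocks.extend(page["Blocks"])          # PAGE, LINE, WORD, TABLE, CELL, KEY_VALUE_SET ...
        next_token = page.get("NextToken")
        if not next_token:
            return blocks
```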

The interesting field in this JSON response is the “Blocks” property, which gives all the information regarding the extracted texts.

Which model is used?

For commercial purposes the model used by this ML service cannot be shared.

Why is it used here?

AWS Textract is an important service for this workflow because, in a few seconds, it can provide the data of interest for the customers and for the next steps, representing each document with the corresponding data values that matter for business scopes.

Pro & Cons

Essentially, these are the main pros and cons of using Amazon Textract:

  • Pros:
    - Easy Setup with AWS services
    - Secure
  • Cons:
    - Inability to Extract Custom Fields
    - Integration with upstream and downstream providers
    - No Fraud Checks
    - No Mixture of Text Extraction (Vertical or Horizontal)
    - Language limit: only a few languages can be extracted

2. AWS Comprehend

AWS Comprehend Icon — Figure 3

Amazon Comprehend provides custom functionalities like entity recognition, classification, key phrase extraction, and sentiment analysis, along with other API functions.

With the help of different lambda functions inside the AWS Step Function it is possible to leverage the API calls to this service. The API answers are sent in JSON format.

The API calls, at this step, are asynchronous and created using the boto3 framework, as seen before. For example, considering the lambda “startClassificationDocument”, an asynchronous call could be:
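
The following is a minimal sketch of such a call; the classifier ARN, bucket paths and role ARN are placeholder assumptions, not the real project values.

```python
import boto3

comprehend = boto3.client("comprehend")

# Minimal sketch of an asynchronous custom-classification job submission.
response = comprehend.start_document_classification_job(
    JobName="dem-batch-classification",
    DocumentClassifierArn="arn:aws:comprehend:eu-west-1:123456789012:document-classifier/dem-classifier",
    InputDataConfig={
        "S3Uri": "s3://stageData/extracted-texts/",   # texts produced by the Textract step
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={
        "S3Uri": "s3://stageData/classification-output/"
    },
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
)

job_id = response["JobId"]
```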

As seen in this little script, we aim to start an async call to AWS Comprehend to get its responses with the predicted label of each document. The input data are read from the “S3Uri” value in the “InputDataConfig” and the outputs are written to the “OutputDataConfig” path.

The result is a tar.gz file that contains, for each document of the batch, the label prediction. The corresponding file is constructed by “getClassificationDocument”.

Which model is used?

For commercial purposes the model used by this ML service cannot be shared.

Why is it used here?

AWS Comprehend is used to obtain the classification of the documents and the key phrases or entities, according to the type of classification that a document could have. In this case, if you know the type of classification of a document, it is possible to better handle the following steps and to perform the proper extraction in the ETL phases.

Pro & Cons

Essentially, these are the main pros and cons of using Amazon Comprehend:

  • Pros:
    - Great NLP Tool
    - Secure
  • Cons:
    - Price
    - Language Limit

Conclusions

In conclusion, we have built a cloud workflow capable of processing a huge number of documents, for which the results are given in a few minutes. As seen in the architecture, these ML services can be used almost instantaneously with the proper precautions and can give good accuracy without the need to train on-premises models.

If you are interested in the Data Engineering part of this project you can read the first part of the article (Smart Data Extraction [DEM]. Data Engineer- Part 1 | by Giuseppe Brescia | Data Reply IT | DataTech | Feb, 2023 | Medium).
