Serverless Data Extraction with AWS

crossML engineering
Published in crossML Blog
Sep 3, 2019
Photo by Taylor Vick on Unsplash

Automated Data Extraction

Text extraction is the process of pulling the relevant text out of a scanned or photographed document. A vast amount of information still exists only in paper form. This post covers a small part of our hybrid document parsing solution, which uses AI to preprocess, extract, and parse data.

The following summarizes the process flow:

i. The user uploads the document to an S3 bucket (object-created event).

ii. Amazon S3 detects the object-created event.

iii. Amazon S3 invokes a Lambda function that is specified in the bucket notification configuration.

iv. AWS Lambda preprocesses the document, extracts the text using Amazon Textract, and parses the data into the required output format. Output files are exported to an S3 bucket.
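Step iii depends on a bucket notification configuration. As a rough sketch (not necessarily how our pipeline is wired), this is the shape of the configuration that boto3's `put_bucket_notification_configuration` accepts. The function names, the ARN, and the `.png` suffix filter are illustrative assumptions.

```python
def build_notification_config(lambda_arn, suffix=".png"):
    """Build a notification configuration that invokes a Lambda function
    whenever an object with the given suffix is created in the bucket."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def attach_notification(s3_client, bucket, lambda_arn):
    """Apply the configuration; s3_client is expected to be boto3.client("s3")."""
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```

Note that S3 also needs permission to invoke the function (a resource-based policy on the Lambda function), which is configured separately.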

AWS Serverless Pipeline

Serverless computing lets us remove operational responsibilities from the application: we can build and run it without thinking about servers. Read further at https://aws.amazon.com/serverless/

Let us create a serverless pipeline to process the documents.

1. CloudFront + S3 (Frontend Application Hosting): Amazon CloudFront is a fast content delivery network (CDN) service that can be used with an S3 bucket to serve an HTML/JS application. For more details, see https://aws.amazon.com/blogs/networking-and-content-delivery/amazon-s3-amazon-cloudfront-a-match-made-in-the-cloud/

This removes the need to deploy and maintain a server to host the frontend application.

2. S3 (Document Storage): The client application uploads the document containing the data to Amazon S3 (Amazon Simple Storage Service), the object storage service offered by Amazon Web Services.

Amazon S3 can publish events (for example, when an object is created in a bucket) to AWS Lambda and invoke your Lambda function by passing the event data as a parameter.
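To make the event payload concrete, here is a minimal sketch of how a Lambda function might read the bucket name and object key from an S3 notification. The helper name `uploaded_objects` is hypothetical. Object keys arrive URL-encoded in the event (a space becomes `+`), so they are decoded before use.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Yield (bucket, key) pairs for every record in an S3 event."""
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys are URL-encoded in S3 event notifications
        key = unquote_plus(s3_info["object"]["key"])
        yield bucket, key
```

Inside a handler this would typically be used as `for bucket, key in uploaded_objects(event): ...`.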

3. Lambda: AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume; there is no charge when your code is not running. Once an event from S3 is received, the function can call the Amazon Textract service through the boto3 SDK to run OCR.

All the backend logic for parsing, label mapping, and post-processing is implemented in AWS Lambda.
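The Textract call from Lambda could look roughly like the sketch below. It assumes the synchronous `detect_document_text` API and a single-page image already stored in S3; multi-page documents would need the asynchronous Textract APIs instead. The helper name `extract_lines` is hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
def extract_lines(textract_client, bucket, key):
    """Run OCR on an S3 object and return the detected lines of text.

    textract_client is expected to be boto3.client("textract").
    """
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # Textract returns PAGE, LINE and WORD blocks; keep only the lines
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
```

In a Lambda function this would be called as `extract_lines(boto3.client("textract"), bucket, key)`.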

4. Textract: Amazon Textract is an OCR service that uses machine learning to detect and extract text and data from a range of document types.

Amazon Textract returns each detected word together with its bounding box: its coordinates plus a width and height, all expressed relative to the page dimensions. The Lambda function then uses this geometry to pick out the specific text it needs from the returned words.
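As an illustration of picking text by position, the sketch below filters Textract `WORD` blocks whose bounding-box centre falls inside a rectangular region of the page. The function name and the region-based approach are assumptions for the example, not our production parsing logic. Coordinates are ratios of the page size (0.0 to 1.0), as in Textract's `Geometry.BoundingBox`.

```python
def words_in_region(blocks, left, top, right, bottom):
    """Return the text of WORD blocks whose centre lies inside the region."""
    hits = []
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        centre_x = box["Left"] + box["Width"] / 2
        centre_y = box["Top"] + box["Height"] / 2
        if left <= centre_x <= right and top <= centre_y <= bottom:
            hits.append((round(box["Top"], 2), box["Left"], block["Text"]))
    # Rough reading order: top to bottom, then left to right
    return [text for _, _, text in sorted(hits)]
```

For example, `words_in_region(blocks, 0.5, 0.75, 1.0, 0.9)` would collect the words in the lower-right part of the page, where a total amount might be printed on an invoice.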

5. S3 (Output Storage): Result files are exported to an S3 bucket, where the client application can access them.
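A minimal sketch of the export step, assuming the parsed result is a JSON-serializable dictionary. The `output/` prefix, the key naming scheme, and both function names are hypothetical choices for the example.

```python
import json
import posixpath


def output_key_for(source_key, prefix="output/"):
    """Derive the result key from the uploaded document's key."""
    stem, _ = posixpath.splitext(posixpath.basename(source_key))
    return f"{prefix}{stem}.json"


def export_result(s3_client, bucket, source_key, parsed):
    """Write the parsed data as JSON; s3_client is expected to be boto3.client("s3")."""
    key = output_key_for(source_key)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(parsed).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

If results go to the same bucket that triggers the Lambda function, a separate bucket or a prefix/suffix filter on the notification helps avoid the function invoking itself on its own output.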

Conclusion:

To sum up, serverless data extraction relies on a cloud computing execution model in which the cloud provider dynamically manages the allocation and provisioning of servers. Our serverless application runs in stateless compute containers that are event-triggered (Lambda), ephemeral (may last for one invocation), and fully managed by the cloud provider.
