Serverless Data Extraction with AWS

crossML engineering
Sep 3, 2019 · 3 min read
Photo by Taylor Vick on Unsplash

Automated Data Extraction

Text extraction is the process of pulling the relevant text data out of an image of a document. A vast amount of information still exists only in paper form. This post describes a small part of our hybrid document parsing solution, which uses AI to preprocess, extract, and parse data.

The following summarizes the process flow:

i. The user uploads the document to an S3 bucket (object-created event).

ii. Amazon S3 detects the object-created event.

iii. Amazon S3 invokes a Lambda function that is specified in the bucket notification configuration.

iv. AWS Lambda preprocesses the document, extracts the text using AWS Textract, and parses the data into the required output format. Output files are exported to an S3 bucket.
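The bucket notification configuration mentioned in step iii can be sketched as follows. This is a minimal illustration, not our production setup: the helper names, the `.png` suffix filter, and the ARN in the usage note are assumptions. The S3 client is passed in, so in a real deployment you would create it with `boto3.client("s3")`.

```python
def build_notification_config(lambda_arn, suffix=".png"):
    """Build the configuration that tells S3 to invoke a Lambda function
    whenever an object with the given suffix is created."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def attach_notification(s3_client, bucket, lambda_arn):
    """Apply the configuration to a bucket.

    s3_client is expected to be a boto3 S3 client.
    """
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```

For example, `attach_notification(boto3.client("s3"), "my-upload-bucket", "arn:aws:lambda:...")` would wire a (hypothetical) bucket to a function. The Lambda function also needs a resource-based permission that allows S3 to invoke it; that step is not shown here.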

AWS Serverless Pipeline

Serverless computing lets us remove server-management responsibilities from the application: we can build and run it without thinking about servers. Read further at https://aws.amazon.com/serverless/

Let us create a serverless pipeline to process the documents.

1. CloudFront + S3 (Frontend Application Hosting): Amazon CloudFront is a fast content delivery network (CDN) service that can be used with an S3 bucket to serve an HTML/JS application. For more details, see https://aws.amazon.com/blogs/networking-and-content-delivery/amazon-s3-amazon-cloudfront-a-match-made-in-the-cloud/

This removes the need to deploy and maintain a server to host the frontend application.

2. S3 (Document Storage): The client application uploads the document containing the data to Amazon S3 (Amazon Simple Storage Service), the object storage service offered by Amazon Web Services.

Amazon S3 can publish events (for example, when an object is created in a bucket) to AWS Lambda and invoke your Lambda function by passing the event data as a parameter.
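As a rough sketch of what "passing the event data as a parameter" looks like in practice, the handler below reads the bucket name and object key from each record of an S3 event. The function names are illustrative; the record layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`) follows the S3 event notification format.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return a list of (bucket, key) pairs from an S3 event payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded (a space becomes '+'), so decode them.
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point that AWS Lambda calls with the S3 event."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        print(f"New document: s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

Decoding the key matters for file names that contain spaces or special characters; using the raw key in a later S3 call would otherwise fail to find the object.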

3. Lambda: AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume; there is no charge when your code is not running. Once an event from S3 is received, the function can call the AWS Textract service through the boto3 SDK to run OCR.
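A minimal sketch of that Textract call is shown below, assuming the document is an image already stored in S3. The helper name `detect_lines` is ours; `detect_document_text` is the synchronous text-detection call on the boto3 Textract client. The client is passed in as an argument, so inside Lambda you would create it once with `boto3.client("textract")`.

```python
def detect_lines(textract_client, bucket, key):
    """Run OCR on an S3 object and return the detected lines of text.

    textract_client is expected to be a boto3 Textract client.
    """
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # The response is a flat list of blocks (PAGE, LINE, WORD);
    # keep only the text of the LINE blocks.
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
```

Passing the client in, rather than creating it inside the function, also makes the helper easy to exercise with a stub in unit tests.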

All the backend logic for parsing, label mapping, and post-processing is implemented in AWS Lambda.

4. Textract: AWS Textract is an OCR service that uses machine learning to detect and extract text and data from a range of document types.

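AWS Textract returns each detected word together with its position on the page: the left and top coordinates, width, and height of its bounding box, expressed as ratios of the page dimensions. The Lambda function then uses these positions to pick the specific text it needs from the returned words.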
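One simple way to pick text by position is to keep the words whose bounding-box centre falls inside a region of interest. The sketch below is only an illustration of that idea, not our actual label-mapping logic; the function name and the region coordinates are assumptions. It relies on the `Geometry.BoundingBox` fields (`Left`, `Top`, `Width`, `Height`) that Textract attaches to each block.

```python
def words_in_region(blocks, left, top, right, bottom):
    """Join the text of WORD blocks whose centre lies inside the region.

    All coordinates are ratios of the page size (0.0 to 1.0), matching
    the values in Textract's BoundingBox. Words keep their input order.
    """
    selected = []
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        center_x = box["Left"] + box["Width"] / 2
        center_y = box["Top"] + box["Height"] / 2
        if left <= center_x <= right and top <= center_y <= bottom:
            selected.append(block["Text"])
    return " ".join(selected)
```

For instance, `words_in_region(blocks, 0.5, 0.75, 1.0, 0.9)` would collect whatever words sit in the lower-right area of the page, which is where a total often appears on an invoice.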

5. S3 (Output Storage): Result files are exported to an S3 bucket, where the client application can access them.
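The export step could look something like the sketch below, which writes the parsed data as a JSON file. The `parsed/` prefix, the JSON format, and the helper names are assumptions made for illustration; `put_object` is the standard boto3 S3 client call. As before, the client is passed in and would normally come from `boto3.client("s3")`.

```python
import json
import posixpath


def output_key_for(input_key, prefix="parsed/"):
    """Derive the output key from the uploaded document's key."""
    stem = posixpath.splitext(posixpath.basename(input_key))[0]
    return f"{prefix}{stem}.json"


def export_result(s3_client, bucket, input_key, parsed):
    """Write the parsed data to the output bucket and return its key.

    s3_client is expected to be a boto3 S3 client.
    """
    key = output_key_for(input_key)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(parsed).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

If the output goes to the same bucket that triggers the function, restrict the trigger to a prefix or suffix that the output files do not match; otherwise each result file would invoke the function again.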

Conclusion:

To sum up, serverless data extraction relies on a cloud computing execution model in which the cloud provider dynamically manages the allocation and provisioning of servers. Our serverless application runs in stateless compute containers that are event-triggered (Lambda), ephemeral (they may last for only one invocation), and fully managed by the cloud provider.
