Data Extraction & Validation with AWS (ML/AI) Services — PoC

Dennis Brysiuk
Published in NEW IT Engineering
Apr 29, 2022
Photo by Hitesh Choudhary on Unsplash

Implementing machine learning and artificial intelligence algorithms usually requires a lot of effort, and the labeling work often has to be distributed over a large number of human workers. Training and deploying DL models is also a complex and lengthy process. On top of that come the enormous development costs.

What if you are not a data scientist, or the customer has no training data but wants a proof of concept / prototype in a short time?

There is a solution to every problem. AWS provides many services that already contain ML/AI mechanisms. With a little creativity and these powerful tools, a PoC / prototype can be implemented cheaply, quickly and without deep data science knowledge.

Use Case

One of our customers handles more than 42,000 application processes every year. An application process takes about 80 minutes and can contain over 20 different document types. A single credit officer can handle up to 1,400 processes per year, with a rework ratio of up to 20%.

Solution

The architecture design of a high-level solution would look like this:

Architecture Solution Blueprint

It contains DL models for an OCR and classification pipeline with an active learning mechanism. A human labeling pipeline uses innovative algorithms and user experience techniques to improve the accuracy of the human labels, and includes consolidation algorithms to eliminate the errors or bias of individual workers. Models are continuously trained and deployed so that they become more capable of automatic classification with each iteration.
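To make the idea of label consolidation a bit more concrete, here is a minimal sketch using a plain majority vote. This is only one simple way to consolidate labels and is an assumption on our side, not the algorithm used in the blueprint. The function name `consolidate_labels`, the `min_agreement` parameter and the example labels are all hypothetical.

```python
from collections import Counter


def consolidate_labels(worker_labels, min_agreement=0.5):
    """Consolidate the labels of several workers by majority vote.

    Returns (label, agreement). The label is None when there are no votes
    or when the winning label does not exceed min_agreement, so that the
    document can be sent to additional workers instead.
    """
    if not worker_labels:
        return None, 0.0

    counts = Counter(worker_labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(worker_labels)

    if agreement <= min_agreement:
        return None, agreement
    return label, agreement


if __name__ == "__main__":
    # Two of three (hypothetical) workers agree, so their label wins.
    print(consolidate_labels(["payslip", "payslip", "id_card"]))
    # A 1:1 split is not a majority, so no label is returned.
    print(consolidate_labels(["payslip", "id_card"]))
```

A single worker who mislabels a document is simply outvoted, which is the basic effect the consolidation step aims for.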

🤯🤯🤯

Wow, let's take it easy and first reduce the complexity of the use case. What if we just implement a simple two-path output infrastructure where the user can upload, trigger, verify and correct the processing for at least two different document types, using available AWS services?

Prototype

With JavaScript and Node.js we can implement a simple web UI for uploading documents to an S3 bucket, triggering the text extraction and validation, and verifying the results. A Lambda function then triggers text extraction, auto labeling and human verification, and saves the results to the database:

  • Auto labeling implemented in Python with simple search algorithms
  • Text extraction pipeline using Amazon Textract
  • Human verification with Amazon A2I
  • Data persistence layer on NoSQL DynamoDB

Prototype Architecture
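The auto labeling step from the list above could look roughly like the following sketch: a plain keyword search over the extracted text. The document types, the keyword lists and the function name `auto_label` are hypothetical and only serve as an illustration. The actual search rules of the prototype are not shown in this post.

```python
import re

# Hypothetical keyword lists for two document types (assumption for this sketch).
KEYWORDS = {
    "payslip": ["gross salary", "net salary", "employer"],
    "id_card": ["date of birth", "nationality", "identity card"],
}


def auto_label(text, keywords=KEYWORDS):
    """Label a document by counting how many keywords of each type it contains.

    Returns (document_type, score), where score is the share of that type's
    keywords found in the text, or (None, 0.0) when nothing matches.
    """
    # Lowercase and collapse whitespace so that line breaks from OCR do not matter.
    normalized = re.sub(r"\s+", " ", text.lower())

    hits = {
        doc_type: sum(1 for term in terms if term in normalized)
        for doc_type, terms in keywords.items()
    }
    best_type = max(hits, key=hits.get)

    if hits[best_type] == 0:
        return None, 0.0
    return best_type, hits[best_type] / len(keywords[best_type])


if __name__ == "__main__":
    sample = "Employer: ACME\nGross   Salary 3000\nNet salary 2100"
    print(auto_label(sample))
```

Documents that match no keyword come back without a label, which makes them natural candidates for the human verification path.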

And voilà, the prototype is implemented.
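For readers who want to see how the pieces fit together, here is a minimal sketch of the Lambda side in Python with boto3. It is written under several assumptions that are not taken from the prototype itself: the event shape (`bucket`, `key`), the environment variables `TABLE_NAME` and `FLOW_DEFINITION_ARN`, the item attributes, the status values and the confidence threshold of 90 are all hypothetical. The two paths are "accept automatically" and "send to human verification".

```python
import json
import uuid
from decimal import Decimal


def process_document(bucket, key, textract, table, a2i,
                     flow_definition_arn, threshold=90.0):
    """Extract text, pick one of two paths and persist the result."""
    # Synchronous text detection on a document stored in S3.
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = [b for b in response["Blocks"] if b["BlockType"] == "LINE"]
    text = "\n".join(b["Text"] for b in lines)
    # The weakest line decides; an empty document counts as confidence 0.
    confidence = min((b["Confidence"] for b in lines), default=0.0)

    if confidence >= threshold:
        status = "AUTO_ACCEPTED"
    else:
        # Path two: hand the document to a human reviewer via Amazon A2I.
        a2i.start_human_loop(
            HumanLoopName=f"verify-{uuid.uuid4()}",
            FlowDefinitionArn=flow_definition_arn,
            HumanLoopInput={
                "InputContent": json.dumps(
                    {"bucket": bucket, "key": key, "text": text}
                )
            },
        )
        status = "PENDING_REVIEW"

    item = {
        "document_id": f"{bucket}/{key}",
        "text": text,
        # The DynamoDB resource API expects Decimal instead of float.
        "confidence": Decimal(str(confidence)),
        "status": status,
    }
    table.put_item(Item=item)
    return item


def lambda_handler(event, context):
    # Imported here so process_document can be tried out with stub clients.
    import os
    import boto3

    item = process_document(
        event["bucket"],
        event["key"],
        textract=boto3.client("textract"),
        table=boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"]),
        a2i=boto3.client("sagemaker-a2i-runtime"),
        flow_definition_arn=os.environ["FLOW_DEFINITION_ARN"],
    )
    return {"document_id": item["document_id"], "status": item["status"]}
```

Passing the clients into `process_document` keeps the routing logic testable without an AWS account. Only `lambda_handler` touches real services.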

Achievements

We confirmed in a short time that the PoC works, and even with this prototype we could reduce the processing time to less than 5 minutes.
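To put these numbers into perspective, here is a small back-of-the-envelope calculation using only the figures from this post. It assumes, purely for illustration, that the 5 minutes (an upper bound) would apply to every one of the 42,000 yearly processes, which the prototype with its reduced scope has not shown.

```python
PROCESSES_PER_YEAR = 42_000
MINUTES_BEFORE = 80  # manual processing time per application process
MINUTES_AFTER = 5    # upper bound measured with the prototype

hours_before = PROCESSES_PER_YEAR * MINUTES_BEFORE / 60
hours_after = PROCESSES_PER_YEAR * MINUTES_AFTER / 60
reduction = 1 - MINUTES_AFTER / MINUTES_BEFORE

print(f"Before: {hours_before:,.0f} hours per year")     # Before: 56,000 hours per year
print(f"After:  {hours_after:,.0f} hours per year")      # After:  3,500 hours per year
print(f"Reduction per process: at least {reduction:.2%}")  # at least 93.75%
```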
