In the fourth article of our series “Documents to Value”, we outline some best practices for information retrieval from documents from an IT architecture point of view. Across different projects, we have learnt that four critical factors should be considered for the successful integration of a machine learning system into an existing technical landscape.
When setting up a new data extraction service, our customers typically start with 1–3 use cases. However, once the service is in production, new ideas emerge, and additional use cases often need to be processed by the same application. Hence, a comprehensive solution must be extensible in two dimensions. Firstly, the architecture needs to extend vertically to accommodate new document types (e.g. purchase orders, shipping receipts, bank statements) flexibly and without programming skills. Secondly, more services can be added to the overall pipeline (extending it horizontally) to increase the overall value and improve the user experience. Typical examples of such services are Optical Character Recognition (OCR) to transform scans into PDFs, and classification services that split different document types and separate relevant from non-relevant information.
The modular setup of MINT.extract allows customers to tailor the solution to their current and future use cases. The horizontal pipeline can be extended with the above-mentioned pre-processing steps as well as with post-processing options such as validating against databases (both internal and external) or enriching the structured data with additional information. Additional data points can be extracted from the original documents rapidly and at any time. For example, only the address and customer number may be of interest at the beginning; at a later stage, the order date and reference number become important as well, which requires extending the scope of data extraction vertically.
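As a minimal sketch of the two dimensions described above, the following Python snippet models a pipeline as an ordered list of steps. All names (`Pipeline`, `make_extractor`, `validate_customer`) and the sample data are hypothetical and chosen purely for illustration; this is not the MINT.extract API, and the "extraction" is a plain dictionary lookup standing in for a learned model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# Hypothetical types for illustration only; not the MINT.extract API.
Document = Dict[str, object]
Step = Callable[[Document], Document]


@dataclass
class Pipeline:
    """An ordered list of processing steps applied one after another."""
    steps: List[Step] = field(default_factory=list)

    def add_step(self, step: Step) -> "Pipeline":
        # Horizontal extension: append another service to the pipeline.
        self.steps.append(step)
        return self

    def run(self, doc: Document) -> Document:
        for step in self.steps:
            doc = step(doc)
        return doc


def make_extractor(fields: List[str]) -> Step:
    """Vertical extension: the list of data points is configuration, not code."""
    def extract(doc: Document) -> Document:
        raw = doc.get("raw", {})  # stand-in for the real extraction model
        extracted = {name: raw.get(name) for name in fields}
        return {**doc, "extracted": extracted}  # return a copy, keep input intact
    return extract


def validate_customer(known_numbers: Set[str]) -> Step:
    """Post-processing step: check the customer number against a known set."""
    def validate(doc: Document) -> Document:
        number = doc["extracted"].get("customer_number")
        return {**doc, "customer_known": number in known_numbers}
    return validate


sample = {"raw": {"address": "1 Main St",
                  "customer_number": "C-42",
                  "order_date": "2020-01-31"}}

# Initial scope: two data points, no post-processing.
basic = Pipeline().add_step(make_extractor(["address", "customer_number"]))

# Later: one more data point (vertical) and a validation step (horizontal).
extended = (Pipeline()
            .add_step(make_extractor(["address", "customer_number", "order_date"]))
            .add_step(validate_customer({"C-42"})))
```

In this sketch, widening the scope means passing a longer field list, and adding a service means appending a step; the existing steps remain unchanged in both cases.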
This flexible and incremental approach thus allows new use cases to be brought into production faster and to benefit immediately from efficient processes and value-enhancing services.
The level of integration depends mainly on the importance of the use case (core vs. periphery), the prevailing enterprise system and the internal technical knowledge. In some cases, we deliver headless services via API for seamless integration with the client’s infrastructure. In other projects, we provide a web-based interface with a data editor, where users can edit or add information. Naturally, internal databases can be used to automatically validate and enrich the extracted data. With every interaction between our system and the internal knowledge of our customers (both human and database knowledge), our systems learn and improve over time.
The modular approach is also visible on both the input and the output side. Documents can, for example, be sent to our solution via a dedicated email address or uploaded to a customer-specific webpage. The return of the structured data is similarly flexible: the format is freely definable (JSON, XML, XLSX), as is the ‘mode of transport’ (e.g. API or email).
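To illustrate what a freely definable output format can look like in practice, here is a small Python sketch that renders the same extracted record as JSON or XML using only the standard library. The function names, the `SERIALIZERS` registry and the sample record are assumptions made for this example, not part of our product interface. XLSX is left out because it would require a third-party library, and the XML variant assumes that the field names are valid XML tag names.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict


def to_json(record: dict) -> str:
    """Render the record as a JSON object."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def to_xml(record: dict, root_tag: str = "document") -> str:
    """Render the record as flat XML, one child element per field.

    Assumes every key is a valid XML tag name (hypothetical constraint).
    """
    root = ET.Element(root_tag)
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical registry: supporting a new format means adding one entry here.
SERIALIZERS: Dict[str, Callable[[dict], str]] = {
    "json": to_json,
    "xml": to_xml,
}


def serialize(record: dict, fmt: str) -> str:
    """Look up the requested format and render the record with it."""
    try:
        serializer = SERIALIZERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported output format: {fmt}") from None
    return serializer(record)


record = {"customer_number": "C-42", "order_date": "2020-01-31"}
as_json = serialize(record, "json")
as_xml = serialize(record, "XML")
```

Keeping the format choice in a lookup table rather than in the extraction logic is one way to change the output without touching the rest of the pipeline.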
For every system that adds value for the user, good performance is key. Often, real-time processing is a critical factor, especially when there are short feedback loops between users and documents. In the second and third articles of this series, we described our dedicated machine learning systems from a technical and user experience standpoint. Another critical aspect of such a system is the time it takes to adapt to new documents, refine the outputs and move the updated solution into production.
Our internal architecture is well equipped to handle large quantities of documents. Depending on the pipeline, namely the number of pre- and post-processing steps, the end-to-end processing of a page takes from a few milliseconds to a couple of seconds. As our self-developed programming language DQL is designed for easy parallelization, we can, if necessary, speed up the process even more. As always with parallelization, the time saved needs to outweigh the added complexity of the parallelization task. Furthermore, one needs to keep in mind that some tasks simply cannot be parallelized, even with a great setup in place.
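The limit imposed by tasks that cannot be parallelized is commonly quantified with Amdahl's law: if a fraction p of the work can be parallelized across n workers, the overall speedup is 1 / ((1 - p) + p / n). The Python sketch below simply evaluates this formula; the fractions and worker counts are illustrative assumptions, not measurements of our system.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup according to Amdahl's law.

    parallel_fraction: share of the total work that can run in parallel (0..1).
    workers: number of parallel workers (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative values only: 90% of the work is assumed to be parallelizable.
for n in (1, 2, 8, 64):
    print(f"{n:>2} workers: speedup {amdahl_speedup(0.9, n):.2f}x")
```

With 10% of the work remaining serial, the speedup can never reach 10x regardless of how many workers are added, which is why the added complexity of further parallelization eventually stops paying off.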
Finally, another key success factor is to enable end users to influence and control the process that their documents go through. Keeping the process as modular as possible gives users great flexibility to extend, adapt and change it independently of us. Our customers can add new document types to the learning system, add new input formats and change the output format. The process can be managed via a web-based interface. The authentication process can be integrated with the customer's IAM solution via open standards such as SAML or OpenID Connect, and different reporting functionalities are available. This gives customers control over their documents. Furthermore, users can feed their corrections back into the machine learning system to improve it and ensure the best possible outcome.
With the right tools at hand, end users steer their document handling much as they controlled manual data entry in the old world. This feeling of control is the last piece of the puzzle for the successful deployment of a machine learning system into daily use.
A technical architecture that is expandable, integrated, instantaneous and steerable forms the backbone of a machine learning system for accessing valuable information from documents. We hope that you have enjoyed reading our series “Documents to Value”.
If you are curious about whether machine learning can unlock the potential of your documents and what comes next, contact us: email@example.com