Machine Learning, Documents and MINT.extract

turicode Inc.
Jul 10
MINT.extract in production mode

turicode recently launched its MINT.extract technology as a standardized product. Customers can now extract unstructured information on their own with a few clicks. The three license models “Quick Start”, “Professional” and “Enterprise” offer different sets of functionality to cover the needs of small companies as well as larger corporations. We asked Aaron Richiger, our machine learning expert and research lead, a few questions about the use of machine learning in MINT.extract.

Aaron, how does turicode use machine learning to digitize documents?
We use machine learning at different steps in the pipeline that leads from semi-structured or unstructured documents to structured, machine-readable output. Often it is the complexity of the documents that forces us to use newer technologies like machine learning. For example, changing document layouts push rule-based approaches to their limits, so we need a more flexible approach to fulfil all requirements. In this case, we or our customers train a self-learning system to extract the relevant information fully automatically. In this regard, the advantages of a machine learning system clearly outweigh those of more conventional methods such as template-based or rule-based solutions.

Are there any disadvantages when using machine learning?
A purely machine-learning-based system will never reach 100% output quality, just as humans cannot deliver perfect results over a longer period of time. The aim is to achieve an extraction quality comparable to that of manual capture. With the help of well-designed validation rules and warnings, we can identify potential errors and present them to the user for visual inspection. This way, we achieve a significant efficiency gain over manual capture, with close to 90% less processing time.
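To illustrate how validation rules can flag potential errors for review, here is a minimal Python sketch. It is not taken from MINT.extract: the invoice field names, the assumed date format and the three rules are hypothetical and chosen only to show the principle of checking extracted values before a person looks at them.

```python
import re
from dataclasses import dataclass


@dataclass
class ValidationWarning:
    field: str
    message: str


def validate_invoice(fields: dict[str, str]) -> list[ValidationWarning]:
    """Return warnings for extracted values that look implausible.

    All field names and rules are hypothetical examples.
    """
    warnings = []

    # Rule 1: required fields must be present and non-empty.
    for name in ("invoice_number", "date", "net", "vat", "total"):
        if not fields.get(name, "").strip():
            warnings.append(ValidationWarning(name, "missing value"))

    # Rule 2: the date must match the assumed format YYYY-MM-DD.
    date = fields.get("date", "").strip()
    if date and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        warnings.append(ValidationWarning("date", "unexpected date format"))

    # Rule 3: cross-field check, net + vat should equal total.
    amounts = [fields.get(k, "").strip() for k in ("net", "vat", "total")]
    if all(amounts):
        try:
            net, vat, total = (float(a) for a in amounts)
        except ValueError:
            warnings.append(ValidationWarning("total", "amount is not numeric"))
        else:
            if abs(net + vat - total) > 0.01:
                warnings.append(
                    ValidationWarning("total", "net + vat does not match total")
                )

    return warnings


# Example: a misread date format and a misread total are both flagged.
extracted = {
    "invoice_number": "2019-0042",
    "date": "10.07.2019",
    "net": "100.00",
    "vat": "7.70",
    "total": "170.70",
}
for warning in validate_invoice(extracted):
    print(f"{warning.field}: {warning.message}")
```

In a setup like this, only documents that produce warnings would need visual inspection, which is one plausible way such rules could reduce manual processing time.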

Usually one needs a lot of data to train a machine learning system. How is turicode coping with this?
Yes, exactly. The amount of data required to get decent results is often a big hurdle to getting started. With machine learning, the “garbage in, garbage out” rule holds very true. Therefore, we have put a lot of research into reducing the amount of data required. Today, a handful of documents per class can be enough to train a well-functioning system, depending on the document type. As a rule of thumb, the more complex a document type is, the more example documents are needed, but thousands of documents are no longer required. Fewer than a hundred representative sample documents are not only easier to collect but often deliver better quality rates.

Why do companies extract their data?
Typing up information manually is an error-prone and boring task. Especially if employees are occupied with it every day, it makes sense to automate this task and drive digitalization forward. Moreover, given the manual effort involved, businesses would not tackle large document corpora at all without an automated approach. As an outcome of our work, we observe that customer satisfaction and employee satisfaction go hand in hand. Reaction times to customer requests are shorter, and the answers are often of higher quality. At the same time, employees can dedicate their time to more interesting and complex tasks as part of their daily work.

For more information about the topic, or if you have a specific project in mind, please do not hesitate to contact us. Have a look at our website (www.turicode.com) or write us an email (info@turicode.com).