It has become common knowledge that a good user experience is key to any great product. In this third part of our series “Documents to Value”, we will highlight what the user experience looks like when using MINT.extract. So far in this series, we have looked at the business side of our solution as well as how turicode makes use of machine learning methods to improve the results for our customers.
In the following, we illustrate what the initial training of the system entails in a step-by-step walkthrough.
1 — Define the relevant information: A domain expert needs to clearly define what information needs to be extracted from the documents and in what format. If the required information differs between document types, this needs to be defined as well. In the example of a purchase order, the item number, the quantity and the total price are important, while the price per item is not needed. The permitted values of each label are also defined exactly: the label total, for example, must be an integer and therefore cannot contain letters. In the next step, the relevant documents need to be selected.
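To make this more concrete, here is a minimal Python sketch of what such a label definition could look like. It is not the MINT.extract configuration format: the names `LabelSpec`, `PURCHASE_ORDER_LABELS` and `validate` are hypothetical, and the integer rule for the quantity is our assumption for illustration. Only the integer rule for the total comes from the example above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LabelSpec:
    """One label the domain expert wants extracted (hypothetical structure)."""
    name: str
    description: str
    is_valid: Callable[[str], bool]


def is_integer(text: str) -> bool:
    """True if the text consists of ASCII digits only, so it cannot contain letters."""
    stripped = text.strip()
    return stripped.isascii() and stripped.isdigit()


def is_non_empty(text: str) -> bool:
    """True if the text contains at least one non-whitespace character."""
    return bool(text.strip())


# Labels from the purchase order example. The price per item is left out
# on purpose because it is not needed.
PURCHASE_ORDER_LABELS = {
    "item_number": LabelSpec("item_number", "Identifier of the ordered item", is_non_empty),
    # Assumption: the quantity is treated as an integer as well.
    "quantity": LabelSpec("quantity", "Number of units ordered", is_integer),
    "total": LabelSpec("total", "Total price of the order", is_integer),
}


def validate(label: str, value: str) -> bool:
    """Check a value against the definition of its label; unknown labels are rejected."""
    spec = PURCHASE_ORDER_LABELS.get(label)
    return spec is not None and spec.is_valid(value)
```

With this sketch, `validate("total", "1200")` is accepted, while `validate("total", "12a0")` and `validate("price_per_item", "10")` are rejected.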
2 — Get around 20 sample documents for each possible case: For each layout type, there need to be around 20 documents that are representative of the real document set. Good training data is indispensable to achieving good results. As mentioned in the previous article, the motto in machine learning is “garbage in, garbage out”. Once a set of representative documents has been selected, the labelling and training process can start.
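A quick way to check whether a sample set meets this guideline is to count the documents per layout type. The following Python sketch assumes that each document has already been assigned a layout name by hand. The function name, the layout names and the hard threshold of 20 are illustrative assumptions, since the guideline itself only says "around 20".

```python
from collections import Counter
from typing import Dict, Iterable

# Assumption: "around 20" is treated as a hard minimum in this sketch.
MIN_DOCS_PER_LAYOUT = 20


def underrepresented_layouts(
    doc_layouts: Dict[str, str],
    expected_layouts: Iterable[str],
    minimum: int = MIN_DOCS_PER_LAYOUT,
) -> Dict[str, int]:
    """Return each expected layout with fewer than `minimum` documents and its count.

    `doc_layouts` maps a document name to its (manually assigned) layout type.
    """
    counts = Counter(doc_layouts.values())
    # Counter returns 0 for layouts with no documents at all.
    return {
        layout: counts[layout]
        for layout in expected_layouts
        if counts[layout] < minimum
    }


if __name__ == "__main__":
    samples = {f"supplier_a_{i}.pdf": "supplier_a" for i in range(20)}
    samples.update({f"supplier_b_{i}.pdf": "supplier_b" for i in range(5)})
    print(underrepresented_layouts(samples, ["supplier_a", "supplier_b"]))
```

An empty result means every expected layout is covered by enough samples.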
3 — Label documents and train the system: The domain expert can now label each selected document directly within the user interface by selecting the desired information on the original document and assigning each data item one of a set of pre-defined labels. Based on this labelling, the machine learning system can be trained. The illustration highlights this annotation process for business users. For each document, the domain expert draws a rectangle around each piece of relevant information and tags it with the corresponding label. Following this manual labelling, the model is trained automatically to predict all labels within the remaining documents.
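Conceptually, each annotation is a rectangle on the page plus a label. The Python sketch below shows one possible way to turn such rectangles into (label, text) training examples. It is a simplified illustration and not the internal data model of MINT.extract: the classes `Word` and `LabelledBox`, the coordinate convention and the rule that a word belongs to a box when its centre lies inside it are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    """A word on the page with the coordinates of its centre (hypothetical)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class LabelledBox:
    """A rectangle drawn by the domain expert, tagged with a pre-defined label."""
    label: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        """A word counts as inside the box if its centre lies within the rectangle."""
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def text_for_box(words: List[Word], box: LabelledBox) -> str:
    """Join all words inside the box, in the order they appear in `words`."""
    return " ".join(word.text for word in words if box.contains(word))


def training_examples(words: List[Word], boxes: List[LabelledBox]) -> List[Tuple[str, str]]:
    """Produce one (label, text) pair per annotated rectangle."""
    return [(box.label, text_for_box(words, box)) for box in boxes]
```

Pairs like these, together with the position of each rectangle, are the kind of input a model can learn from. How MINT.extract represents them internally is not covered here.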
4 — Correct the predicted labels: After the first round of training, the system is evaluated, and incorrectly or incompletely predicted labels are corrected, followed by a re-training of the system. It is often advantageous for this to be an iterative process: the more feedback MINT.extract receives from domain experts, the smarter its data extraction becomes with every new document. In the example, the total was falsely labelled as the item number. After the correction process, the document can be added to the training sample and the model is re-trained to include this feedback. Once the desired accuracy is reached, the system can be integrated into existing business processes.
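The correction step can be pictured as comparing the predicted labels with the version reviewed by the domain expert. The Python sketch below is an illustration under simple assumptions, not the evaluation logic of MINT.extract: labels are stored as a plain dictionary from label name to extracted value, accuracy is the share of reviewed labels predicted correctly, and all function names and sample values are hypothetical.

```python
from typing import Dict, List

Labels = Dict[str, str]  # label name -> extracted value (assumed format)


def find_corrections(predicted: Labels, reviewed: Labels) -> Labels:
    """Return every reviewed label whose value differs from the prediction.

    This includes labels the model missed entirely. Labels that were
    predicted but are absent from the review are ignored in this sketch.
    """
    return {
        label: value
        for label, value in reviewed.items()
        if predicted.get(label) != value
    }


def accuracy(predicted: Labels, reviewed: Labels) -> float:
    """Share of reviewed labels that the model predicted correctly."""
    if not reviewed:
        return 1.0
    return 1 - len(find_corrections(predicted, reviewed)) / len(reviewed)


def add_feedback(training_set: List[Labels], reviewed: Labels) -> None:
    """Add the corrected document to the training sample for the next re-training."""
    training_set.append(dict(reviewed))


if __name__ == "__main__":
    # Mirrors the example: the total (1200) was falsely labelled as item number.
    predicted = {"item_number": "1200", "quantity": "3"}
    reviewed = {"item_number": "4711", "quantity": "3", "total": "1200"}
    print(find_corrections(predicted, reviewed))
    print(round(accuracy(predicted, reviewed), 2))
```

Repeating this comparison after each re-training shows whether the desired accuracy has been reached.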
The process of manual labelling can be complemented by fixed rules where needed. This can be the case if the classifier is not delivering satisfactory results because of sparse data or layout changes. After consulting with the client, we formulate and implement these rules and integrate them into the AI-based system.
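As an illustration of how a fixed rule can back up a classifier, the Python sketch below falls back to a regular expression when the prediction is missing or has low confidence. The pattern, the confidence threshold of 0.8 and the function name are assumptions made for this example. In practice, the rules are agreed with the client and depend on the documents.

```python
import re
from typing import Optional

# Hypothetical rule: the word "Total", an optional colon, then an integer.
TOTAL_RULE = re.compile(r"\btotal\b\s*:?\s*(\d+)\b", re.IGNORECASE)


def extract_total(
    text: str,
    predicted: Optional[str],
    confidence: float,
    threshold: float = 0.8,  # assumed cut-off, not a MINT.extract setting
) -> Optional[str]:
    """Prefer a confident prediction; otherwise try the fixed rule."""
    if predicted is not None and confidence >= threshold:
        return predicted
    match = TOTAL_RULE.search(text)
    # If the rule does not match either, keep whatever the classifier produced.
    return match.group(1) if match else predicted


if __name__ == "__main__":
    page = "Item 4711  Qty 3  Total: 1200"
    print(extract_total(page, predicted=None, confidence=0.0))
```

Keeping the rule as a fallback rather than an override means a well-trained classifier stays in charge, and the rule only steps in where data is sparse or a layout has changed.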
From start to finish, a new system can be put into place within days, provided that all the parameters, including a clean specification of the valuable data, are clearly defined. Once the system is in place and running, new document types can easily be added through the steps described above. This allows our clients to grow the set of document types that can be automatically processed by our service, yielding structured data in the desired output format. With this system in place, employees no longer have to spend valuable time manually copying and pasting information into a business process.
While this approach relieves employees of boring, repetitive tasks, it does not eliminate human involvement completely. Expert knowledge is required during the training phase as well as afterwards in the validation phase. Ultimately, only a human can assess and improve the quality of an automated process.
In the last part of this “Documents to Value” series, we will elaborate on the missing piece: how to integrate the data extraction service seamlessly into the overall IT infrastructure.
If you would like to learn more about our platform, just write Martin an email! firstname.lastname@example.org