Documents to Value: Part II Machine Learning

We are on the brink of discovering how document digitization can help us unlock the value trapped inside documents. The technology startup turicode specializes in data extraction from documents. In this four-part series of articles, we examine the opportunities and challenges arising from document digitization from different perspectives:

Part I — Business Solution: “Documents to Value”

Part II — Machine Learning — Reduce the amount of training required

Part III — User Experience

Part IV — Software Architecture

Machine Learning — Reduce the amount of training required

Intelligent machines that can take over many repetitive, boring tasks have been the talk of the last two years. However, it is often unclear how businesses can really benefit from them, especially because they require a lot of technological knowledge as well as massive amounts of data. To show how turicode uses machine learning in the document digitization process, it helps to look at an example case. Let’s say that you receive hundreds of purchase orders a day, and you and your team need to manually extract all relevant information, such as order numbers, product names and prices. To a certain extent, this can be done with rule-based extraction, as described in Part I of this series, or it can be tackled with a machine learning based approach.

In most of our customers’ cases, we opt for a hybrid approach that combines the best of both worlds. Rules are great for extracting information from documents with homogeneous layouts and structures, given their efficiency and the output quality of 100% precision that can be achieved. Moreover, starting off with a rule-based method allows us to automatically label high-quality training data for the machine learning system. However, rules are not very robust when the input layouts vary more widely. In such cases, machine learning outperforms rule-based methods in terms of the effort required for the initial parameterization as well as for maintenance. Therefore, we often consider an incremental, hybrid approach the best way to go.
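The hybrid idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not turicode's implementation: the regular expression, the function names and the `ml_fallback` callable are all hypothetical. A rule is tried first, a learned model is used only when the rule does not match, and every rule hit doubles as an automatically labelled training example.

```python
import re

# Hypothetical rule for one known, homogeneous layout ("Order No.: PO-1001").
ORDER_NUMBER_RULE = re.compile(r"Order No\.?:\s*(\S+)")


def extract_with_rules(text):
    """Return the order number if the rule matches, otherwise None."""
    match = ORDER_NUMBER_RULE.search(text)
    return match.group(1) if match else None


def extract_order_number(text, ml_fallback):
    """Try the rule first; hand the document to the learned model otherwise.

    ml_fallback is any callable taking the document text. It stands in
    for a trained model, which is not part of this sketch.
    """
    value = extract_with_rules(text)
    if value is not None:
        return value, "rule"
    return ml_fallback(text), "ml"


def auto_label(documents):
    """Collect (document, order number) pairs from documents the rule handles.

    These pairs can serve as labelled training data for the learned model.
    """
    labelled = []
    for text in documents:
        value = extract_with_rules(text)
        if value is not None:
            labelled.append((text, value))
    return labelled
```

Documents with a familiar layout never reach the model, while unfamiliar layouts are routed to it, which is one way to read the incremental, hybrid approach described above.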

This approach enables our engineers to go on more explorative journeys to improve the algorithms we use. The common assumption so far has been that the more training data, the better the results. After exploring various techniques such as SVMs, logistic regression classifiers and neural networks, we achieved positive results with a learning system that can work with as few as 10 to 15 training documents for purchase orders. A subject-matter expert first defines all the categories of information that need to be extracted from the files. In our purchase order example, these could be the delivery address, the desired delivery date and, of course, the items to be delivered. The classifier is then trained on the labelled data to assign the correct category to each piece of information in the PDF.
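To make the training step more concrete, here is a minimal sketch of a one-vs-rest logistic regression classifier in plain Python. Everything in it is an assumption for illustration: the three categories, the hand-crafted pattern features and the tiny training set are hypothetical and far simpler than what a production system would use.

```python
import math
import re

# Hypothetical categories defined by a subject-matter expert.
CATEGORIES = ["order_number", "delivery_date", "price"]


def features(text):
    """Map a text snippet to a small feature vector (illustrative only)."""
    return [
        1.0,  # bias term
        1.0 if re.fullmatch(r"[A-Z]{2}-\d+", text) else 0.0,         # ID-like
        1.0 if re.fullmatch(r"\d{2}\.\d{2}\.\d{4}", text) else 0.0,  # date-like
        1.0 if re.search(r"\b(CHF|EUR|USD)\b", text) else 0.0,       # currency
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.1):
    """Fit one logistic regression per category with batch gradient descent."""
    rows = [(features(text), label) for text, label in examples]
    weights = {c: [0.0] * len(rows[0][0]) for c in CATEGORIES}
    for _ in range(epochs):
        for c in CATEGORIES:
            grad = [0.0] * len(weights[c])
            for x, label in rows:
                p = sigmoid(sum(w * xi for w, xi in zip(weights[c], x)))
                err = (1.0 if label == c else 0.0) - p
                for i, xi in enumerate(x):
                    grad[i] += err * xi
            weights[c] = [w + lr * g for w, g in zip(weights[c], grad)]
    return weights


def predict(weights, text):
    """Return the category whose classifier gives the highest score."""
    x = features(text)
    return max(
        CATEGORIES,
        key=lambda c: sum(w * xi for w, xi in zip(weights[c], x)),
    )


# Hypothetical labelled snippets, two per category.
TRAINING = [
    ("PO-1001", "order_number"), ("PO-1002", "order_number"),
    ("01.02.2019", "delivery_date"), ("15.03.2019", "delivery_date"),
    ("CHF 12.50", "price"), ("EUR 99.00", "price"),
]

model = train(TRAINING)
print(predict(model, "AB-77"))
```

The labelled snippets play the role of the expert's annotations, and `predict` assigns one of the predefined categories to a snippet the model has not seen before.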

F1 score chart for a machine learning system

Our experiments have shown that, with the right feature selection and parameterization, the classification quality with few documents is as high as with more training material. The illustration above shows how the F1 score (the harmonic mean of precision and recall) already reaches a very high level with a handful of documents and surpasses 90% with 9 or more documents. These results are important because any business is able to provide a few high-quality documents, and the risk of ‘noise’ that comes with a larger document set can be limited. As machine learning systems need good, representative input to learn from, it is crucial to remove confusing, unclear input data. The motto “garbage in, garbage out” holds a lot of truth in the machine learning realm. Apart from this, being able to train the system with fewer examples makes the training process faster and makes the customer more independent.
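For readers unfamiliar with the metric, the F1 score can be computed from the counts of correct and incorrect extractions as follows. The function name and the numbers in the example are made up for illustration and are not results from our experiments.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        return 0.0  # no correct extractions: precision and recall are both 0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 9 fields extracted correctly, 1 wrong, 1 missed.
score = f1_score(true_positives=9, false_positives=1, false_negatives=1)
print(round(score, 2))
```

Because the harmonic mean is pulled toward the lower of its two inputs, a system only scores high on F1 if both precision and recall are high.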

By eliminating those barriers, we can now pave the way for businesses to use machine learning not just in isolated use cases but for digitizing their core processes. Our customers can label their documents and train models by themselves. Consequently, they can adapt faster to new document structures as well as to changing information requirements.

For now, our process does not completely eliminate human involvement in the case of purchase orders. Employees still provide the know-how to train the system and then evaluate the output. The system is designed to make the work experience more interesting for individual employees while making it more profitable for employers.

In Part III of our article series “Documents to Value”, we will further elaborate on how our customers interact with the learning system and make it more intelligent with every click.