
Digital transformation has become a top priority for companies, and this trend is set to accelerate further in light of the current COVID-19 situation. As part of this development, companies and public institutions are working to digitize their documents and therefore collect vast amounts of data every day.

This raises the question: Is there a point at which we will no longer work with documents?

Understanding Documents: Human vs. Machine

First, let us contrast how machines read and understand documents with how humans do.

Machines can only process structured data. For a machine, structured data are strings of numbers or characters. The information is represented in a way that lets the machine process it directly: the logic is given and all information is explicit. Documents such as scans, PDFs or emails are generally more complex, and much of their information is only implicit. The great variety in layouts, structures, inherent logic and contextual knowledge makes automated processing extremely difficult. …



Identifying tables in documents and decoding their structure is one of the biggest challenges in automated document analysis. Since we often run our information extraction engine MINT.extract on documents containing tables, we face this challenge very often at turicode. Therefore, we decided to research the ideal machine learning algorithms to detect and recognize tables in documents.

Analyzing tables with a machine learning model is a difficult task. Tables in documents come in very different shapes and sizes. Some have visible lines, some do not. Some have bold text for headers, others use background colors. …



In an earlier article, we explained what you should consider when you prepare a training set for your machine learning system. Once you have a set ready, you will want to train your system on it. An important part of training is evaluation. To help you evaluate your machine learning system, we present three useful measures of its performance.

How can I evaluate a black box?

Many people think of machine learning systems as black boxes. You give some form of input, you receive some form of output; what happens in between, no one knows. While it is true that some machine learning systems do not give us direct reports about how they transform input into output, this does not necessarily mean that we cannot observe and control what they are doing. With the right evaluation techniques, you can track the training of your machine learning system and take measures to improve it. …


turicode’s co-founder Martin Keller recently went on a business trip to Hong Kong with Venture Leaders Fintech. In this article, he gives us an insight into how he experienced Hong Kong and the Greater Bay Area, and why he thinks that this region might become the next Silicon Valley within a few years.

During the Venture Leaders Fintech trip, Martin Keller presented turicode to potential customers and partners in Hong Kong and the Greater Bay Area.

I just returned from a business trip to Hong Kong. As part of the Venture Leaders Fintech cohort, a roadshow co-organized by Venturelab and swissnex China to create business opportunities for fintech startups in the Far East, I met representatives of international companies, local authorities, investors and fintech exponents from the Greater Bay Area. This region consists of nine cities in the province of Guangdong and the special administrative regions Hong Kong and Macau, all of which are vibrant economic centers. …



If you want your machine learning system to be more reliable than any random guess generator, providing a good training set is crucial. Here are three things you should keep in mind if you want to put together a training set which will make your machine learning system perform at its best:

1. The training set needs to represent the production set

It seems obvious, but in practice, this is one of the main reasons why machine learning systems fail. …



On August 28 and 29, we presented turicode and our data extraction engine MINT.extract at the Swiss IT fair topsoft at Umwelt Arena Schweiz in Spreitenbach. At our stand, we welcomed visitors with a shot of mojito syrup and invited them to participate in a non-representative survey on the digitization of documents, which we had printed on two posters. The answers of our 52 participants, most of whom work for small and medium enterprises, yielded some interesting findings:

Finding 1: Manual copy-pasting is everyday business

With one exception, all participants of our survey stated that they manually copy or type valuable information out of documents on a daily basis. The majority of them have to copy-paste information more than five times a day. A fifth of all participants even stated that they manually copy-paste information more than 30 times a day. …


MINT.extract in production mode

turicode recently launched its MINT.extract technology as a standardized product. Customers can now extract unstructured information on their own with a few clicks. The three license models “Quick Start”, “Professional” and “Enterprise” offer a range of functionality to cover the different needs of small companies as well as larger corporations. We asked Aaron Richiger, our machine learning expert and research lead, a few questions about the use of machine learning in MINT.extract.

Aaron, how does turicode use machine learning to digitize documents?
We use machine learning at different steps of the pipeline from semi-structured or unstructured documents to structured, machine-readable output. Often it is the documents’ complexity that forces us to use new technologies like machine learning. For example, changing layouts push rule-based approaches to their limits, so we need a more flexible approach to fulfil all requirements. In this case, we or our customers train a self-learning system to extract the relevant information fully automatically. …



Software as a Service (SaaS) is very advantageous as it offers real-time processing with easy, frequent updates and overall great service. However, our customers are rightfully concerned about the security of their data, as it is often their most important asset. turicode is aware of this and therefore processes sensitive customer data according to the highest security standards. Here is how we deal with it.

Our Servers

Your dedicated services run on our internal servers or with a trusted Swiss data centre provider that is ISO certified and FINMA approved. …



In the fourth article of our series “Documents to Value”, we will take the time to outline some of the best practices of information retrieval from documents from an IT architecture point of view. Through different projects, we have learnt that there are four critical factors which should be considered for a successful integration of a machine learning system into an existing technical landscape.

Expandable

When setting up a new data extraction service, our customers typically start with 1–3 use cases to be covered. However, once the service is in production, new visions emerge, and often additional use cases should be processed on the same application. Hence, a comprehensive solution should be extensible in two dimensions. Firstly, the architecture needs to extend vertically to accommodate new document types (e.g. purchase orders, shipping receipts, bank statements) flexibly and without programming skills. Secondly, more services can be added to the overall pipeline (extending it horizontally) to increase the overall value and user experience. …


It has become common knowledge that a good user experience is key to any great product. In this third part of our series “Documents to Value”, we will highlight what the user experience looks like when using MINT.extract. So far in this series, we have looked at the business side of our solution as well as how turicode makes use of machine learning methods to improve the results for our customers.

In the following, we will illustrate what exactly this initial training entails through a step by step walkthrough.

1 — Define the relevant information: A domain expert needs to clearly define what information needs to be extracted from the documents and in what format. If this information changes depending on the documents, then this needs to be defined as well. In the example of a purchase order, we can see that the item number, the quantity and the total price are important. We also see that the price per item is not needed. Further, it has been defined exactly what kind of parameters the label total can have. For example, that it must be an integer, and therefore, cannot contain letters. In the next step, the relevant documents need to be selected. …
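To make the first step more concrete, here is a minimal sketch in Python of what such a definition could look like for the purchase order example. It is an illustration only: the article does not show how MINT.extract stores these definitions, so `FieldSpec`, `PURCHASE_ORDER_FIELDS`, `validate` and the field names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FieldSpec:
    """One piece of information to extract (hypothetical structure)."""
    name: str
    parse: Callable[[str], object]  # raises ValueError if the format is wrong
    required: bool = True


# Hypothetical definition for the purchase order example:
# item number, quantity and total are needed; price per item is not.
PURCHASE_ORDER_FIELDS = [
    FieldSpec("item_number", str),
    FieldSpec("quantity", int),
    FieldSpec("total", int),  # must be an integer, so letters are rejected
]


def validate(raw: Dict[str, str]) -> Dict[str, object]:
    """Keep only the defined fields and check each value's format."""
    result: Dict[str, object] = {}
    for spec in PURCHASE_ORDER_FIELDS:
        if spec.name not in raw:
            if spec.required:
                raise ValueError(f"missing field: {spec.name}")
            continue
        value = raw[spec.name].strip()
        try:
            result[spec.name] = spec.parse(value)
        except ValueError as exc:
            raise ValueError(f"invalid value for {spec.name}: {value!r}") from exc
    return result
```

With this sketch, `validate({"item_number": "A-100", "quantity": "3", "total": "240", "price_per_item": "80"})` drops the price per item and converts the quantity and total to integers, while a total such as `"24O"` (with a letter) raises a `ValueError`.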

About

turicode Inc.

Truly refreshing document digitization!
