Three things you should consider when preparing a training set for your machine learning system

If you want your machine learning system to be more reliable than any random guess generator, providing a good training set is crucial. Here are three things you should keep in mind if you want to put together a training set which will make your machine learning system perform at its best:

1. The training set needs to represent the production set

It seems obvious, but in practice, this is one of the main reasons why machine learning systems fail. If the training set does not match the production set, biases are introduced in the machine learning process and, as a result, the system's output will not be accurate enough.

At turicode, we extract information from documents with a machine learning system. If we want our engine MINT.extract to read data out of purchase orders by several clients, we need to make sure that we train the system with purchase orders by all clients we want to include in the service, and that the ratio between documents by different clients is the same in the training set as it will be in production. Otherwise, the system might perform well on documents by one client, but poorly on documents by others.
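One way to check this ratio is to compare each client's share of the training set with the share you expect in production. The following is only a minimal sketch: the client names, the expected production shares, and the 5 percentage point tolerance are made-up assumptions for illustration, not values from our projects.

```python
from collections import Counter


def client_shares(client_labels):
    """Return each client's share of the documents as a fraction between 0 and 1."""
    counts = Counter(client_labels)
    total = sum(counts.values())
    return {client: n / total for client, n in counts.items()}


def share_gaps(training_clients, expected_production_shares, tolerance=0.05):
    """Return clients whose training share deviates from the expected
    production share by more than `tolerance` (positive = over-represented)."""
    training = client_shares(training_clients)
    gaps = {}
    for client in sorted(set(training) | set(expected_production_shares)):
        diff = training.get(client, 0.0) - expected_production_shares.get(client, 0.0)
        if abs(diff) > tolerance:
            gaps[client] = round(diff, 3)
    return gaps


# Hypothetical training set: one entry per document, naming the issuing client.
training_clients = ["client_a"] * 80 + ["client_b"] * 15 + ["client_c"] * 5
# Hypothetical estimate of how documents will be distributed in production.
expected = {"client_a": 0.5, "client_b": 0.3, "client_c": 0.2}

gaps = share_gaps(training_clients, expected)
for client, diff in gaps.items():
    print(f"{client}: training share is off by {diff:+.0%}")
```

In this made-up example, client_a is over-represented and the other two clients are under-represented, so you would add documents from client_b and client_c before training.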

The consequence of all this is that you need to know from the start what you will use your machine learning system for. It is tempting to start training right away and feed the system all the data you have, but it is worth taking some time to think about the actual data you want to detect before you begin. Ask yourself the following questions to find the right data for your training set:

  • What data do I want to detect with my machine learning system?
  • What data sources will I draw data from?
  • Do I need to pre-process my data before running it through my system?
  • Are there other types of data/data sources/data processing I need to consider?

In this context, it makes sense to not only think about the composition of your training set, but also about the composition of your production set. Are there any measures that you could adopt to improve the quality of your production set? Can you improve the pre-processing of your data, e.g. by buying a more accurate scanner for your documents or applying a stronger text recognition engine to your scans? If you can improve the quality of your production data samples, you can also improve the quality of the samples in your training set, and therefore, feed your machine with better training material.

Obviously, you will never achieve the goal of having only representative, bias-free, highest-quality training data. But if your system is well trained on the vast majority of the data representations it will encounter, it has a good chance of coping with the remaining ones as well.

2. Size matters — but you can compensate for it

There is a reason why large tech companies release their machine learning algorithms, but not their training data: the real value lies in the data. The best algorithm is worth nothing without the right training data. Depending on the complexity of the data you want to extract, the number of data samples you need for your training set can rise to enormous levels. If, for example, you want to extract data from random pictures, you will need hundreds of thousands of examples for your training set.

Yet, you should not compile a mountain of data only for the sake of having more data. In fact, if you gather the samples for your training set systematically, you can reduce your data mining efforts significantly. As mentioned before, you should collect high-quality data for as many different data representations as possible. If your data is of good quality, you can train your system with just a few examples per data point you want to detect.

To give you a concrete example: when we at turicode want to extract product numbers from purchase orders for three specific clients, we take a handful of purchase orders per client and let our system train on them. This leads to much better results than training the system on thousands of purchase orders we found on the internet or in a database, of which none were issued by our three clients.

In our projects, a training set is normally built with just a handful to a few dozen documents per layout we want to analyze. Our credo is to include as much data in the training set as necessary, but as little as possible. Smaller training sets also give us more flexibility in adjusting them.
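If you start from a large pool of documents, a small helper can cap the number of documents per layout. This is a sketch under assumptions: the document IDs, the layout names, and the cap of 12 are hypothetical, and random sampling is only one possible way to choose. Hand-picking documents that cover different variations of a layout may work better.

```python
import random
from collections import defaultdict


def sample_per_layout(documents, per_layout=12, seed=0):
    """Pick at most `per_layout` documents for each layout.

    `documents` is an iterable of (document_id, layout) pairs.
    Returns a dict mapping each layout to a sorted list of chosen IDs.
    """
    by_layout = defaultdict(list)
    for doc_id, layout in documents:
        by_layout[layout].append(doc_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selection = {}
    for layout, doc_ids in by_layout.items():
        k = min(per_layout, len(doc_ids))  # take everything if the pool is small
        selection[layout] = sorted(rng.sample(doc_ids, k))
    return selection


# Hypothetical pool: 200 documents from one client, 8 from another.
documents = [(f"a-{i}", "client_a") for i in range(200)]
documents += [(f"b-{i}", "client_b") for i in range(8)]

selection = sample_per_layout(documents, per_layout=12)
for layout, doc_ids in selection.items():
    print(layout, len(doc_ids))
```

Because the selection is small, you can inspect it by hand and swap individual documents in or out when a layout changes.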

3. The machine learning system is only as precise as its teacher

In order to teach your system which data it has to detect, you have to label the data in your training set. Often, this task is time-consuming and repetitive. However, you should give it enough attention, because if you make mistakes in labelling, the system will train on your mistakes and repeat them.

A smart move is to delegate the labelling of your training data to field experts. They know best which data to label and how to categorize it. The problem is that field experts are not necessarily programmers. Therefore, you should provide them with some sort of labelling interface.

At turicode, we have built a labelling editor which lets people label data points in documents by simply clicking on the words of interest and selecting the right label from a menu. In this way, we can let our example purchase orders be labelled by accountants who know which data points are relevant in this document type.

You might notice that even if you delegate the labelling to field experts, there can be inconsistencies in your training data. These inconsistencies occur because data is sometimes ambiguous. Some of these inconsistencies can be eliminated by providing training to the experts and giving them clear labelling guidelines.
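To find out where such inconsistencies occur, you can let two experts label the same documents and compare the results. The sketch below assumes a simple, hypothetical format in which each expert's labels are a dictionary from a data point ID to a label name. It computes a plain agreement rate and lists the disagreements, which can then be used to refine the labelling guidelines.

```python
def label_agreement(labels_a, labels_b):
    """Compare two experts' labels for the same data points.

    Both arguments map a data point ID to a label name. Only data points
    labelled by both experts are compared.
    Returns (agreement_rate, disagreements).
    """
    shared = sorted(set(labels_a) & set(labels_b))
    if not shared:
        raise ValueError("no data points were labelled by both experts")
    disagreements = {
        point: (labels_a[point], labels_b[point])
        for point in shared
        if labels_a[point] != labels_b[point]
    }
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


# Hypothetical labels from two experts for words in two purchase orders.
expert_1 = {
    "doc1:word7": "product_number",
    "doc1:word12": "quantity",
    "doc2:word3": "product_number",
    "doc2:word9": "order_number",
}
expert_2 = {
    "doc1:word7": "product_number",
    "doc1:word12": "quantity",
    "doc2:word3": "product_number",
    "doc2:word9": "product_number",
}

rate, disagreements = label_agreement(expert_1, expert_2)
print(f"agreement: {rate:.0%}")
for point, (label_1, label_2) in disagreements.items():
    print(f"{point}: expert 1 chose {label_1}, expert 2 chose {label_2}")
```

Data points on which the experts repeatedly disagree are good candidates for a clearer rule in the guidelines.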

But even then, some inconsistencies might remain because they are inherent to your data. In those cases, the experts cannot know which label to set, and consequently, you cannot expect your machine learning system to know either. On the positive side, this means that you do not have to reach 100 percent accuracy with your system. If it works as well as a human, that is already good enough.

Ready? Go!

As you have seen, preparing a good training set is crucial to making your machine learning system work reliably. This is definitely not the most interesting part of building a machine learning system. However, it is absolutely worth the effort, because once your training set is in place, you will be able to train your system with far fewer errors and in less time.

To learn more about turicode visit or send us an email to

Truly refreshing document digitization!
