Training an ML Model with Anonymous Documents

Ward Van Laer
Published in Ixor
Oct 27, 2020 · 4 min read

Classification and detection systems for documents like invoices or payslips can be used to automate and improve complex document-processing flows. Our Ixorthink “Toolbox” contains various building blocks to create such flows: invoice detection, keyword detection, classification based on layout, content classification, etc.

Training such models brings along some difficulties. First of all, the data used for training is very sensitive: it is hard to get appropriate datasets, GDPR rules apply, and datasets can’t be reused for other clients. Moreover, the content of these documents is inherently complex: financial documents contain many “unknown” words, like names or addresses, combined with numeric data, which means that no pre-trained embedding is available for these words.


You can find more information about the inner workings of our invoice NER-model in this blog: “Using Neural Networks for Invoice Recognition”

Creating anonymous documents

To create an anonymous dataset, we propose two techniques: one supervised and one unsupervised. In this blog we will focus on the unsupervised method, which is the simplest and can easily be used on a wide range of datasets.

As an example, we try this on our invoicing dataset which contains more than 1k different invoice layouts. Sensitive data includes person names, company names, addresses, bank account details, order size (amounts) and even layout and background images such as logos.

Removing background images

To process PDF documents, we use an internal JSON format generated by an OCR model. These JSON files contain all layout and content information: font size, words and their locations. If we want to remove logos and colour information from the invoice, it suffices to generate a new PDF file based on these JSON files. The generated PDF is a black-and-white, stripped-down version containing only the detected words in their original positions.
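As a rough illustration of this step, the sketch below rebuilds such a stripped-down PDF with reportlab. The JSON structure (pages with word boxes and font sizes) is an assumption for the example; our internal format differs in the details.

```python
# Minimal sketch: rebuild a black-and-white PDF containing only the OCR words.
# The JSON layout assumed here (pages -> words with x, y, text, font_size) is illustrative.
import json
from reportlab.pdfgen import canvas

def render_stripped_pdf(json_path, pdf_path):
    with open(json_path) as f:
        doc = json.load(f)

    c = canvas.Canvas(pdf_path)
    for page in doc["pages"]:
        c.setPageSize((page["width"], page["height"]))
        for word in page["words"]:
            # Only the recognised text is drawn at its original position;
            # logos, colours and background images are simply left out.
            c.setFont("Helvetica", word.get("font_size", 10))
            # PDF coordinates start at the bottom-left, OCR boxes usually at the top-left.
            y = page["height"] - word["y"]
            c.drawString(word["x"], y, word["text"])
        c.showPage()
    c.save()
```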

Anonymising the content

We are going to anonymise the content based on the document frequency of each term. We build on the fact that sensitive data typically differs from document to document (person names and addresses change), while important terms like “Invoice number”, “Amount” and “Total” appear in almost every document. This means that sensitive words should be easy to detect based on word frequency. Specifically, we select all words that appear in fewer than 5% of the documents and transform them into gibberish by changing random letters.
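A minimal sketch of this selection step, assuming the documents are already available as token lists extracted from the OCR JSON:

```python
# Sketch: collect the "rare" words that appear in fewer than 5% of the documents.
from collections import Counter

def rare_words(documents, max_doc_freq=0.05):
    """documents: list of token lists, one per document (illustrative structure)."""
    doc_freq = Counter()
    for tokens in documents:
        # Count each word at most once per document.
        doc_freq.update(set(t.lower() for t in tokens))

    threshold = max_doc_freq * len(documents)
    return {word for word, count in doc_freq.items() if count < threshold}
```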

As a general rule, all words containing digits are automatically transformed into random strings as well. We do, however, preserve character classes: digits are changed to other digits, and letters to other letters.
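Put together, the scrambling could look like the sketch below (the helper names are illustrative, not our production code). A token like “INV-2047” would become something like “KQZ-7318”: letters stay letters, digits stay digits, and separators are kept.

```python
# Sketch: scramble a sensitive word while preserving character classes.
import random
import string

def scramble(word):
    out = []
    for ch in word:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            repl = random.choice(string.ascii_lowercase)
            out.append(repl.upper() if ch.isupper() else repl)
        else:
            out.append(ch)  # keep punctuation and separators as-is
    return "".join(out)

def anonymise(tokens, rare):
    """Scramble rare words and, as a general rule, any word containing a digit."""
    return [
        scramble(t) if (t.lower() in rare or any(c.isdigit() for c in t)) else t
        for t in tokens
    ]
```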

Part of an anonymised invoice.

While the result might look like it is written in some unknown language, we should only need a few important words to extract the necessary information. In theory, we should be able to train our invoice model on the newly created anonymous data and obtain the same performance as before. As an added benefit, the approach can also be viewed as a form of data augmentation, which may help the neural network to generalise.


Results

To measure the performance of a detection model, we use the recall@k scoring metric, which is mostly used for recommender systems. For k=1 this means that a field is detected correctly only if the candidate with the highest probability is the correct one. In other words: the invoice number field is correctly detected if the word with the highest probability of being an invoice number is indeed the invoice number. We compute this recall@1 score on a small test set of 61 invoices and 10 labels.
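The computation itself is straightforward; a sketch, with illustrative data structures, could look like this:

```python
# Sketch of recall@1: a labelled field counts as correct only if the
# highest-probability candidate word is the ground-truth word.
def recall_at_1(predictions, ground_truth):
    """predictions: {(invoice_id, label): [(word_id, probability), ...]}
       ground_truth: {(invoice_id, label): word_id}"""
    hits = 0
    for key, candidates in predictions.items():
        best_word, _ = max(candidates, key=lambda c: c[1])
        hits += int(best_word == ground_truth[key])
    return hits / len(predictions)
```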

To check whether training on anonymous data can yield similar model performance, we calculate the overall recall@1 score and its confidence interval. (The confidence intervals are calculated using the best model from an experiment of 10 training runs. You can find more information about confidence intervals for ML here.)
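The exact interval procedure is described in the linked post; as one common way to obtain such an interval, a bootstrap over the per-field hits of the best model would look roughly like this (the function and its parameters are assumptions for illustration):

```python
# Sketch: 95% bootstrap confidence interval for recall@1 on the test set.
import numpy as np

def bootstrap_ci(hits, n_resamples=10_000, alpha=0.05, seed=0):
    """hits: array of 0/1 values, one per labelled field in the test set."""
    rng = np.random.default_rng(seed)
    hits = np.asarray(hits)
    scores = [rng.choice(hits, size=len(hits), replace=True).mean()
              for _ in range(n_resamples)]
    return (np.quantile(scores, alpha / 2), np.quantile(scores, 1 - alpha / 2))
```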

Original dataset:

Overall recall@1 (10 labels) with 95% confidence: [0.963, 0.989]

Anonymous dataset:

Overall recall@1 (10 labels) with 95% confidence: [0.938, 0.980]

As you can see, the interval for the original model is only slightly higher. This strongly suggests that no essential information is lost in the process of document anonymisation, and that anonymous data can be used for training without a significant loss of performance.

At IxorThink we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable and fully developed solutions. Feel free to contact us for more information.
