Extracting Data from PDF Invoices

How our machine learning model improves the user experience and reduces manual effort to a minimum.

Background


Companies upload invoices to our platform to receive early payments and optimize their liquidity. They submit invoices as either scanned or digitally generated PDFs, as shown in Image 1.

Image 1: Invoice upload process

To finance an invoice, a company needs to provide the following data from the invoice document:

  1. Invoice amount
  2. Issue date
  3. Due date
  4. Invoice number
  5. Customer name

Although the UI for entering this data has been largely optimized, a user takes between 15 and 30 seconds on average to complete this step. That is not an issue for companies that finance only a few invoices every month. However, to finance hundreds of smaller invoices at once, even a very skilled user would need an hour or longer to fill in all the data.

So far, users with such large numbers of invoices have had to resort to other ways of entering the data, for example uploading CSV files that contain all the necessary data. While this is an improvement over entering the information individually for each invoice, the process is still tedious and inherently error-prone, especially for users who do not have accounting software that lets them export the required data conveniently.

Creating our own PDF data extractor

Training the machine learning model

We started by thinking about how a typical user would approach the problem. Table 1 sketches the mental model a typical user might have of the five data points.

Table 1: Schematic user model of detecting information on an invoice

To obtain a training set for our machine learning model, we hand-crafted features similar to those in Table 1 for each text token on each invoice. We then assigned a label to each token in the dataset, indicating which type of information the token corresponds to, if any. Table 2 contains the most important features we considered.

Table 2: Selection of relevant machine learning features
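
As a rough illustration of what per-token features can look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the Token class, the regular expressions, the keyword list and the feature names are hypothetical and are not the features from Table 2.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    """A text token with its position on the page (hypothetical layout)."""
    text: str
    x: float  # 0.0 = left edge of the page, 1.0 = right edge
    y: float  # 0.0 = top edge of the page, 1.0 = bottom edge


# Illustrative patterns only; real invoices need far more variants.
DATE_RE = re.compile(r"^\d{1,2}[./-]\d{1,2}[./-]\d{2,4}$")
AMOUNT_RE = re.compile(r"^\d{1,3}(?:[.,' ]?\d{3})*[.,]\d{2}$")
KEYWORDS = {"total", "amount", "due", "date", "invoice",
            "betrag", "datum", "rechnung"}


def token_features(tokens, i):
    """Build a feature dictionary for the token at index i."""
    tok = tokens[i]
    # Text of the preceding token, lower-cased and without a trailing colon.
    prev_text = tokens[i - 1].text.lower().strip(":") if i > 0 else ""
    return {
        "looks_like_date": bool(DATE_RE.match(tok.text)),
        "looks_like_amount": bool(AMOUNT_RE.match(tok.text)),
        "digit_ratio": sum(c.isdigit() for c in tok.text) / max(len(tok.text), 1),
        "x": tok.x,
        "y": tok.y,
        "prev_is_keyword": prev_text in KEYWORDS,
    }
```

Each token then becomes one row of the training set, paired with a label such as "amount", "issue date" or "none".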

While neural networks and deep learning are typically prime candidates for learning complex non-linear dependencies like those in this problem, the amount of data we had available was too small to train a large-scale neural network. Instead, we experimented mainly with random forest models, SVMs and gradient boosting. Ultimately, we achieved the best trade-off between performance, memory requirements and speed of training using a random forest model. For the final model we used a training set of around 1,000 invoices in German and English from around 300 different companies. Training and optimizing the model takes less than 30 minutes in our learning environment.

To get a first indication of the performance of our algorithm we used cross-validation, making sure to train the model on invoices from one set of companies and test it on invoices from another set of companies. This way, the algorithm is tested only on invoice formats it has not seen before. While this way of measuring performance is more restrictive than Advanon’s actual use case, it serves as a good lower-bound performance indicator.
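
The company-wise split described above can be sketched in a few lines of Python. This is a simplified illustration, not our production code: the function name company_folds and the greedy fold assignment are assumptions made for the example. scikit-learn's GroupKFold offers comparable behavior out of the box.

```python
from collections import defaultdict


def company_folds(invoice_companies, n_folds):
    """Yield (train_idx, test_idx) pairs in which no company appears on both sides.

    invoice_companies[i] is the company that issued invoice i.
    """
    by_company = defaultdict(list)
    for idx, company in enumerate(invoice_companies):
        by_company[company].append(idx)

    # Greedy balancing: place the largest companies first, always into
    # the fold that currently holds the fewest invoices.
    folds = [[] for _ in range(n_folds)]
    for company in sorted(by_company, key=lambda c: (-len(by_company[c]), c)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_company[company])

    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx
```

Because all invoices of a company land in the same fold, every test invoice comes from a company whose layout the model has not seen during training.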

Results

Table 3: Overview of model performance for selected data points¹

Of the three data points, due date is by far the hardest to classify with very high performance, because it is often specified relative to the issue date, e.g., “Payable within 30 days after delivery”. Thus, in many cases the due date can only be classified correctly if the issue date is also correct.
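
To make this dependency concrete, here is a minimal Python sketch that derives a due date from an issue date and a payment-terms phrase. The function name, the regular expression and the choice to count the days from the issue date are assumptions made for the example; real invoices use many more phrasings.

```python
import re
from datetime import date, timedelta

# Matches phrases such as "within 30 days" or "innerhalb von 30 Tagen".
# Illustrative only; it covers just these two wordings.
TERMS_RE = re.compile(
    r"(?:within|innerhalb von)\s+(\d{1,3})\s+(?:days|Tagen)", re.IGNORECASE
)


def due_date_from_terms(issue_date, terms_text):
    """Return issue_date plus the number of days in terms_text, or None."""
    match = TERMS_RE.search(terms_text)
    if match is None:
        return None
    return issue_date + timedelta(days=int(match.group(1)))
```

With an issue date of 12 March 2018 and the text “Payable within 30 days after delivery”, the function returns 11 April 2018. If the issue date is misread, the derived due date shifts by the same number of days, which is why an error in one field carries over to the other.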

Image 2: Example of correctly classified amount

Invoices the algorithm fails on are typically edge cases that would require a bit of analysis even for most users to interpret correctly. We are actively working on improving our model and training set to further reduce such misclassifications.

Invoice scanning app for convenient uploading

Image 3: Invoice uploading process

The app will be particularly convenient for users who prefer to upload invoices directly from their mobile devices, reducing the effort from a couple of minutes to less than 30 seconds.

Image 4: Testing the app prototype

For additional information or feedback please get in touch with christoph@advanon.com or mingda.liu.zhang@advanon.com.

¹The model has since been refined and re-trained on a larger dataset. The false positive rate (FPR) is now closer to 1% in all categories.

All About Advanon

This is the blog of Swiss online platform www.advanon.com.

Written by Christoph Hirnschall, Head of R&D at Advanon
