Extracting Data from PDF Invoices
How our machine learning model improves the user experience and reduces manual effort to a minimum.
At Advanon, we process large numbers of invoices every day. Automatically extracting relevant information from invoices in arbitrary formats is a challenging problem at which classical rule-based approaches often fail. In this article we explain how we developed a state-of-the-art machine learning model for this task that in many cases outperforms existing solutions on the market and will help improve the user experience. We are currently testing the model on a mobile app prototype that aims to make uploading invoices to our platform even more convenient.
Companies upload invoices to our platform to receive early payments and optimize their liquidity. They submit invoices as either scanned or digitally generated PDFs, as shown in Image 1.
To finance an invoice, a company needs to provide the following data from the invoice document:
- Invoice amount
- Issue date
- Due date
- Invoice number
- Customer name
Even though the UI for entering this data has been largely optimized, a user takes between 15 and 30 seconds on average to complete this step. Spending that time is not an issue for companies that only finance a few invoices every month. However, to finance hundreds of smaller invoices at once, even a very skilled user would need an hour or longer to fill in all the data. Until now, users with such large numbers of invoices have had to resort to other ways of entering the data, for example CSV files that contain all the necessary information. While this is an improvement over entering the information individually for each invoice, the process is still tedious and inherently prone to errors, especially for users who do not have suitable accounting software that lets them export the required data in a convenient way.
Creating our own PDF data extractor
To overcome these problems and enable invoice financing for companies with a large number of invoices, we set out to find a solution that automatically detects the required information on invoice documents. Users would simply upload their invoices in PDF format and wait just a few seconds until the data of all invoices has been extracted. We started by comparing the solutions available on the market for automatically reading out the relevant fields. While several products perform well on certain data points, we were surprised not to find a reliable solution that performs well on all the fields we are interested in and also processes the information in a reasonable amount of time, ideally within a few seconds. We therefore decided to develop our own tool, capable of accurately extracting the relevant data from invoices.
Training the machine learning model
It was clear from the start that any increase in performance compared to existing solutions on the market would have to come from the quality of the machine learning approach, and not from the number of invoices available to train the model, since specialized providers likely have orders of magnitude more training data available.
We started by thinking about how a typical user would approach the problem. Table 1 sketches the mental model a typical user might have of the five data points.
To obtain a training set for our machine learning model, we hand-crafted similar features for each text token on each invoice. For training the model, we assigned a label to each token in the dataset, indicating the type of information the token corresponds to, if any. Table 2 contains the most important features we considered.
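The per-token feature idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Token` class, the regular expressions, the keyword list and the feature names are hypothetical and are not the features from Table 2.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; a real feature set would be considerably richer.
DATE_RE = re.compile(r"^\d{1,2}[./-]\d{1,2}[./-]\d{2,4}$")
AMOUNT_RE = re.compile(r"^\d{1,3}(?:[.,']\d{3})*[.,]\d{2}$")
TOTAL_KEYWORDS = {"total", "summe", "betrag"}  # assumed English/German cues


@dataclass
class Token:
    text: str
    x: float  # horizontal position on the page, 0 (left) to 1 (right)
    y: float  # vertical position on the page, 0 (top) to 1 (bottom)


def token_features(tokens, i):
    """Build a feature dictionary for the i-th token of one invoice."""
    tok = tokens[i]
    prev_text = tokens[i - 1].text.lower() if i > 0 else ""
    return {
        "looks_like_date": bool(DATE_RE.match(tok.text)),
        "looks_like_amount": bool(AMOUNT_RE.match(tok.text)),
        "digit_ratio": sum(c.isdigit() for c in tok.text) / max(len(tok.text), 1),
        "x": tok.x,
        "y": tok.y,
        "prev_is_total": prev_text.rstrip(":") in TOTAL_KEYWORDS,
    }


tokens = [
    Token("Total:", 0.70, 0.85),
    Token("1'250.00", 0.85, 0.85),
    Token("12.03.2018", 0.80, 0.10),
]
features = [token_features(tokens, i) for i in range(len(tokens))]
# One label per token, marking which data point it belongs to, if any.
labels = ["none", "amount", "issue_date"]
```

Each feature dictionary paired with its label forms one training example, so a single invoice contributes as many examples as it has tokens.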
While neural networks and deep learning are typically prime candidates for learning complex non-linear dependencies like the ones in this problem, the amount of data we had available was too small to train a large-scale neural network. Instead, we experimented mainly with random forest models, SVMs and gradient boosting. Ultimately, a random forest model gave us the best trade-off between performance, memory requirements and training speed. For the final model we used a training set of around 1,000 invoices in German and English from around 300 different companies. Training and optimizing the model takes less than 30 minutes in our learning environment.
To get a first indication of the performance of our algorithm, we used cross-validation, making sure to train the model on invoices from one set of companies and test it on invoices from another set of companies. This way, the algorithm is tested only on invoice formats it has not seen before. While this way of measuring performance is stricter than Advanon’s actual use case, it serves as a good lower bound on performance.
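A company-wise split of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the code we run: the function name `company_folds` and the greedy balancing strategy are assumptions.

```python
from collections import defaultdict


def company_folds(invoice_companies, n_folds):
    """Yield (train_idx, test_idx) pairs in which no company is on both sides."""
    by_company = defaultdict(list)
    for idx, company in enumerate(invoice_companies):
        by_company[company].append(idx)

    # Greedy balancing: place the largest companies first, each into the
    # fold that currently holds the fewest invoices.
    folds = [[] for _ in range(n_folds)]
    for company in sorted(by_company, key=lambda c: (-len(by_company[c]), c)):
        min(folds, key=len).extend(by_company[company])

    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


# One entry per invoice, naming the company that issued it (toy data).
companies = ["acme", "acme", "bolt", "cura", "cura", "cura", "dyn"]
splits = list(company_folds(companies, n_folds=3))
```

Because all invoices of a company land in the same fold, every test fold contains only layouts from companies the model did not see during training for that fold.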
Overall we achieved very encouraging results, in many cases outperforming existing solutions on the market on either speed of extraction or accuracy. With our model, extraction works in real time (less than 5 seconds for an average invoice), including network transfer and inference. Table 3 lists the accuracy of the prediction for the most important data points on an invoice (true positives, false positives). The two rates do not add up to 100% because the model only provides a prediction when its confidence of having found the correct token is above a certain threshold.
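The effect of the confidence threshold on the reported rates can be shown with a small Python sketch. The candidate scores, the threshold of 0.8 and the helper names `pick_token` and `summarize` are invented for illustration and do not reflect our actual values.

```python
def pick_token(candidates, threshold):
    """Return the highest-scoring token text, or None if it is not confident enough.

    candidates: list of (token_text, confidence) pairs for one field.
    """
    if not candidates:
        return None
    text, confidence = max(candidates, key=lambda c: c[1])
    return text if confidence > threshold else None


def summarize(predictions, truths):
    """Share of invoices with a correct, a wrong, and no prediction."""
    n = len(truths)
    correct = sum(p is not None and p == t for p, t in zip(predictions, truths))
    wrong = sum(p is not None and p != t for p, t in zip(predictions, truths))
    return {
        "correct": correct / n,
        "wrong": wrong / n,
        "abstained": (n - correct - wrong) / n,
    }


# Toy candidates for the invoice amount on four invoices.
invoices = [
    [("120.00", 0.95), ("20.00", 0.30)],   # confident and right
    [("75.50", 0.55), ("7.50", 0.40)],     # not confident: no prediction
    [("19.99", 0.90), ("199.90", 0.60)],   # confident but wrong
    [("300.00", 0.85)],                    # confident and right
]
truths = ["120.00", "75.50", "199.90", "300.00"]

predictions = [pick_token(c, threshold=0.8) for c in invoices]
rates = summarize(predictions, truths)
# rates == {"correct": 0.5, "wrong": 0.25, "abstained": 0.25}
```

Raising the threshold moves invoices from the "wrong" bucket into the "abstained" bucket, at the cost of also withholding some correct predictions.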
Of the three data points, due date is by far the hardest to extract reliably, since it is often specified relative to the issue date, e.g., as in “Payable within 30 days after delivery”. Thus, in many cases the due date can only be classified correctly if the issue date is also correct.
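The dependency between the two dates can be made concrete with a short Python sketch. It assumes that the payment term counts from the issue date and that it follows the simple pattern "within N days"; the regular expression and the function name `due_date_from_terms` are hypothetical.

```python
import re
from datetime import date, timedelta

# Assumed wording of the payment term; real invoices vary far more.
TERM_RE = re.compile(r"within\s+(\d{1,3})\s+days", re.IGNORECASE)


def due_date_from_terms(terms_text, issue_date):
    """Derive the due date from a relative payment term, or return None."""
    match = TERM_RE.search(terms_text)
    if match is None or issue_date is None:
        return None
    return issue_date + timedelta(days=int(match.group(1)))


terms = "Payable within 30 days"
due = due_date_from_terms(terms, date(2018, 3, 12))
# due == date(2018, 4, 11)

# A misread issue date (2 March instead of 12 March) shifts the due date too.
wrong_due = due_date_from_terms(terms, date(2018, 3, 2))
# wrong_due == date(2018, 4, 1)
```

Any error in the extracted issue date carries over unchanged into the derived due date, which is why the two fields cannot be evaluated independently.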
Invoices the algorithm fails on are typically edge cases that even most users would need to analyze a little to interpret correctly. We are actively working on improving our model and training set to further reduce such misclassifications.
Invoice scanning app for convenient uploading
As a proof of concept, we also decided to create a mobile app that allows users to directly upload an invoice to the Advanon platform by simply scanning it. The following video shows the process of scanning and uploading an invoice.
The app will be particularly convenient for users who prefer to upload invoices directly from their mobile devices, reducing the effort from a couple of minutes to less than 30 seconds.
¹The model has since been refined and re-trained on a larger dataset. The false positive rate (FPR) is now closer to 1% in all categories.