My Experience Extracting Invoice Data Using invoice2data in Python

Satheesh Mohan
Published in Version 1
3 min read · Jul 7, 2020
Photo by Lukas Blazek on Unsplash

I am a technical consultant and have been working as part of Version 1's Innovation Labs since August 2019. The Innovation Labs develops innovative solutions for its customers by creating Proofs of Concept (POCs) and Proofs of Value (POVs) that prove a solution to a real customer problem and can then be extended into a full solution.

While working in the Innovation Labs, I had an opportunity to extract invoice data, and I came across the invoice2data Python package, which extracts data from defined fields in templated invoices.

What is invoice2data?

invoice2data was created by Invoice-X and extracts structured data from PDFs using a template system. It works best on text-based PDFs, but it can also use different OCR libraries to extract text from PDF invoices.

My Experience

As described on its website, the main steps that invoice2data performs are as follows:

A modular Python library to support your accounting process. Tested on Python 2.7 and 3.4+. Main steps:

extracts text from PDF files using different techniques, like pdftotext, pdfminer or OCR – tesseract, tesseract4 or gvision (Google Cloud Vision).

searches for regex in the result using a YAML-based template system

saves results as CSV, JSON or XML or renames PDF files to match the content
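The three steps above can be illustrated with a minimal sketch that uses only the Python standard library. This is not invoice2data's own code: the sample text, the `TEMPLATE` dictionary, and the helper names `matches_template` and `extract_fields` are hypothetical, and the rule that every keyword must appear in the text is an assumption made for this sketch.

```python
import json
import re

# Hypothetical text, standing in for the output of pdftotext or an OCR engine.
SAMPLE_TEXT = """ABC
3 Middle Street
London, UK
Invoice Number INV001
Date: 07-07-2020
Subtotal 100.00
"""

# Hypothetical mini-template: keywords select the template, fields hold regexes.
TEMPLATE = {
    "issuer": "ABC",
    "keywords": ["ABC", "3 Middle Street"],
    "fields": {
        "invoice_number": r"Invoice Number[ ]+(\w+)",
        "date": r"Date:[ ]+(\d{2}[/-]\d{2}[/-]\d{4})",
        "sub_total": r"Subtotal\s+(\d+\.\d+)",
    },
}


def matches_template(text, template):
    """Assumption for this sketch: a template applies if every keyword occurs."""
    return all(keyword in text for keyword in template["keywords"])


def extract_fields(text, template):
    """Run each field regex and keep its first captured group, or None."""
    result = {"issuer": template["issuer"]}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


if matches_template(SAMPLE_TEXT, TEMPLATE):
    # Step 3: serialise the extracted fields, here as JSON.
    print(json.dumps(extract_fields(SAMPLE_TEXT, TEMPLATE), indent=2))
```

The real package loads its templates from YAML files and adds type conversion for dates and amounts; the sketch only shows the keyword match followed by per-field regex search.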

When using invoice2data, I encountered two issues: I could not extract multi-line text data (such as the "to" and "from" blocks of an invoice), and I could not apply a custom regex to read complex tables. To resolve this, I customised invoice2data so that an area can be cropped using coordinates defined in the template, and I extended the existing template format to accommodate this custom configuration. The custom fields are defined under the custom key in the YAML template below. The example invoice and the template used to extract its data are shown below.

Sample Invoice

YAML Template

# -*- coding: utf-8 -*-
issuer: ABC
file: ''
fields:
  tel: 'Tel (\+\w+(?:[ -)(]\w\w+)*)'
  email: 'Email:[ ]+(\w+@\w+\.\w+)'
  website: 'Website:[ ]+(\w+\.\w+\.\w+)'
  vat_no: 'Vat no:[ ]+(\w+)'
  date: 'Date:[ ]+(\d{2}[/-]\d{2}[/-]\d{4})'
  invoice_number: 'Invoice Number[ ]+(\w+)'
  contact: 'Contact[ ]+(\w+)'
  bank: 'Bank\s+(\w+(?:[ ]\w\w+)*)'
  address: 'Address\s+(\w+(?:[ ,]\w\w+)*)'
  iban: 'IBAN\s+(\w+(?:[ ]\w\w+)*)'
  bic: 'BIC\s+(\w+)'
  due_d: 'Due Date\s+(\w+)'
  sub_total: 'Subtotal\s+(\d+\.\d+)'
  vat: 'Vat\s+(\d+)%\s+(\d+\.\d+)'
keywords:
  - ABC
  - 3 Middle Street
  - London, UK
custom:
  items:
    - name: 'to'
      area: (135, 878, 686, 349)
    - name: 'from'
      area: (1398, 325, 489, 197)
  regex:
    line: '\s+(?P<desc>(\w+(?:[ ]\w\w+)*))\s+(?P<date>(\d{2}[-]\d{2}[-]\d{4}))\s+(?P<item>(\w+))\s+(?P<qty>(\d+))\s+(?P<unit_price>[€ ](\d+\.\d+))\s+(?P<total>[€ ](\d+\.\d+))\n'
required_fields:
  - tel
options:
  currency: €
  date_formats:
    - '%d-%m-%Y'
  remove_whitespace: False
  languages:
    - en
  replace:
    - ['‘', '']
    - ['[', '']
    - ['€', '']
  decimal_separator: '.'
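To show how a table-row regex with named groups, like the template's line pattern, can turn text rows into structured records, here is a small standard-library sketch. The pattern is adapted from the template (redundant inner groups removed, decimal points escaped, the currency symbol kept outside the captured amount). The sample rows and the `parse_lines` helper are hypothetical and are not taken from the sample invoice or from invoice2data.

```python
import re
from datetime import datetime

# Adapted from the template's line regex; one named group per table column.
LINE_PATTERN = re.compile(
    r"\s+(?P<desc>\w+(?:[ ]\w\w+)*)"
    r"\s+(?P<date>\d{2}-\d{2}-\d{4})"
    r"\s+(?P<item>\w+)"
    r"\s+(?P<qty>\d+)"
    r"\s+[€ ](?P<unit_price>\d+\.\d+)"
    r"\s+[€ ](?P<total>\d+\.\d+)\n"
)

# Hypothetical table text; the header row does not match the pattern.
TABLE_TEXT = (
    "  Description  Date  Item  Qty  Unit Price  Total\n"
    "  Web design   01-07-2020  WD01  2  €50.00  €100.00\n"
    "  Hosting      01-07-2020  HS02  1  €20.00  €20.00\n"
)


def parse_lines(text):
    """Return one dict per matching table row, with typed values."""
    rows = []
    for match in LINE_PATTERN.finditer(text):
        row = match.groupdict()
        # The date format mirrors the template's date_formats entry.
        row["date"] = datetime.strptime(row["date"], "%d-%m-%Y").date().isoformat()
        row["qty"] = int(row["qty"])
        row["unit_price"] = float(row["unit_price"])
        row["total"] = float(row["total"])
        rows.append(row)
    return rows


for row in parse_lines(TABLE_TEXT):
    print(row)
```

Because the description group only allows single spaces between words, a run of two or more spaces is treated as a column separator. That works for text extracted with layout preserved, but it would need adjusting for other extraction modes.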

Output

The coordinates defined in the template no longer line up if the resolution of the input image changes. The invoice2data customisation takes this into account.
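One way to handle a change in resolution is to rescale the template coordinates by the ratio of the two resolutions before cropping. The sketch below is an illustration under stated assumptions rather than the actual customisation: it assumes each area is written as (x, y, width, height) in pixels, that the template was measured at a known reference DPI, and the helper names and DPI values are hypothetical.

```python
from ast import literal_eval


def parse_area(area_text):
    """Parse a template string such as '(135, 878, 686, 349)' into four ints."""
    values = literal_eval(area_text)
    if len(values) != 4:
        raise ValueError("area must have exactly four values")
    return tuple(int(value) for value in values)


def scale_area(area, template_dpi, image_dpi):
    """Rescale an area measured at template_dpi to an image rendered at image_dpi."""
    factor = image_dpi / template_dpi
    return tuple(round(value * factor) for value in area)


def to_box(area):
    """Assumption: area is (x, y, width, height); return (left, top, right, bottom)."""
    x, y, width, height = area
    return (x, y, x + width, y + height)


# Hypothetical resolutions: template measured at 300 DPI, image rendered at 600 DPI.
area = parse_area("(135, 878, 686, 349)")
scaled = scale_area(area, template_dpi=300, image_dpi=600)
print(to_box(scaled))
```

A (left, top, right, bottom) box is the form that image libraries such as Pillow expect for cropping (`Image.crop`), so the scaled box could be passed on to the crop step and the cropped region sent to OCR.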

As a further enhancement, machine learning could be used to remove the need for a template.

Conclusion

I strongly recommend invoice2data for invoice processing, as long as your invoice layout does not change often. invoice2data lets you add your own custom plugins, which is great for meeting specific requirements. However, creating an invoice template requires good regex skills.

Learn More

The Innovation Labs has been in action since 2018 and has had many success stories in the form of successful collaborative Proofs of Value (POVs). We are keen to engage more with our current and new customers to demonstrate how the latest technologies can add value to their business. To find out more about Innovation at Version 1, visit us here.
