Old-My Experience Extracting Invoice Data Using invoice2data in Python

Satheesh Mohan
Published in Version 1
Feb 3, 2023

I am a technical consultant who has been working in Version 1's Innovation Labs since August 2019. The Innovation Labs develops innovative solutions for customers by building Proofs of Concept (PoCs) and Proofs of Value (PoVs) that demonstrate how a solution addresses a real customer problem.

While working on invoice data extraction in the Innovation Labs, I came across the invoice2data Python package, which extracts data from the fields defined in invoice templates.

What is invoice2data?

invoice2data was created by Invoice-X and extracts structured data from PDFs using a template system. It works best on text-based PDFs, but it can also use different OCR libraries as the text extractor for PDF invoices.
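To make the template system more concrete, here is a minimal sketch of calling invoice2data from Python. It uses the library's `extract_data` and `read_templates` functions; the function name `extract_invoice` and the paths in the usage comment are my own placeholders, not part of the library or of the project described here.

```python
def extract_invoice(pdf_path, template_dir):
    """Return the fields invoice2data extracts from one PDF, or None."""
    # Imported lazily so this file can be loaded even where the
    # invoice2data package is not installed.
    from invoice2data import extract_data
    from invoice2data.extract.loader import read_templates

    # Load every YAML template found in the given folder.
    templates = read_templates(str(template_dir))

    # extract_data returns a dict of fields when a template matches,
    # and a falsy value otherwise.
    result = extract_data(str(pdf_path), templates=templates)
    return result or None


# Hypothetical usage (paths are placeholders):
#   data = extract_invoice("invoices/sample.pdf", "templates/")
#   print(data)
```

The library picks a template by checking its keywords against the text of the PDF, so the folder can hold templates for several issuers.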

Extraction Issues

I had issues extracting the multi-line "from" and "to" text blocks, and in using a custom regex to read complex tables.

To resolve this, I had the idea of customizing invoice2data so that a template can define areas to crop by their coordinates. I also extended the existing template format to accommodate this custom configuration. The custom fields that were defined are shown in the YAML template below.

Below is the example invoice image used to extract the data, and the template used.

Sample Invoice

YAML Template

# -*- coding: utf-8 -*-
issuer: ABC
file: ''
fields:
  tel: 'Tel (\+\w+(?:[ -)(]\w\w+)*)'
  email: 'Email:[ ]+(\w+@\w+.\w+)'
  website: 'Website:[ ]+(\w+.\w+.\w+)'
  vat_no: 'Vat no:[ ]+(\w+)'
  date: 'Date:[ ]+(\d{2}[/-]\d{2}[/-]\d{4})'
  invoice_number: 'Invoice Number[ ]+(\w+)'
  contact: 'Contact[ ]+(\w+)'
  bank: 'Bank\s+(\w+(?:[ ]\w\w+)*)'
  address: 'Address\s+(\w+(?:[ ,]\w\w+)*)'
  iban: 'IBAN\s+(\w+(?:[ ]\w\w+)*)'
  bic: 'BIC\s+(\w+)'
  due_d: 'Due Date\s+(\w+)'
  sub_total: 'Subtotal\s+(\d+.\d+)'
  vat: 'Vat\s+(\d+)%\s+(\d+.\d+)'
keywords:
  - ABC
  - 3 Middle Street
  - London, UK
custom:
  items:
    - name: 'to'
      area: (135, 878, 686, 349)
    - name: 'from'
      area: (1398, 325, 489, 197)
  regex:
    line: \s+(?P<desc>(\w+(?:[ ]\w\w+)*))\s+(?P<date>(\d{2}[-]\d{2}[-]\d{4}))\s+(?P<item>(\w+))\s+(?P<qty>(\d+))\s+(?P<unit_price>[€ ](\d+.\d+))\s+(?P<total>[€ ](\d+.\d+))\n
required_fields:
  - tel
options:
  currency: €
  date_formats:
    - '%d-%m-%Y'
  remove_whitespace: False
  languages:
    - en
  replace:
    - ['‘', '']
    - ['[', '']
    - ['€', '']
  decimal_separator: '.'
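To show what the field patterns and the line pattern in this template capture, here is a small sketch that applies a few of them with Python's standard `re` module. The sample text is invented for illustration and only stands in for what a PDF-to-text step might produce; this is not how invoice2data runs the template internally, just a way to try the expressions in isolation.

```python
import re

# Hypothetical text, standing in for the output of a PDF-to-text step.
SAMPLE_TEXT = (
    "Invoice Number   INV001\n"
    "Date: 03-02-2023\n"
    "Vat no: GB123456\n"
    "Subtotal   150.00\n"
    " Web design 01-02-2023 WD01 2 €50.00 €100.00\n"
    " Hosting 02-02-2023 HS02 1 €50.00 €50.00\n"
)

# A subset of the field patterns, copied from the template.
FIELDS = {
    "invoice_number": r"Invoice Number[ ]+(\w+)",
    "date": r"Date:[ ]+(\d{2}[/-]\d{2}[/-]\d{4})",
    "vat_no": r"Vat no:[ ]+(\w+)",
    "sub_total": r"Subtotal\s+(\d+.\d+)",
}

# The table-row pattern from the template, split across lines for readability.
LINE = re.compile(
    r"\s+(?P<desc>(\w+(?:[ ]\w\w+)*))\s+(?P<date>(\d{2}[-]\d{2}[-]\d{4}))"
    r"\s+(?P<item>(\w+))\s+(?P<qty>(\d+))"
    r"\s+(?P<unit_price>[€ ](\d+.\d+))\s+(?P<total>[€ ](\d+.\d+))\n"
)


def extract_fields(text):
    """Return the first capture group of every field pattern that matches."""
    found = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match:
            found[name] = match.group(1)
    return found


def extract_rows(text):
    """Return one dict per table row, keyed by the named groups."""
    rows = []
    for match in LINE.finditer(text):
        row = match.groupdict()
        # The price groups include the leading currency symbol; drop it here.
        row["unit_price"] = row["unit_price"].lstrip("€ ")
        row["total"] = row["total"].lstrip("€ ")
        rows.append(row)
    return rows


fields = extract_fields(SAMPLE_TEXT)
rows = extract_rows(SAMPLE_TEXT)
print(fields)
print(rows)
```

Note that the unescaped `.` in patterns such as `(\d+.\d+)` matches any character, not only a decimal point. Writing `\.` would make the amount patterns stricter.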

Output

The coordinates defined in the template are only valid for a given image resolution: if the resolution of the input image changes, the coordinates change with it. The invoice2data customisation takes this into account.
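One way to handle the resolution dependence is to rescale each area before cropping. The sketch below is not the code used in the customisation. It assumes that an `area` tuple means (x, y, width, height) in pixels, which is my reading of the template values, and the page sizes are hypothetical.

```python
def scale_area(area, source_size, target_size):
    """Rescale an (x, y, width, height) area from one image size to another.

    Assumption: `area` is (x, y, width, height) in pixels, and both sizes
    are (width, height) tuples in pixels.
    """
    x, y, w, h = area
    sx = target_size[0] / source_size[0]  # horizontal scale factor
    sy = target_size[1] / source_size[1]  # vertical scale factor
    return (round(x * sx), round(y * sy), round(w * sx), round(h * sy))


def to_box(area):
    """Convert (x, y, width, height) to a (left, top, right, bottom) box,
    the form many imaging libraries expect for cropping."""
    x, y, w, h = area
    return (x, y, x + w, y + h)


# The 'to' area from the template, defined against a hypothetical
# 2480 x 3508 pixel page and rescaled for a page twice that size.
to_area = (135, 878, 686, 349)
scaled = scale_area(to_area, (2480, 3508), (4960, 7016))
print(scaled)
print(to_box(to_area))
```

Storing the reference resolution alongside the template would let the same coordinates be reused for scans made at any resolution.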

As a further enhancement, machine learning could be used to remove the need for a template altogether.

My Recommendations

I strongly recommend invoice2data for invoice processing, as long as your invoice layout does not change often. It also lets you add your own custom plugins, which is great for meeting specific requirements. However, creating an invoice template requires good regex skills.

Learn More

The Innovation Labs has been in action since 2018 and has had many success stories in the form of collaborative Proofs of Value (PoVs). We are keen to engage more with our current and new customers to demonstrate how the latest technologies can add value to their business. To find out more about Innovation at Version 1, visit us here.

About the Author:
Satheesh Mohan is a Solution Architect at Version 1.
