Extracting Structured Data From Invoice
In this post we will look at how to process the SROIE dataset and train PICK-pytorch to extract key information from invoices.
A Colab notebook accompanies this post so you can run the tutorial code directly.
SROIE dataset
For the invoice dataset we use the data from the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE) competition.
Reference :
Folder structure
data/
    img/
        000.jpg
        001.jpg
    box/
        000.csv
        001.csv
    key/
        000.json
        001.json
Image Example
Csv data example
x1_1,y1_1,x2_1,y2_1,x3_1,y3_1,x4_1,y4_1,transcript_1
72,25,326,25,326,64,72,64,TAN WOON YANN
50,82,440,82,440,121,50,121,BOOK TA .K(TAMAN DAYA) SDN BND
205,121,285,121,285,139,205,139,789417-W
110,144,383,144,383,163,110,163,NO.53 55,57 & 59, JALAN SAGU 18,
192,169,299,169,299,187,192,187,TAMAN DAYA,
162,193,334,193,334,211,162,211,81100 JOHOR BAHRU,
....
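Each box line holds eight corner coordinates followed by the transcript, and the transcript itself can contain commas (see the "NO.53 55,57 & 59, ..." row). A minimal sketch of splitting such a line safely is shown here; the helper name `parse_box_line` is made up for illustration and is not part of the dataset tooling.

```python
def parse_box_line(line):
    """Split one SROIE box line into 8 integer coordinates and the transcript."""
    # Split on the first 8 commas only, so commas inside the
    # transcript stay part of the transcript.
    parts = line.rstrip("\n").split(",", 8)
    if len(parts) < 9:
        raise ValueError("expected 8 coordinates and a transcript: %r" % line)
    coords = [int(value) for value in parts[:8]]
    transcript = parts[8]
    return coords, transcript


coords, transcript = parse_box_line("72,25,326,25,326,64,72,64,TAN WOON YANN")
print(coords, transcript)
```

Note that this sketch keeps a trailing comma in the transcript as-is; the preprocessing script later in this post strips it.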
Key data example
{
    "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
    "date": "25/12/2018",
    "address": "NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
    "total": "9.00"
}
Downloading dataset
#dataset
!git clone https://github.com/zzzDavid/ICDAR-2019-SROIE.git
Cloning into 'ICDAR-2019-SROIE'...
remote: Enumerating objects: 94, done.
remote: Counting objects: 100% (94/94), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 2386 (delta 50), reused 65 (delta 22), pack-reused 2292
Receiving objects: 100% (2386/2386), 278.63 MiB | 23.17 MiB/s, done.
Resolving deltas: 100% (213/213), done.
Checking out files: 100% (1980/1980), done.
Preprocess Dataset
We will preprocess the dataset into the format that PICK-pytorch expects.
Reference: https://github.com/wenwenyu/PICK-pytorch/blob/master/data/README.md
Creating folders for preprocessed dataset
!mkdir boxes_and_transcripts images entities
Script for preprocessing dataset
import os
import pandas
import json
import csv
import shutil

## Input dataset
data_path = "ICDAR-2019-SROIE/data/"
box_path = data_path + "box/"
img_path = data_path + "img/"
key_path = data_path + "key/"

## Output dataset
out_boxes_and_transcripts = "/content/boxes_and_transcripts/"
out_images = "/content/images/"
out_entities = "/content/entities/"

train_samples_list = []
for file in os.listdir(box_path):
    ## Reading csv
    with open(box_path + file, "r") as fp:
        reader = csv.reader(fp, delimiter=",")
        ## arranging dataframe: index, coordinates x1_1,y1_1,x2_1,y2_1,x3_1,y3_1,x4_1,y4_1, transcript
        rows = [[1] + x[:8] + [','.join(x[8:]).strip(',')] for x in reader]
    df = pandas.DataFrame(rows)

    ## adding the ner label column: index, coordinates, transcript, ner tag
    df[10] = 'other'

    ## tagging every row whose transcript contains part of an entity value
    jpg = file.replace(".csv", ".jpg")
    with open(key_path + file.replace(".csv", ".json")) as fp:
        entities = json.load(fp)
    for key, value in sorted(entities.items()):
        idx = df[df[9].str.contains('|'.join(map(str.strip, value.split(','))))].index
        df.loc[idx, 10] = key

    ## saving files into the new dataset folders
    shutil.copy(img_path + jpg, out_images)
    with open(out_entities + file.replace(".csv", ".txt"), "w") as j:
        print(json.dumps(entities), file=j)
    df.to_csv(out_boxes_and_transcripts + file.replace(".csv", ".tsv"), index=False, header=False,
              quotechar='', escapechar='\\', quoting=csv.QUOTE_NONE)
    train_samples_list.append(['receipt', file.replace('.csv', '')])

train_samples_list = pandas.DataFrame(train_samples_list)
train_samples_list.to_csv("train_samples_list.csv")
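The tagging step is the heart of the script: each entity value is split on commas, and any transcript that contains one of the pieces gets that entity's name as its tag. The sketch below reproduces that idea with only the standard library so it is easy to experiment with. It is not the script's exact code: the function name `tag_rows` is hypothetical, and unlike the script it escapes each piece with `re.escape`, so characters such as "(" or "." are matched literally rather than as regular-expression syntax.

```python
import re


def tag_rows(transcripts, entities):
    """Return one tag per transcript using substring matching on entity values."""
    tags = ["other"] * len(transcripts)
    # Entities are visited in sorted key order, so a later key
    # overwrites an earlier one when both match the same row.
    for key, value in sorted(entities.items()):
        pieces = [re.escape(p.strip()) for p in value.split(",") if p.strip()]
        if not pieces:
            continue
        pattern = re.compile("|".join(pieces))
        for i, text in enumerate(transcripts):
            if pattern.search(text):
                tags[i] = key
    return tags


rows = ["TAN WOON YANN", "TAMAN DAYA,", "25/12/2018 8:13:39 PM"]
keys = {"date": "25/12/2018", "address": "JALAN SAGU 18, TAMAN DAYA"}
print(tag_rows(rows, keys))
```

Because matching is by substring, a company line that happens to contain part of the address is tagged as address, which is why the "BOOK TA .K(TAMAN DAYA) SDN BND" row ends up labelled address rather than company.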
Folder structure after preprocessing
boxes_and_transcripts/
    000.tsv
    001.tsv
images/
    000.jpg
    001.jpg
entities/
    000.txt
    001.txt
Preprocessed data example
Compared with the original CSV, each row in the .tsv file gains a leading index column and a trailing NER tag.
index ,x1_1,y1_1,x2_1,y2_1,x3_1,y3_1,x4_1,y4_1, transcript , ner tag
1,72,25,326,25,326,64,72,64,TAN WOON YANN,other
1,50,82,440,82,440,121,50,121,BOOK TA .K(TAMAN DAYA) SDN BND,address
1,205,121,285,121,285,139,205,139,789417-W,other
1,110,144,383,144,383,163,110,163,NO.53 55\,57 & 59\, JALAN SAGU 18,address
1,192,169,299,169,299,187,192,187,TAMAN DAYA,address
1,162,193,334,193,334,211,162,211,81100 JOHOR BAHRU,address
1,217,216,275,216,275,233,217,233,JOHOR.,address
1,50,342,279,342,279,359,50,359,DOCUMENT NO : TD01167104,other
1,50,372,96,372,96,390,50,390,DATE:,other
1,165,372,342,372,342,389,165,389,25/12/2018 8:13:39 PM,date
## document_type, file_name
train_samples_list.head()
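In the preprocessed rows, commas inside a transcript are written as `\,` because the file is saved with `quoting=csv.QUOTE_NONE` and a backslash escape character. If you want to read such a file back yourself, a minimal sketch with the standard `csv` module could look like this (the function name `read_tsv_rows` and the dictionary keys are our own choices, not something PICK-pytorch defines):

```python
import csv
import io


def read_tsv_rows(text):
    """Parse preprocessed rows: index, 8 coordinates, transcript, ner tag."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",
        quoting=csv.QUOTE_NONE,
        escapechar="\\",  # turns "\," back into a literal comma
    )
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        rows.append({
            "index": int(fields[0]),
            "box": [int(value) for value in fields[1:9]],
            "transcript": fields[9],
            "tag": fields[10],
        })
    return rows


sample = "1,192,169,299,169,299,187,192,187,TAMAN DAYA,address\n"
print(read_tsv_rows(sample))
```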
Splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split
train_test = pandas.read_csv("train_samples_list.csv",dtype=str)
train, test= train_test_split(train_test,test_size=0.2,random_state = 42)
Model
For the deep learning model we will use PICK-pytorch.
PICK is a framework that is effective and robust in handling complex documents layout for Key Information Extraction (KIE) by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity.
reference : https://github.com/wenwenyu/PICK-pytorch
@inproceedings{Yu2020PICKPK,
  title={{PICK}: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks},
  author={Wenwen Yu and Ning Lu and Xianbiao Qi and Ping Gong and Rong Xiao},
  booktitle={2020 25th International Conference on Pattern Recognition (ICPR)},
  year={2020}
}
!git clone https://github.com/wenwenyu/PICK-pytorch.git
Cloning into 'PICK-pytorch'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 218 (delta 1), reused 0 (delta 0), pack-reused 214
Receiving objects: 100% (218/218), 9.97 MiB | 8.62 MiB/s, done.
Resolving deltas: 100% (86/86), done.
Copy train data into PICK-pytorch data folder
for index, row in train.iterrows():
    # row.iloc[2] is the file name (third column of train_samples_list.csv)
    name = str(row.iloc[2])
    shutil.copy(out_boxes_and_transcripts + name + ".tsv", '/content/PICK-pytorch/data/data_examples_root/boxes_and_transcripts/')
    shutil.copy(out_images + name + ".jpg", '/content/PICK-pytorch/data/data_examples_root/images/')
    shutil.copy(out_entities + name + ".txt", '/content/PICK-pytorch/data/data_examples_root/entities/')

# Keep only document_type and file_name, with a fresh 0-based index
train.drop(['Unnamed: 0'], axis=1, inplace=True)
train.reset_index(inplace=True)
train.drop(['index'], axis=1, inplace=True)
train.to_csv("/content/PICK-pytorch/data/data_examples_root/train_samples_list.csv", header=False)
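Before training, it can save time to confirm that every sample in the list has its three files (.tsv, .jpg, .txt) in place, since a missing file only shows up later as a data-loading error. A minimal sketch of such a check follows; the helper names `expected_files` and `find_missing` are hypothetical, and the folder names are the ones used in this post.

```python
from pathlib import Path

# Folder name -> file extension, as laid out in this tutorial.
EXPECTED = {
    "boxes_and_transcripts": ".tsv",
    "images": ".jpg",
    "entities": ".txt",
}


def expected_files(sample_name, root):
    """Return the three paths a sample should have under the data root."""
    root = Path(root)
    return [root / folder / (sample_name + ext) for folder, ext in EXPECTED.items()]


def find_missing(sample_names, root):
    """Return every expected path that does not exist on disk."""
    return [
        path
        for name in sample_names
        for path in expected_files(name, root)
        if not path.exists()
    ]


# Example: list what is missing for two sample names under the train root.
root = "/content/PICK-pytorch/data/data_examples_root"
for path in find_missing(["000", "001"], root):
    print("missing:", path)
```

In the notebook you could pass the file-name column of the train dataframe as `sample_names`.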
Copy test data into PICK-pytorch data folder
!mkdir '/content/PICK-pytorch/data/test_data_example/entities/'
for index, row in test.iterrows():
    # row.iloc[2] is the file name (third column of train_samples_list.csv)
    name = str(row.iloc[2])
    shutil.copy(out_boxes_and_transcripts + name + ".tsv", '/content/PICK-pytorch/data/test_data_example/boxes_and_transcripts/')
    shutil.copy(out_images + name + ".jpg", '/content/PICK-pytorch/data/test_data_example/images/')
    shutil.copy(out_entities + name + ".txt", '/content/PICK-pytorch/data/test_data_example/entities/')

# Keep only document_type and file_name, with a fresh 0-based index
test.drop(['Unnamed: 0'], axis=1, inplace=True)
test.reset_index(inplace=True)
test.drop(['index'], axis=1, inplace=True)
test.to_csv("/content/PICK-pytorch/data/test_data_example/test_samples_list.csv", header=False)
## Removing data once it is copied into PICK-pytorch data folder
!rm /content/boxes_and_transcripts/*.tsv
!rm /content/images/*.jpg
!rm /content/entities/*.txt
%cd PICK-pytorch/
Modify the config file as per the PICK-pytorch guidelines.
https://github.com/wenwenyu/PICK-pytorch#distributed-training-with-config-files
NOTE: You may modify train_dataset and validation_dataset args in config.json as per your folder structures.
%%writefile config.json
{
    "name": "PICK_Default",
    "run_id": "test",
    "local_world_size": 4,
    "local_rank": -1,
    "distributed": -1,
    "model_arch": {
        "type": "PICKModel",
        "args": {
            "embedding_kwargs": {
                "num_embeddings": -1,
                "embedding_dim": 512
            },
            "encoder_kwargs": {
                "char_embedding_dim": -1,
                "out_dim": 512,
                "nheaders": 4,
                "nlayers": 3,
                "feedforward_dim": 1024,
                "dropout": 0.1,
                "image_encoder": "resnet50",
                "roi_pooling_mode": "roi_align",
                "roi_pooling_size": [7, 7]
            },
            "graph_kwargs": {
                "in_dim": -1,
                "out_dim": -1,
                "eta": 1,
                "gamma": 1,
                "learning_dim": 128,
                "num_layers": 2
            },
            "decoder_kwargs": {
                "bilstm_kwargs": {
                    "input_size": -1,
                    "hidden_size": 512,
                    "num_layers": 2,
                    "dropout": 0.1,
                    "bidirectional": true,
                    "batch_first": true
                },
                "mlp_kwargs": {
                    "in_dim": -1,
                    "out_dim": -1,
                    "dropout": 0.1
                },
                "crf_kwargs": {
                    "num_tags": -1
                }
            }
        }
    },
    "train_dataset": {
        "type": "PICKDataset",
        "args": {
            "files_name": "/content/PICK-pytorch/data/data_examples_root/train_samples_list.csv",
            "boxes_and_transcripts_folder": "/content/PICK-pytorch/data/data_examples_root/boxes_and_transcripts",
            "images_folder": "/content/PICK-pytorch/data/data_examples_root/images",
            "entities_folder": "/content/PICK-pytorch/data/data_examples_root/entities",
            "iob_tagging_type": "box_and_within_box_level",
            "resized_image_size": [480, 960],
            "ignore_error": false
        }
    },
    "validation_dataset": {
        "type": "PICKDataset",
        "args": {
            "files_name": "/content/PICK-pytorch/data/test_data_example/test_samples_list.csv",
            "boxes_and_transcripts_folder": "/content/PICK-pytorch/data/test_data_example/boxes_and_transcripts",
            "images_folder": "/content/PICK-pytorch/data/test_data_example/images",
            "entities_folder": "/content/PICK-pytorch/data/test_data_example/entities",
            "iob_tagging_type": "box_and_within_box_level",
            "resized_image_size": [480, 960],
            "ignore_error": false
        }
    },
    "train_data_loader": {
        "type": "DataLoader",
        "args": {
            "batch_size": 4,
            "shuffle": true,
            "drop_last": true,
            "num_workers": 8,
            "pin_memory": true
        }
    },
    "val_data_loader": {
        "type": "DataLoader",
        "args": {
            "batch_size": 4,
            "shuffle": false,
            "drop_last": false,
            "num_workers": 8,
            "pin_memory": true
        }
    },
    "optimizer": {
        "type": "Adam",
        "args": {
            "lr": 0.0001,
            "weight_decay": 0,
            "amsgrad": true
        }
    },
    "lr_scheduler": {
        "type": "StepLR",
        "args": {
            "step_size": 30,
            "gamma": 0.1
        }
    },
    "trainer": {
        "epochs": 100,
        "gl_loss_lambda": 0.01,
        "log_step_interval": 10,
        "val_step_interval": 50,
        "save_dir": "saved/",
        "save_period": 20,
        "log_verbosity": 2,
        "monitor": "max overall-mEF",
        "monitor_open": true,
        "early_stop": 40,
        "anomaly_detection": false,
        "tensorboard": false,
        "sync_batch_norm": true
    }
}
Overwriting config.json
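If your data lives somewhere other than /content, you can patch the dataset paths programmatically instead of editing the JSON by hand. Here is a minimal sketch, assuming the same four argument names used in the config above; the helper `point_dataset_at` and the example root `/data/sroie/train` are made up for illustration.

```python
import json


def point_dataset_at(config, section, root, list_file):
    """Rewrite the four path arguments of one dataset section in place."""
    args = config[section]["args"]
    args["files_name"] = root + "/" + list_file
    args["boxes_and_transcripts_folder"] = root + "/boxes_and_transcripts"
    args["images_folder"] = root + "/images"
    args["entities_folder"] = root + "/entities"
    return config


# In-memory stand-in for the parsed config.json (only the relevant part).
config = {"train_dataset": {"type": "PICKDataset", "args": {"ignore_error": False}}}
point_dataset_at(config, "train_dataset", "/data/sroie/train", "train_samples_list.csv")
print(json.dumps(config, indent=4))
```

With the real file you would load it with `json.load`, apply the same call to "train_dataset" and "validation_dataset", and write it back with `json.dump`.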
utils/entities_list.py holds the names of your entity classes.
Here we have 4 entities:
- company
- address
- date
- total
%%writefile utils/entities_list.py
Entities_list = [
    "company",
    "address",
    "date",
    "total"
]
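The config uses an IOB tagging type, so each entity name is typically expanded into a "begin" and an "inside" tag, plus one "outside" tag for everything else. The sketch below shows that expansion under the standard IOB convention. It is an illustration only: the function name is ours, and the exact tag order PICK-pytorch uses internally is an assumption we have not verified here.

```python
def build_iob_tags(entities):
    """Expand entity names into B-/I- tags plus a single O tag."""
    tags = []
    for name in entities:
        tags.append("B-" + name)  # first token of an entity
        tags.append("I-" + name)  # following tokens of the same entity
    tags.append("O")  # anything that is not an entity
    return tags


Entities_list = ["company", "address", "date", "total"]
print(build_iob_tags(Entities_list))
```

With 4 entities this gives 9 tags, which is the kind of number the CRF layer has to work with when "num_tags" is left at -1 in the config.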
Installing requirements for running PICK-pytorch.
!pip install -r requirements.txt
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
Training
Train for at least 100 epochs for better results. If you have multiple GPUs, you can change the list of GPU devices with the -d argument.
Reference : https://github.com/wenwenyu/PICK-pytorch#distributed-training-with-config-files
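To get a feel for what 100 epochs means with the StepLR settings in config.json (lr 0.0001, step_size 30, gamma 0.1), here is a minimal sketch of the step-decay rule. It assumes the scheduler is stepped once per epoch; the helper name `step_lr` is hypothetical, and in the real run PyTorch's StepLR does this for you.

```python
def step_lr(epoch, base_lr=0.0001, step_size=30, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under step decay."""
    # The rate is multiplied by gamma once every `step_size` epochs.
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 60, 90, 99):
    print(epoch, step_lr(epoch))
```

Under these assumptions the learning rate drops by a factor of 10 at epochs 30, 60 and 90, so most of the learning happens in the first 60 epochs or so.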
#!/bin/bash
!python -m torch.distributed.launch --nnode=1 --node_rank=0 --nproc_per_node=1 \
train.py -c config.json -d 0 --local_world_size 1
# --resume /content/PICK-pytorch/saved/models/PICK_Default/test_0917_074722/model_best.pth ##uncomment for resume training
[2020-10-03 09:55:08,494 - trainer - INFO] - Train Epoch:[22/100] Step:[250/250] Total Loss: 45.489735 GL_Loss: 0.946765 CRF_Loss: 44.542969
[2020-10-03 09:55:41,285 - trainer - INFO] - [Step Validation] Epoch:[22/100] Step:[250/250]
+---------+----------+----------+----------+----------+
| name | mEP | mER | mEF | mEA |
+=========+==========+==========+==========+==========+
| address | 0.742765 | 0.55 | 0.632011 | 0.55 |
+---------+----------+----------+----------+----------+
| company | 0.54717 | 0.623656 | 0.582915 | 0.623656 |
+---------+----------+----------+----------+----------+
| total | 0.591111 | 0.461806 | 0.518519 | 0.461806 |
+---------+----------+----------+----------+----------+
| date | 0.820359 | 0.88961 | 0.853583 | 0.88961 |
+---------+----------+----------+----------+----------+
| overall | 0.690977 | 0.58534 | 0.633787 | 0.58534 |
+---------+----------+----------+----------+----------+
[2020-10-03 09:55:45,078 - trainer - INFO] - Saving current best: model_best.pth ...
[2020-10-03 09:56:18,330 - trainer - INFO] - [Epoch Validation] Epoch:[22/100] Total Loss: 73.259899 GL_Loss: 0.011341 CRF_Loss: 72.125789
+---------+----------+----------+----------+----------+
| name | mEP | mER | mEF | mEA |
+=========+==========+==========+==========+==========+
| address | 0.743506 | 0.545238 | 0.629121 | 0.545238 |
+---------+----------+----------+----------+----------+
| company | 0.566038 | 0.645161 | 0.603015 | 0.645161 |
+---------+----------+----------+----------+----------+
| total | 0.570796 | 0.447917 | 0.501946 | 0.447917 |
+---------+----------+----------+----------+----------+
| date | 0.788235 | 0.87013 | 0.82716 | 0.87013 |
+---------+----------+----------+----------+----------+
| overall | 0.681481 | 0.57801 | 0.625496 | 0.57801 |
+---------+----------+----------+----------+----------+
[2020-10-03 09:56:42,240 - trainer - INFO] - Train Epoch:[23/100] Step:[10/250] Total Loss: 31.438135 GL_Loss: 0.875147 CRF_Loss: 30.562988
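The mEP, mER and mEF columns in the log above behave like precision, recall and F1: each mEF value matches the harmonic mean of the mEP and mER values on the same row. A small sketch to check this against the logged numbers (the function name `f1_score` is ours, not part of PICK-pytorch):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (mEP, mER, mEF) copied from the step-validation table for epoch 22.
logged = {
    "address": (0.742765, 0.55, 0.632011),
    "date": (0.820359, 0.88961, 0.853583),
    "overall": (0.690977, 0.58534, 0.633787),
}
for name, (mep, mer, mef) in logged.items():
    print(name, round(f1_score(mep, mer), 6), "logged:", mef)
```

Since the config monitors "max overall-mEF", the checkpoint saved as model_best.pth is the one with the highest overall value in that column.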
Testing
Creating testing folders
##creating testing folders
!mkdir /content/test_img /content/test_boxes_and_transcripts
## copy the first 10 files from the test samples
import os
import shutil
data_path = "data/test_data_example/boxes_and_transcripts/"
image_path = "data/test_data_example/images/"
out_img_path = "/content/test_img/"
out_box_path = "/content/test_boxes_and_transcripts/"
for file in os.listdir(data_path)[:10]:
    shutil.copy(data_path + file, out_box_path)
    shutil.copy(image_path + file.replace(".tsv", ".jpg"), out_img_path)
Prediction
## change model_best.pth path
!python test.py --checkpoint saved/models/PICK_Default/test_1003_053713/model_best.pth \
--boxes_transcripts {out_box_path} \
--images_path {out_img_path} --output_folder /content/output/ \
--gpu 0 --batch_size 2
2020-10-03 10:07:50.457224: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Loading checkpoint: saved/models/PICK_Default/test_1003_053713/model_best.pth
with saved mEF 0.6338 ...
5it [00:02, 1.83it/s]
You can find the predictions in the output folder.
They will look something like this:
company ADVANCO COMPANY,co
address NO 1&3\, JALAN ANGSA DELIMA 12
address WANGSA LINK\, WANGSA MAJU
address 53300 KUALA LUMPUR
date 23/03/2018
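Since one entity can be spread over several lines (the address above spans three), you will usually want to group the prediction lines by entity name. A minimal sketch follows, assuming each line is an entity name, then whitespace, then the text, with commas written as `\,` as in the sample above. The function name `group_predictions` is hypothetical.

```python
from collections import defaultdict


def group_predictions(text):
    """Group prediction lines by entity name, unescaping '\\,' to ','."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        # Split once on the first run of whitespace: name, then value.
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        entity, value = parts
        grouped[entity].append(value.replace("\\,", ",").strip())
    return dict(grouped)


sample = "address\t53300 KUALA LUMPUR\ndate\t23/03/2018\n"
result = group_predictions(sample)
print(", ".join(result["address"]), "|", result["date"][0])
```

Joining the address pieces with ", " gives a single string that is easier to compare with the ground-truth entities file.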
See the Colab notebook for the complete code with proper indentation.