Extracting Text and Invoice Number from various File Formats using NLP

Sathvicknarahari
Published in The Startup
5 min read, May 12, 2020

Many files contain information that we want to extract for later use. This article is for anyone who wants to extract either the entire text of a file or only certain fields from it. The files I work with here are invoices, from which I want to extract details such as the GST number and the invoice number.

Source : https://www.shutterstock.com/image-vector/minimal-yellow-invoice-template-vector-design-1017479284

The project is divided into three tasks:

  1. Data Extraction
  2. Data Preprocessing
  3. Data Frame

I have subdivided each task for my convenience. Each task has the same sub-tasks:

  1. Doing the task on a single pdf file.
  2. Doing the task on a single image file.
  3. Doing the task on multiple pdf files.
  4. Doing the task on multiple image files.
  5. Doing the task on all files together.

Dividing a task this way helps a lot. In every task I first figure out how to perform it on a single file of each format.

After that, all I need to figure out is how to write the code so that the task runs iteratively over all files. It is one thing to perform a task on a single file format and another to make it work on all file formats together. This depends largely on how well you write conditional and iterative statements.

Diving into tasks:

1. Data Extraction

The data can be extracted from different file formats (pdf, tif). For pdf files we use textract to extract the text. Sometimes a pdf contains images instead of selectable text; a text-based extractor such as PyPDF2 cannot read those, and pdf2image is helpful in that case. Because it is difficult to check every time whether a pdf contains images, it is simpler to convert all pdf files into images and then extract the text from them using pytesseract.

One major problem arises when a pdf contains images and has more than one page: we need to keep track of the number of pages in each pdf file and make sure that the text of one page is not overwritten by the text of the following pages.
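A minimal sketch of one way to keep the text of every page. It assumes the pages have already been rendered to images (for example with pdf2image's `convert_from_path`) and that `ocr` is any function that turns one page image into a string (for example `pytesseract.image_to_string`). The helper names `ocr_pages` and `join_pages` are hypothetical and are not taken from my original code.

```python
def ocr_pages(pages, ocr):
    """Run `ocr` on every page and return one string per page, in page order."""
    page_texts = []
    for page in pages:
        # Append instead of assigning, so page 2 never overwrites page 1.
        page_texts.append(ocr(page))
    return page_texts


def join_pages(page_texts):
    """Merge the per-page strings into one string for the whole document."""
    return "\n".join(page_texts)


if __name__ == "__main__":
    # Stand-ins for real page images and a real OCR call, so the sketch
    # runs without Poppler or Tesseract installed.
    fake_pages = ["page-1", "page-2"]
    fake_ocr = lambda page: "text of " + page
    print(join_pages(ocr_pages(fake_pages, fake_ocr)))
```

With the real libraries installed, `pages` would be the list returned by `convert_from_path(pdf_path)` and `ocr` would be `pytesseract.image_to_string`. Keeping one string per page also makes it easy to check the page count of each file with `len(page_texts)`.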

Extracting from a single file is simple, as we provide the location of only that file. To extract from all files, we use the os package to get the current working directory and list all the files in it. We can store all file names, along with their extensions, in a list; this lets us extract each file according to its format.
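The directory listing could look roughly like the sketch below. It uses only the standard library; the function name `group_files_by_extension` is hypothetical, and grouping the paths in a dictionary keyed by extension is my assumption about a convenient layout, not necessarily what the original code did.

```python
import os
from collections import defaultdict


def group_files_by_extension(directory):
    """Map each lower-cased extension (e.g. '.pdf') to the file paths that use it."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        extension = os.path.splitext(name)[1].lower()
        groups[extension].append(path)
    return dict(groups)


if __name__ == "__main__":
    # List the files in the directory we are currently working in.
    for extension, paths in group_files_by_extension(os.getcwd()).items():
        print(extension, len(paths))
```

Lower-casing the extension means that `scan.PDF` and `scan.pdf` end up in the same group, so the later conditional statements only need to check one spelling.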

We use textract to extract text from pdf files through its process method, which takes the path to the file as an argument.
fig 1: Extracting text from pdf file
We use the pytesseract and PIL packages to extract text from images.
fig 2: Extracting text from image file
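Since the two figures are screenshots, here is a rough sketch of what such calls can look like when both cases are wrapped in one function. The function name `extract_text` and the list of image extensions are assumptions for illustration. The third-party imports sit inside the branches so that the function can be defined even when only one of the libraries is installed.

```python
import os

PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}  # assumed list


def extract_text(path):
    """Return the text of a pdf (via textract) or an image (via pytesseract)."""
    extension = os.path.splitext(path)[1].lower()
    if extension in PDF_EXTENSIONS:
        import textract  # third-party
        # textract.process returns bytes, so decode them to get a str.
        return textract.process(path).decode("utf-8")
    if extension in IMAGE_EXTENSIONS:
        from PIL import Image  # third-party (Pillow)
        import pytesseract  # third-party, needs the Tesseract binary
        with Image.open(path) as image:
            return pytesseract.image_to_string(image)
    raise ValueError(f"Unsupported file type: {extension!r}")
```

Raising an error for unknown extensions keeps unsupported files from silently producing empty text. Multi-page tif files would likely need extra handling, which this sketch does not attempt.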

2. Data Pre-processing

After the text is extracted from all files, we store it as strings in a list, on which we perform some pre-processing tasks. Let us take one extracted text and perform all the pre-processing tasks on it.

fig 3: The original text extracted from pdf file.

It is good practice to convert the entire text to lower case. We also replace some characters and newlines with spaces so that the text looks cleaner. We then import stopwords from nltk.corpus; stopwords are common words (such as “the” and “is”) that add little meaning to a sentence, and they are easy to remove because nltk already provides the list. The preprocessed text is stored in a separate list, taking care that the original text is not lost.

fig 4: Preprocessing on the text extracted
fig 5: Removing stopwords from the text
fig 6: Text after preprocessing and removing stopwords
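The preprocessing steps above can be sketched as follows. Which characters get replaced is my assumption here (everything except letters, digits and whitespace), and the small fallback stopword set exists only so that the sketch still runs when the nltk stopword corpus has not been downloaded. The function names are hypothetical.

```python
import re

# Small fallback list, used only when the NLTK stopword corpus is unavailable.
FALLBACK_STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def load_stop_words():
    """Return NLTK's English stopwords, or the small fallback set above."""
    try:
        from nltk.corpus import stopwords  # needs nltk.download("stopwords") once
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        return set(FALLBACK_STOP_WORDS)


def clean_text(text):
    """Lower-case the text and replace punctuation and newlines with spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # assumption: keep letters and digits only
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(text, stop_words):
    """Drop every word that appears in `stop_words`."""
    return " ".join(word for word in text.split() if word not in stop_words)


if __name__ == "__main__":
    original = "Invoice Number:\n48213\nThe total is due in 30 days."
    cleaned = clean_text(original)  # `original` itself is left untouched
    print(remove_stop_words(cleaned, load_stop_words()))
```

Because both functions return new strings, the original text stays available and the result can be appended to a separate list. Note that stripping all punctuation would also split invoice numbers that contain characters such as `-` or `/`, so the pattern may need adjusting for such files.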

A word cloud is an effective way to see which words are repeated most often in the text: the bigger a word appears, the more frequently it occurs. It is easy to create one using the WordCloud class from the wordcloud package together with matplotlib.pyplot. We can adjust the size of the plot by changing figsize.

fig 7: WordCloud
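A rough sketch of the plotting code, assuming the wordcloud and matplotlib packages are installed. The function names and the default `figsize` are my own choices for illustration. `word_frequencies` is a plain-Python approximation of the counts that drive the word sizes; WordCloud applies its own filtering, so its numbers can differ slightly.

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each whitespace-separated word occurs."""
    return Counter(text.split())


def show_word_cloud(text, figsize=(10, 5)):
    """Draw a word cloud of `text`; `figsize` controls the plot size in inches."""
    import matplotlib.pyplot as plt  # third-party
    from wordcloud import WordCloud  # third-party

    cloud = WordCloud(background_color="white").generate(text)
    plt.figure(figsize=figsize)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()


if __name__ == "__main__":
    sample = "invoice number 48213 invoice date total invoice amount"
    print(word_frequencies(sample).most_common(3))
```

Calling `show_word_cloud(preprocessed_text, figsize=(16, 8))` would draw a larger plot of the same cloud.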

3. Data Frame

This is one of the crucial tasks. Here we pull the desired text out of the entire pre-processed text. I want to extract the invoice number from a particular pdf. In invoice files the invoice number generally appears right after the text “invoice number”, but the word “invoice” occurs more than once in most invoice files. So I collect, in a list, the five words beside every occurrence of the word “invoice” in the text. Among them there is only one invoice number, which can be found by calling the string method isnumeric() on every element of that list.

fig 8: Extracting invoice number from preprocessed text.
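A minimal sketch of this heuristic, under two assumptions of mine: the five words are taken from right after each occurrence of “invoice”, and the first purely numeric candidate is returned. The function name `find_invoice_number` is hypothetical.

```python
def find_invoice_number(text, window=5):
    """Return the first numeric word found within `window` words after 'invoice'."""
    words = text.split()
    candidates = []
    for index, word in enumerate(words):
        if word == "invoice":
            # Collect the next `window` words after this occurrence.
            candidates.extend(words[index + 1 : index + 1 + window])
    for candidate in candidates:
        if candidate.isnumeric():
            return candidate
    return None  # no numeric word close to 'invoice'


if __name__ == "__main__":
    sample = "tax invoice original invoice number 48213 date 12 may 2020"
    print(find_invoice_number(sample))
```

The heuristic is simple on purpose. If another number, such as a day of the month, falls inside the window before the real invoice number, it would be returned instead, and invoice numbers containing letters or dashes are not matched by `isnumeric()`.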

The invoice number that is found is added to a data frame that already contains columns such as filename, text and preprocessed text. In the same way, other details such as the tax id and GST number can be extracted and added to the data frame. The data frame is the final output of this project.

fig 9: DataFrame
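One way to assemble such a data frame is sketched below, assuming pandas is installed. The column names follow the ones mentioned above, but their exact spelling, and the helper names `make_record` and `to_dataframe`, are assumptions for illustration.

```python
COLUMNS = ["filename", "text", "preprocessed_text", "invoice_number"]


def make_record(filename, text, preprocessed_text, invoice_number):
    """Bundle the values extracted from one file into a row-like dictionary."""
    return dict(zip(COLUMNS, [filename, text, preprocessed_text, invoice_number]))


def to_dataframe(records):
    """Turn a list of records into a pandas DataFrame with a fixed column order."""
    import pandas as pd  # third-party

    return pd.DataFrame(records, columns=COLUMNS)
```

Collecting one record per file in a list and calling `to_dataframe(records)` once at the end keeps the loop over files simple. Further fields such as the tax id or GST number would become extra entries in `COLUMNS`.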

In this way we can extract the entire text from pdf or image files, or only certain fields from a file.

Hope you found this article useful.
