Revolutionize your Data Extraction Process with OCR and NLP

Takoua Saadani
Published in UBIAI NLP
4 min read · Jan 12, 2023

If you’ve ever wondered how you can automate data extraction from your goods receipts and shipment documents, then you’ve come to the right place.

In this article, we’ll explain how Natural Language Processing can quickly and easily extract data from semi-structured documents using OCR, labeling, and fine-tuning models.

The extraction of information from receipts and shipment documents can be divided into four major steps.

1. First, import the relevant documents (shipment documents, receipts) into the software.

2. Once Optical Character Recognition (OCR) has extracted the text, define the data you want to extract by annotating it in the uploaded documents.

3. After that, train an AI model to identify your data automatically.

4. Finally, export your annotated data in different formats or use it to train a model outside of the platform.

But, before we get there, let’s first define Optical Character Recognition (OCR) and why it’s crucial for automated data extraction from shipment documents and receipts.

What is OCR and how does it extract data?

OCR is a technology that detects text in digital images, most often scanned documents and photos. OCR software can convert handwritten text, physical paper documents, PDFs, or images into machine-readable text that can be processed, stored, edited, and used to train machine learning models, ideally with the original layout preserved.

OCR and NLP solutions can process scanned receipts and waybills efficiently and quickly while avoiding traditional constraints such as layouts or human errors, allowing supply chain companies to save time spent on manual verification while lowering processing costs.

Goods receipts

A goods receipt is a document associated with accounts payable in which the supplier of goods provides evidence that the goods have been received by the purchaser so that payment can be made to the supplier.

OCR combined with NLP lets you extract the many data points these documents contain, such as purchase order number, manufacturer’s serial numbers, delivery notes, bill of lading, customs documentation, card tender, cash tender, date, merchant address, name and phone number, receipt number, subtax, tax amount, and total amount.
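To make "extracting fields" concrete, here is a minimal rule-based sketch of what happens once OCR has turned a receipt into plain text. It is not the NLP model this article describes: the sample receipt text, the field names, and the regular expressions are all assumptions invented for illustration. Hand-written patterns like these break as soon as the layout changes, which is where a trained model helps.

```python
import re

# Hypothetical OCR output for a receipt (invented for this example).
OCR_TEXT = """ACME HARDWARE
12 Main Street
Receipt No: 004517
Date: 2023-01-12
Subtotal 40.00
Tax 3.20
Total 43.20
"""

# One pattern per field; each captures the value in group 1.
# These patterns are illustrative assumptions, not a general solution.
FIELD_PATTERNS = {
    "receipt_number": re.compile(r"Receipt No:\s*(\d+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "tax_amount": re.compile(r"^Tax\s+(\d+\.\d{2})$", re.MULTILINE),
    "total_amount": re.compile(r"^Total\s+(\d+\.\d{2})$", re.MULTILINE),
}


def extract_fields(text):
    """Return a dict of field name -> matched string (None when absent)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(OCR_TEXT)
print(fields)
```

Anchoring the tax and total patterns to the start of a line keeps "Subtotal 40.00" from being mistaken for the total.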

Loading and Transport Documents

Transport documents are contracts for the carriage of goods that are exchanged between various actors.

They vary depending on the mode of transportation, such as Bill of Lading, Sea Waybill, Consignment Note (CMR), Air Waybill (AWB), and Rail Consignment Note (CIM).

With the help of Natural Language Processing, OCR technology can be used to extract vehicle registration plates, trailer numbers, container numbers, driver’s licenses, and other information.

This will assist supply chain companies in ensuring that the correct delivery is loaded onto the correct vehicle or container and entered into the shipment document that comes with the vehicle.
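One way to catch OCR misreads before a container number reaches the shipment record is to verify its check digit. The sketch below assumes the container numbers follow the ISO 6346 format (four letters, six digits, one check digit). The article does not describe this validation step, so treat it as an illustrative post-processing idea, and the function names as hypothetical.

```python
import string


def _letter_values():
    """Map A-Z to 10..38, skipping multiples of 11 (ISO 6346 convention)."""
    values = {}
    value = 10
    for letter in string.ascii_uppercase:
        if value % 11 == 0:
            value += 1
        values[letter] = value
        value += 1
    return values


LETTER_VALUES = _letter_values()


def container_check_digit(first_ten):
    """Compute the check digit from the first 10 characters."""
    total = 0
    for position, char in enumerate(first_ten):
        char_value = LETTER_VALUES[char] if char.isalpha() else int(char)
        total += char_value * (2 ** position)
    # A remainder of 10 is written as 0.
    return total % 11 % 10


def is_valid_container_number(candidate):
    """True if candidate is 4 letters + 7 digits with a matching check digit."""
    code = candidate.replace(" ", "").upper()
    if len(code) != 11 or not code.isascii():
        return False
    if not code[:4].isalpha() or not code[4:].isdigit():
        return False
    return container_check_digit(code[:10]) == int(code[10])


print(is_valid_container_number("CSQU3054383"))
print(is_valid_container_number("CSQU3O54383"))  # letter O misread for zero
```

A value that fails the check can be routed to a human reviewer instead of being written into the shipment document.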

1. Uploading your documents

The first step, as mentioned above, is uploading the documents you want to extract data from into the software.

You can also upload and convert scanned or printed documents, PDFs, invoices, receipts, images, and other semi-structured documents into digital files that can be processed by a computer using OCR technology.

2. Annotating your text

Receipts usually have a line-by-line format, much like invoices and contracts, which is why we use OCR to read the text and then annotate the types of data we want to extract, such as merchant address, name and phone number, receipt number, tax amount, and total amount.

Using UBIAI can help you save time and effort since it supports annotation in multiple languages and includes several custom metadata types such as names, numbers, dates, etc.
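An annotation is commonly stored as a label attached to a character range in the OCR text. The sketch below shows that idea under stated assumptions: the `Span` class, the helper functions, and the label names are hypothetical, and UBIAI's internal format is not described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # index of the first character of the labeled text
    end: int     # index one past the last character
    label: str   # hypothetical label name, e.g. "TOTAL_AMOUNT"


def annotate(text, value, label):
    """Create a Span for the first occurrence of value in text."""
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in text")
    return Span(start, start + len(value), label)


def labeled_values(text, spans):
    """Return (label, surface text) pairs, rejecting overlapping spans."""
    ordered = sorted(spans, key=lambda s: s.start)
    for previous, current in zip(ordered, ordered[1:]):
        if current.start < previous.end:
            raise ValueError("overlapping spans")
    return [(s.label, text[s.start:s.end]) for s in ordered]


text = "Receipt No: 004517 Total 43.20"
spans = [
    annotate(text, "43.20", "TOTAL_AMOUNT"),
    annotate(text, "004517", "RECEIPT_NUMBER"),
]
pairs = labeled_values(text, spans)
print(pairs)
```

Storing offsets rather than copied strings keeps each label tied to an exact position, which matters when the same value appears twice on a receipt.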

3. Extracting Data

After you’ve annotated a few examples, you can use the Model-Assisted labeling feature to assist you in labeling the metadata in your documents.

First, the model will ask you to check the extracted data for errors and correct them.

After that, the model learns from its mistakes. It eventually improves and can function without supervision.

Video Tutorial on UBIAI’s model assisted labeling

4. Exporting Data

When the annotation work is complete, you can use UBIAI to export the annotated dataset in various formats (such as spaCy, IOB, and Amazon Comprehend) or download your custom-trained model with the click of a button.
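To show what an IOB export contains, here is a minimal sketch that converts character-offset annotations into token-level IOB tags. The `(start, end, label)` tuples resemble the offset style used in spaCy training data, but the function, the sample sentence, and the label names are assumptions for illustration, not UBIAI's actual exporter. It splits tokens on whitespace, which real tokenizers do not.

```python
import re


def spans_to_iob(text, entities):
    """Convert (start, end, label) character spans to (token, tag) pairs.

    Tokens are whitespace-separated, a simplification for this sketch.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in entities:
            if match.start() >= start and match.end() <= end:
                # A token starting exactly at the span start gets B-,
                # any later token inside the same span gets I-.
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Ship to ACME Hardware Total 43.20"
entities = [(8, 21, "MERCHANT_NAME"), (28, 33, "TOTAL_AMOUNT")]

for token, tag in spans_to_iob(text, entities):
    print(token, tag)
```

Each token outside every span is tagged `O`, so the output has one tag per token and can be fed to most sequence-labeling training scripts.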

Conclusion

With more advances in AI and its various subfields, optical character recognition is more accurate and efficient than ever before.

It is a significant improvement over the traditional system of manually classifying tons of receipts and shipment documents.

UBIAI can help you digitize documents, annotate your data, train and deploy AI models all on one platform.

It has enormous cost-saving and time-saving potential, and its convenience is becoming increasingly valued by various supply chain businesses.

Takoua Saadani
UBIAI NLP

MSc in Projects Management | Associate Structural Engineer | Marketer