How to Annotate PDFs and Scanned Images for NLP Applications using UBIAI Text Annotation Tool

Wiem Souai · Published in UBIAI NLP · 4 min read · Apr 8, 2024

Automating the retrieval of information from various documents like receipts, contracts, and invoices is pivotal for enhancing business efficiency and productivity while cutting costs. However, this transformative capability heavily relies on text annotation. While natural language processing (NLP) techniques such as Named Entity Recognition (NER) or relation extraction have been effectively utilized for information retrieval in unstructured text, deciphering structured documents like invoices, receipts, and contracts presents a more intricate challenge.

Primarily, the entities we aim to extract (e.g., price, seller, tax) lack substantial semantic context that could train an NLP model effectively. Moreover, the layout of documents often varies significantly from one instance to another, rendering traditional NLP tasks like NER less effective on structured documents. Nevertheless, structured texts, such as invoices, inherently contain valuable spatial information regarding the entities they encapsulate. Leveraging this spatial information enables the creation of a 2-D position embedding, signifying the relative position of tokens within a document.
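Concretely, LayoutLM represents each token's position on a 0–1000 grid regardless of the physical page size, so absolute bounding boxes must be normalized before they can feed the 2-D position embedding. A minimal sketch (the function name is illustrative):

```python
def normalize_box(box, page_width, page_height):
    """Scale an absolute (x0, y0, x1, y1) bounding box to the
    0-1000 grid LayoutLM uses for its 2-D position embedding."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# A token spanning the top-left quarter of a 612 x 792 pt PDF page:
print(normalize_box((61.2, 79.2, 306.0, 396.0), 612, 792))  # [100, 100, 500, 500]
```

Because every page is mapped to the same grid, tokens in the same relative position of two differently sized invoices get the same position embedding.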

Recently, Microsoft introduced LayoutLM, a groundbreaking model designed to simultaneously capture interactions between text and layout information within scanned document images. This innovative approach has yielded remarkable advancements in various downstream tasks, notably enhancing form understanding, receipt comprehension, and document image classification. For instance, LayoutLM has elevated performance metrics across the board, from form understanding (increasing from 70.72 to 79.27) to receipt understanding (rising from 94.02 to 95.24), and even document image classification (improving from 93.07 to 94.42), establishing new benchmarks in these domains.

Scanned Image and PDF Annotation

To fine-tune the LayoutLM model on custom invoices, you need annotated data that records each token's bounding-box coordinates and the links between tokens. This annotated data teaches the model both the layout and the content of the invoices. However, finding an annotation tool that seamlessly integrates OCR parsing and annotation directly on native PDFs and images can be challenging: many existing tools either come with a high price tag or lack comprehensive OCR support, forcing an external OCR step before annotation.
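To make the shape of such data concrete, here is an illustrative record (this schema is a simplified example, not UBIAI's exact export format): each token carries its text, bounding box, and label, and a relation links a key token to its value.

```python
import json

# Illustrative (not UBIAI's exact schema): two annotated invoice tokens with
# bounding boxes, plus a relation linking the key to its value.
record = json.loads("""
{
  "tokens": [
    {"id": 0, "text": "Total:", "box": [70, 640, 120, 655], "label": "TOTAL_KEY"},
    {"id": 1, "text": "$1,250.00", "box": [130, 640, 200, 655], "label": "TOTAL_VALUE"}
  ],
  "relations": [{"head": 0, "tail": 1, "type": "key_value"}]
}
""")

for rel in record["relations"]:
    head = record["tokens"][rel["head"]]["text"]
    tail = record["tokens"][rel["tail"]]["text"]
    print(f"{head} -> {tail} ({rel['type']})")  # Total: -> $1,250.00 (key_value)
```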

At UBIAI, we’ve developed an end-to-end solution to address this gap. Our platform enables direct annotation on native PDFs, scanned images, or images from your device without compromising on the integrity of document layout information. This capability is particularly advantageous for invoice extraction, where both text sequence and spatial information are crucial for accurate understanding.

Here’s how it works: Simply upload your PDF, JPG, or PNG files directly to our platform and commence annotation. Leveraging cutting-edge OCR technology from AWS Textract, UBIAI swiftly parses your document, extracting all tokens along with their respective bounding boxes. From there, annotators can effortlessly highlight tokens either on the original document (displayed in the right panel) or the parsed text (shown in the left panel) and assign appropriate labels.
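Under the hood, Textract returns a list of blocks, where each `WORD` block carries its text and a bounding box expressed as fractions of the page. A sketch of turning such a response into tokens with pixel boxes (the sample response is hardcoded for illustration; in practice it comes from the OCR call):

```python
# Sample of the shape of an AWS Textract DetectDocumentText response,
# hardcoded here for illustration.
response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "WORD", "Text": "Invoice",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05,
                                      "Width": 0.2, "Height": 0.03}}},
    ]
}

def words_with_boxes(response, page_w, page_h):
    """Convert Textract's fractional boxes to (text, pixel-box) pairs."""
    out = []
    for block in response["Blocks"]:
        if block["BlockType"] != "WORD":
            continue
        bb = block["Geometry"]["BoundingBox"]
        x0 = bb["Left"] * page_w
        y0 = bb["Top"] * page_h
        out.append((block["Text"],
                    (round(x0), round(y0),
                     round(x0 + bb["Width"] * page_w),
                     round(y0 + bb["Height"] * page_h))))
    return out

print(words_with_boxes(response, 1000, 1000))  # [('Invoice', (100, 50, 300, 80))]
```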

In addition to entity labeling, our platform supports relations annotation and text classification labeling, offering a comprehensive toolkit for fine-tuning the LayoutLM model to suit your specific needs. With UBIAI, annotating invoices becomes a streamlined and efficient process, ensuring optimal model performance and accurate extraction of crucial information.

Invoice Pre-annotation

You can also pre-annotate your invoices using dictionaries, regular expressions (for example, to find dates, emails, or names), or a pre-trained ML model.
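A minimal sketch of this kind of rule-based pre-annotation, combining a dictionary lookup with regexes (the patterns and entity labels are simplified examples, not UBIAI's built-in rules):

```python
import re

# Simplified example patterns for rule-based pre-annotation.
PATTERNS = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}
SELLER_DICT = {"Acme Corp"}  # hypothetical dictionary of known sellers

def pre_annotate(text):
    """Return (start, end, label) spans found by regexes and dictionary lookup."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    for name in SELLER_DICT:
        i = text.find(name)
        if i != -1:
            spans.append((i, i + len(name), "SELLER"))
    return sorted(spans)

text = "Invoice from Acme Corp dated 04/08/2024, billing@acme.com"
print(pre_annotate(text))
# [(13, 22, 'SELLER'), (29, 39, 'DATE'), (41, 57, 'EMAIL')]
```

Annotators then only need to review and correct these candidate spans instead of labeling every document from scratch.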

Annotation Export

Once you have finished annotating, simply export the annotated documents in JSON format.

Conclusion

UBIAI's OCR annotation feature streamlines the training of NLP models by offering a user-friendly, precise labeling interface. There is no need to pre-process your images with external APIs or hand-craft pre-annotation rules: upload your documents, annotate them, and export the annotated data.

By removing these preparatory steps, UBIAI lets you focus on refining your annotations and improving model performance rather than on tedious plumbing. Whether you are working with native PDFs, scanned images, or photos from your device, you can annotate with precision, export the results with a click, and feed the annotated data straight into your NLP pipeline to train models with confidence.
