unstructured library | Get the JSON and HTML versions of any PDF (legal, financial, medical…), even PDFs with tables!

Dec 15, 2023

Image credit: ChatGPT4 with the prompt “Creating a new office scene with three distinct computer monitors, each clearly displaying one of the following labels: ‘PDF’, ‘JSON’, and ‘HTML’. The setting is a professional and inviting office with at least one person working. Ensure that the labels are large and easily readable: the left monitor displays ‘PDF’, the middle monitor ‘JSON’, and the right monitor ‘HTML’. The monitors should clearly exhibit these labels, representing different data formats. The office environment is warm and welcoming, reflecting a productive and engaging workspace.”

Extracting structured content from PDFs, tables included, matters a great deal when the documents are meant to be used with Large Language Models (LLMs) like ChatGPT4 or Llama2, especially when there are many of them. The unstructured library processes both structured and unstructured data and converts it into formatted outputs suitable for tasks like Retrieval Augmented Generation (RAG). Turning PDFs into structured formats such as JSON or HTML with very little code is useful to any company or individual handling large quantities of PDFs, such as financial reports, legal and administrative documents, and medical records. This post gives a practical answer to that need by providing the code in a notebook.

Why do we need structured content from any PDF, even PDFs with table(s)?

Today, everyone is talking about using LLMs (Large Language Models) like ChatGPT4 or Llama2 to answer questions about documents (txt, ppt, docx, PDF…) or even to extract all the key information from them.

Indeed, all companies, administrations, organizations and the majority of individuals (such as lawyers, doctors, architects, etc.) produce documents and very often in PDF format.

The problem arises when we need to find specific information in these documents, especially when there are hundreds, thousands or more of them! The current trend is to extract the text from the PDFs and pass it to an LLM, which can then find and return the information sought. These are RAG (Retrieval Augmented Generation) systems.

However, the difficulty lies in how the content of the PDFs is extracted and then how it is presented to the LLM. Take the example of tables in PDFs. If a table's content is presented as a plain sequence of text, all structural information (rows and columns) has been lost. Conversely, if we manage to extract the table in a structured form (for example with HTML table tags), we can give the LLM not only all the textual data but also the relationships between the cells, and the fact that the data belongs to a table at all! This makes the LLM's search task much easier and improves the quality of its response.
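To make the contrast concrete, here is a minimal sketch using only the Python standard library. The table values and the names `to_html_table` and `TableReader` are invented for illustration; they are not taken from a real report or from the unstructured library. The same table is shown once as flat text and once as HTML, and the HTML version can be parsed back into rows.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical table, invented for illustration.
rows = [
    ["Year", "Revenue", "Net income"],
    ["2022", "120", "15"],
    ["2023", "135", "18"],
]

# Flat extraction: the cell values survive, the row/column boundaries do not.
flat_text = " ".join(cell for row in rows for cell in row)


def to_html_table(rows):
    """Render rows as an HTML table; the first row becomes the header."""
    header, *body = rows
    lines = ["<table>"]
    lines.append("<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>")
    for row in body:
        lines.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


class TableReader(HTMLParser):
    """Rebuild the list of rows from the HTML, showing the structure is recoverable."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


html_table = to_html_table(rows)
reader = TableReader()
reader.feed(html_table)
recovered = reader.rows  # same nested list as `rows`

print(flat_text)
print(html_table)
```

From `flat_text` alone, nothing says which number belongs to which year or column; from `html_table`, the rows and the header can be rebuilt exactly.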

About the unstructured library

unstructured is a library designed for handling both structured and unstructured data. In the context of data processing, “structured” data refers to information that adheres to a specific format or schema, such as databases, spreadsheets, tables and other organized forms. On the other hand, “unstructured” data lacks a specific format or organization, like text documents, emails, social media posts, and multimedia content.

The main goal of this library is to streamline the preprocessing of both types of data for various downstream tasks. Downstream tasks in data processing usually involve analysis, machine learning, data visualization, or other forms of information processing that require the data to be in a clean, organized format such as JSON or HTML.

For example, we need this data processing to feed clean, structured data into RAG (Retrieval Augmented Generation) systems or directly into an LLM (Large Language Model) such as ChatGPT4 or Llama2.
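As a rough sketch of that last step, the snippet below turns a few structured records into a context string for an LLM prompt. The records are hypothetical and only shaped like the kind of JSON discussed in this post: the field names (`type`, `text`, `metadata`, `page_number`, `text_as_html`) are an assumption to check against your own output, and `record_to_chunk` and `build_context` are illustrative helpers, not library functions.

```python
# Hypothetical element records (field names are an assumption, see above).
records = [
    {"type": "Title", "text": "Annual results", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Revenue grew in 2023.",
     "metadata": {"page_number": 1}},
    {"type": "Table", "text": "Year Revenue 2022 120 2023 135",
     "metadata": {
         "page_number": 2,
         "text_as_html": "<table><tr><th>Year</th><th>Revenue</th></tr>"
                         "<tr><td>2022</td><td>120</td></tr>"
                         "<tr><td>2023</td><td>135</td></tr></table>",
     }},
]


def record_to_chunk(record):
    """Format one record; tables use their HTML form when it is available."""
    metadata = record.get("metadata", {})
    if record.get("type") == "Table" and metadata.get("text_as_html"):
        body = metadata["text_as_html"]
    else:
        body = record.get("text", "")
    page = metadata.get("page_number", "?")
    return f"[page {page} | {record.get('type', 'Unknown')}]\n{body}"


def build_context(records, max_chars=2000):
    """Concatenate chunks in document order until the character budget is reached."""
    chunks, used = [], 0
    for record in records:
        chunk = record_to_chunk(record)
        if used + len(chunk) > max_chars:
            break
        chunks.append(chunk)
        used += len(chunk)
    return "\n\n".join(chunks)


context = build_context(records)
print(context)
```

A real RAG system would select chunks by similarity to the question rather than by document order; the point here is only that the table reaches the LLM with its rows and columns intact.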

In essence, the unstructured library is designed to:

Accept Data in Various Formats: It can process data regardless of where it is stored or in what format, be it structured databases or unstructured text files.
Transform and Preprocess Data: The library performs the necessary transformations and preprocessing steps to convert the data into a format, like JSON, that is more suitable for analysis and other downstream processes.
Ease of Use: By simplifying this process, the library helps users focus on the analysis or task at hand, rather than spending time and resources on data cleaning and preparation.
Versatility: It is useful for a wide range of applications, from business intelligence to machine learning, and domains (financial, medical, legal, administration…) where handling different types of data efficiently is crucial.

This kind of library is particularly valuable in the era of big data, where organizations often have to deal with massive and diverse datasets (for example, PDFs). By providing a streamlined way to preprocess this data, such tools can significantly reduce the time and effort required to derive insights and value from the data.

Use case with a financial report

Consider a financial report available on the Internet in PDF format. This type of document is full of important information on the activity of the company or organization, presented in numerous paragraphs… and many tables!

The objective is to obtain a JSON and/or HTML file of this content, i.e. structured content that preserves, in particular, all the structural information of the tables.

This is what the unstructured library allows you to do with very few lines of code!
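For orientation, here is a minimal sketch of what such a conversion might look like. It is not necessarily what the notebook does. It assumes the library is installed with its PDF extras (`pip install "unstructured[pdf]"`), relies on `partition_pdf` with the `hi_res` strategy and `infer_table_structure=True` so that table elements carry an HTML rendering in `metadata.text_as_html`, and uses `Element.to_dict()` to produce the JSON records. The helpers `pdf_to_records`, `records_to_html` and `convert`, as well as the file name in the usage comment, are hypothetical.

```python
import json
from html import escape
from pathlib import Path


def pdf_to_records(pdf_path):
    """Partition a PDF into a list of element dicts (needs the unstructured library)."""
    # Imported lazily so the rest of this sketch can be read and run without it.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=str(pdf_path),
        strategy="hi_res",           # layout-aware strategy, used here for tables
        infer_table_structure=True,  # ask for an HTML version of each table
    )
    return [element.to_dict() for element in elements]


def records_to_html(records):
    """Build a simple HTML page: tables keep their HTML, other text is escaped."""
    parts = ["<html><body>"]
    for record in records:
        kind = record.get("type", "")
        text = escape(record.get("text", ""))
        table_html = record.get("metadata", {}).get("text_as_html")
        if kind == "Table" and table_html:
            parts.append(table_html)
        elif kind == "Title":
            parts.append(f"<h2>{text}</h2>")
        else:
            parts.append(f"<p>{text}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


def convert(pdf_path, out_dir="."):
    """Write <name>.json and <name>.html next to each other and return the records."""
    records = pdf_to_records(pdf_path)
    stem = Path(pdf_path).stem
    out = Path(out_dir)
    (out / f"{stem}.json").write_text(json.dumps(records, indent=2), encoding="utf-8")
    (out / f"{stem}.html").write_text(records_to_html(records), encoding="utf-8")
    return records


# Usage (hypothetical file name):
# convert("financial_report.pdf")
```

The HTML layout chosen here (titles as `<h2>`, everything else as `<p>`) is an arbitrary design choice for the sketch; the important part is that table elements are written with their own HTML instead of their flattened text.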

Just download the notebook to be able to do it with any PDF:

language-models/Unstructured_PDF_to_JSON_and_HTML.ipynb at master · piegu/language-models (github.com)

You will find the original PDF, and the output files in JSON and HTML here:

Et voilà :-)

About the author: Pierre Guillou is an AI consultant (Generative AI & Deep Learning) in Brazil and France. Contact him via his LinkedIn profile.