Unstructured data ETL in 2024

Thibaut Gourdel
4 min read · Jul 3, 2024

Driven largely by the GenAI wave and the vast array of new use cases it opens up, unstructured data ETL has gained popularity over the last few months. In this article, I’ll go through the state of unstructured data ETL as of mid-2024 and touch on where we’re headed.

👟 A step back

Extracting data from unstructured sources, mostly document files, isn’t new; it has been around for decades. Many businesses relied, and still rely, on documents for purchase orders, receipts, legal documents, printed contracts, reports, you name it. Unfortunately, these documents (PDFs, Word files, etc.) vary in structure from one vendor to another, and even across industries and the norms they must comply with. Extracting the right information and structuring it logically has always been a challenge.

The first solutions relied on common parsing techniques and rapidly evolved to use OCR (Optical Character Recognition). Many services address this need, from vendor offerings such as Adobe PDF Services for Adobe’s own PDF format, to cloud providers like Amazon with Textract or Microsoft with Azure AI Document Intelligence. Many open-source libraries also offer different levels of parsing, from simple text extraction to full OCR engines like Tesseract, one of the most popular open-source OCR engines. Nowadays, most OCR engines rely on machine learning and deep learning: Tesseract, for example, introduced a neural-network (LSTM) based engine with the release of Tesseract 4 in 2018. Machine learning and deep learning brought a leap forward for OCR engines, providing better accuracy in extracting data from documents and converting images to text.
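
To make this concrete, here is a minimal OCR sketch using pytesseract, the popular Python wrapper around Tesseract. The file name is a placeholder, and the Tesseract binary must be installed locally:

    from PIL import Image  # pip install pillow pytesseract
    import pytesseract

    # Run Tesseract's neural-net-based OCR engine on a scanned document;
    # "invoice.png" is an illustrative placeholder for your own file.
    text = pytesseract.image_to_string(Image.open("invoice.png"))
    print(text)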

🌱 What’s new then?

What’s really new is the use of transformer-based models (e.g., Large Language Models) that are able to understand context and process documents. While OCR engines are now very capable of extracting the text and content of documents, we can use GenAI to summarize, classify, and even extract structured data from them. Challenges remain, however. LLMs understand text very well, but they can still struggle with the hierarchy of the text they process. Information hierarchy matters, and humans rely heavily on it: documents have titles, paragraphs, bold and italic sentences, etc., and we find the same hierarchy on websites through HTML. To maximize document understanding by LLMs, which is particularly useful for RAG pipelines, new tools have appeared to address this very issue. Let’s take a look at a few of them:

  • Unstructured: Naturally starting with Unstructured, which was early in providing unstructured data preprocessing for various ML tasks and, most recently, for RAG pipelines. What is very interesting about this library is that it natively breaks content into elements such as titles, narrative text, list items, tables, etc. (see the first sketch after this list). The drawback is that it’s somewhat of a black box: you rely on its predefined choices for hierarchizing the content, which has pros and cons. For most teams, though, simply going with Unstructured is a significant time-saver.
  • LangChain & LlamaIndex: The two most popular libraries for working with LLMs also include preprocessing components to split unstructured data and extract elements from HTML or Markdown, among others (see the second sketch below). However, you still need a good understanding of your document’s structure and of how the libraries work. This offers greater flexibility but requires more time, setup, and effort in general.
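
As a first sketch, here is roughly how Unstructured’s high-level partition function breaks a document into typed elements. The file name is a placeholder, and the exact categories you get back depend on the document:

    from unstructured.partition.auto import partition  # pip install "unstructured[pdf]"

    # partition() detects the file type and returns a list of typed elements
    # (Title, NarrativeText, ListItem, Table, ...) instead of one flat string
    elements = partition(filename="report.pdf")  # placeholder path

    for el in elements:
        print(el.category, "->", el.text[:60])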
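
The second sketch shows the more manual LangChain approach with its Markdown header splitter: you declare yourself which heading levels define the document’s hierarchy. The sample document and metadata keys are illustrative:

    from langchain_text_splitters import MarkdownHeaderTextSplitter  # pip install langchain-text-splitters

    markdown_doc = "# Annual report\n## Revenue\nRevenue grew 12%.\n## Risks\nSupply chain constraints remain."

    # You specify the hierarchy: which headers to split on and what
    # metadata key each level maps to in the resulting chunks
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "title"), ("##", "section")]
    )
    for doc in splitter.split_text(markdown_doc):
        print(doc.metadata, "->", doc.page_content)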

These libraries are meant to get your unstructured data ready to work with LLMs, but you can also use them alongside the document processing services mentioned earlier (Amazon Textract, Azure AI Document Intelligence). Many other frameworks and services are launching in this space, and like everything in AI, it’s evolving extremely fast.
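
For instance, once one of these tools has extracted the raw text, the GenAI step might look like the following sketch, which uses OpenAI’s JSON mode. The model name and the vendor/date/total schema are illustrative assumptions, not a fixed recipe:

    import json
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    document_text = "..."  # text produced by Textract, Unstructured, etc.

    # Ask the model to return structured data as JSON
    resp = client.chat.completions.create(
        model="gpt-4o",  # any JSON-mode-capable model works
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract vendor, date and total from the document as JSON."},
            {"role": "user", "content": document_text},
        ],
    )
    print(json.loads(resp.choices[0].message.content))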

Amphi ETL is a low-code, Python-based ETL tool for both structured and unstructured data. It lets you develop data pipelines graphically and generates Python code that you own and can deploy anywhere. Amphi provides input connectors for PDFs, Word files, and HTML, and includes RAG components such as chunking, embedding, and integration with vector stores. Amphi is free and open source, give it a try!
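
For reference, the chunk/embed/store steps such a pipeline performs look roughly like this in plain Python. This is an illustrative LangChain equivalent of those steps, not the code Amphi generates:

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings  # pip install langchain-openai
    from langchain_chroma import Chroma  # pip install langchain-chroma

    document_text = "..."  # text extracted from a PDF, Word file, or HTML page

    # 1. Chunk: split the text into overlapping pieces sized for embedding
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(document_text)

    # 2. Embed and 3. Store: vectorize the chunks and persist them in a
    # local Chroma vector store, ready for retrieval at query time
    store = Chroma.from_texts(chunks, OpenAIEmbeddings(), persist_directory="./rag_index")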

RAG pipeline with Amphi ETL

Github: https://github.com/amphi-ai/amphi-etl

Thibaut Gourdel

I write about data engineering and ETL. I'm building Amphi, a low-code, Python-based ETL tool for data manipulation and transformation.