
Document Parsing Using Large Language Models — With Code

You will not think about using Regular Expressions anymore

Zoumana Keita
Towards Data Science
14 min read · Jul 25, 2024


Motivation

For many years, regular expressions have been my go-to tool for parsing documents, and I am sure it has been the same for many other technical folks and industries.

Even though regular expressions are powerful and successful in some cases, they often struggle with the complexity and variability of real-world documents.

Large language models, on the other hand, provide a more powerful and flexible approach to handling many types of document structures and content types.

General Workflow of the system

It’s always good to have a clear understanding of the main components of the system being built. To make things simple, let’s focus on a scenario of research paper processing.

Documents Parsing Workflow With LLM (Author: Zoumana Keita)
  • The workflow has three main components: Input, Processing, and Output.
  • First, the documents (in this case, scientific research papers in PDF format) are submitted for processing.
  • The first module of the processing component extracts raw data from each PDF and combines it with the prompt containing instructions for the large language model to…
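
The first processing step described above (extract raw text from each PDF, then combine it with the instructions) can be sketched roughly as follows. This is a minimal illustration and not necessarily the code used in this article: the function names, the instruction text, and the fields it asks for are all hypothetical, and `pypdf` is only one of several libraries that could do the extraction.

```python
from typing import Callable, Iterable, List

# Hypothetical instruction text; the fields requested here are placeholders.
INSTRUCTIONS = (
    "Extract the title, authors, and abstract from the research paper below "
    "and return them as JSON."
)


def extract_raw_text(pdf_path: str) -> str:
    """Return the concatenated text of every page in a PDF.

    Assumes the third-party `pypdf` package is installed; any other
    extraction library could be swapped in.
    """
    from pypdf import PdfReader  # lazy import so the rest runs without pypdf

    reader = PdfReader(pdf_path)
    # Fall back to "" for pages where no text could be extracted.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def build_prompt(raw_text: str, instructions: str = INSTRUCTIONS) -> str:
    """Combine the instructions and the extracted paper text into one prompt."""
    return f"{instructions}\n\n--- PAPER TEXT ---\n{raw_text.strip()}"


def build_prompts(
    pdf_paths: Iterable[str],
    extractor: Callable[[str], str] = extract_raw_text,
) -> List[str]:
    """Build one prompt per submitted PDF, using the given text extractor."""
    return [build_prompt(extractor(path)) for path in pdf_paths]
```

Passing the extractor as a parameter keeps the prompt-building logic independent of the PDF library, so it can be exercised with a stub before any real documents or model calls are involved.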
