Parsing PDFs (text, images, and tables) for RAG-based applications using LlamaParse (LlamaIndex)

Salujav
Mar 20, 2024

In this article, we show how the recent LlamaParse reader update from LlamaIndex helps us extract important data from PDFs (in particular, numeric data) and present it to GPT-4 to build a complete PDF-based chatbot.

Introduction

The landscape of Artificial Intelligence has been transformed by the rapid advancements in Large Language Models (LLMs), offering unprecedented capabilities in natural language understanding and generation. OpenAI's GPT models have taken the lead in this new era of language comprehension and generation, thanks to their remarkable abilities honed on vast amounts of online data. These models have expanded our possibilities, allowing us to engage with AI-powered systems in unprecedented ways. Nevertheless, like any technological marvel, LLMs have their own limitations to consider. One notable issue is their occasional tendency to provide information that is inaccurate or outdated. Furthermore, these LLMs do not provide the sources of their responses, making it challenging to ascertain the reliability of their output. This limitation becomes particularly crucial in contexts where accuracy and traceability are of utmost importance.

Advancements in AI have facilitated significant progress in simplifying the consumption of vast amounts of data for users, primarily through basic question answering. By utilizing applications based on RAG (Retrieval-Augmented Generation) models, we can develop comprehensive chat systems that incorporate PDFs into the conversation. These systems enable users to interact with AI-powered chatbots that can provide information and address queries based on the contents of PDF documents. This approach leverages the power of AI to enhance access to information in a user-friendly manner.

Why RAG?

Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.

A simple RAG-based system for document question answering

Retrieval-augmented generation (RAG) has been developed to enhance the quality of responses generated by large language models (LLMs). RAG achieves this by incorporating external sources of knowledge to complement the internal representation of information within the LLM. By implementing RAG in a question answering system based on LLMs, two primary advantages can be observed. Firstly, it ensures that the model has access to up-to-date and reliable facts, thereby improving the accuracy of its responses. Secondly, it enables users to access the sources used by the model, allowing them to verify the claims made by the system and establish trust in its output.
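To make the mechanics concrete, here is a toy sketch of the retrieve-then-generate loop. The word-overlap retriever and the prompt-returning generate() stub are illustrative stand-ins, not any library's API; a real system would use embedding search and a GPT-4 call.

# Toy sketch of the RAG loop: retrieve relevant text, then ground the answer in it.
def retrieve(question, documents, top_k=2):
    # Rank documents by how many question words they share (toy scoring).
    q = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def generate(question, context):
    # A real system would send this prompt to an LLM and return its answer.
    return "Answer using only this context:\n" + "\n".join(context) + "\nQ: " + question

docs = ["Revenue grew 12% in 2023.", "Headcount rose to 4,000 employees."]
print(generate("How much did revenue grow?", retrieve("How much did revenue grow?", docs)))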

While RAG delivers on this promise, it is important to note that applying it to entirely unstructured data, such as a wide range of web pages or PDFs, is likely to yield unsatisfactory results. Ideally, the most effective approach involves working with structured data. However, in numerous situations, RAG is employed precisely because structured data is either unavailable or accessing it is not feasible.

Let's consider the scenario where we have one or more PDFs containing vast amounts of data in tables or figures, such as a financial, sustainability, or employee report of a global company or a cluster of companies. There have been many advancements in this area from the AI community, such as UnstructuredIO, the Adobe PDF Extract API, and, most recently, the LlamaParse API from LlamaIndex.

LlamaParse

LlamaParse is a state-of-the-art parser designed to specifically unlock RAG over complex PDFs with embedded tables and charts.

Their proprietary parsing service has been developed to excel in parsing PDFs containing intricate tables, converting them into a meticulously structured markdown format. This specific representation seamlessly integrates with the open-source library’s sophisticated Markdown parsing and recursive retrieval algorithms. As a result, it becomes possible to construct a retrieval-augmented generation (RAG) system that can effectively answer questions pertaining to both tabular and unstructured data within complex documents.
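As an illustration (not actual LlamaParse output), a table buried in a PDF might come back as clean markdown like the snippet below, which a markdown-aware parser can then split into discrete table objects:

| Department | Headcount | Avg. engagement score |
| --- | --- | --- |
| Clinical | 120 | 4.2 |
| Support | 85 | 3.9 |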

Here’s how to get started with LlamaParse:

1. Install the necessary dependencies:

pip install -U llama-index --no-cache-dir --force-reinstall
pip install llama-parse

LlamaIndex's new integrations with LangChain also allow a mixed setup, where LlamaParse serves only as the file reader and the rest of the RAG system is built with other APIs.
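As a minimal sketch of that mixed setup (the file path is a placeholder, and the splitter settings are arbitrary choices for this example):

from llama_parse import LlamaParse
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Parse the PDF with LlamaParse, then hand the text over to LangChain.
parser = LlamaParse(api_key="(Your API key here)", result_type="markdown")
parsed = parser.load_data("./report.pdf")  # placeholder path

# Wrap each parsed document as a LangChain Document and chunk it.
lc_docs = [Document(page_content=d.text) for d in parsed]
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(lc_docs)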

Here, however, we will use LlamaIndex end to end to build our index and retrieval.

LlamaParse runs on LlamaCloud for its processing, which also leverages API caching, so re-uploading the same PDF after the first time takes significantly less time to parse.

2. Load and parse the report:

For this article we are going to use the sample Apollo Staff development report.

import nest_asyncio
nest_asyncio.apply()  # allow nested event loops (needed in notebooks)

from llama_parse import LlamaParse  # pip install llama-parse
from llama_index.core import SimpleDirectoryReader  # pip install llama-index

parser = LlamaParse(
    api_key="(Your API key here)",
    result_type="markdown",  # also supports "text"
)

# Parse the PDF asynchronously into a list of Document objects.
documents = await parser.aload_data('/sample report')
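Note that SimpleDirectoryReader, imported above, offers an alternative way to wire in the parser: pass LlamaParse as a file extractor so every PDF in a folder goes through it. A minimal sketch (the directory path is a placeholder):

# Alternative: route every PDF in a directory through LlamaParse.
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_dir="./reports",  # placeholder directory
    file_extractor=file_extractor,
).load_data()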

3. As the parsed output contains everything (text, tables, images, etc.) in markdown form, we will use the MarkdownElementNodeParser, which stores the markdown information in nodes.

from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI  # pip install llama-index-llms-openai

# The LLM is used to summarize and extract the tables found in the markdown.
node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"), num_workers=4)

nodes = node_parser.get_nodes_from_documents(documents=[documents[0]])
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
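A quick, optional sanity check on what the parser produced:

# Plain-text nodes vs. extracted table/object nodes
print(f"{len(base_nodes)} base nodes, {len(objects)} object nodes")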

4. Now let's set up our vector store using the base nodes and objects. We will use the VectorStoreIndex from LlamaIndex for this purpose.

from llama_index.core import VectorStoreIndex

# Index the text nodes and table objects together so both are retrievable.
recursive_index = VectorStoreIndex(nodes=base_nodes + objects)

5. Now let's build and test our query engine.
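The exact query lives in the notebook linked below; as a minimal sketch (the question string and the similarity_top_k value are placeholder choices for this example):

# With object nodes in the index, LlamaIndex can recursively retrieve
# into the parsed tables when answering.
recursive_query_engine = recursive_index.as_query_engine(similarity_top_k=15)

response = recursive_query_engine.query(
    "What score did staff give, and what do the colors in the chart represent?"
)
print(response)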

[Notebook screenshot: the query response and its source nodes]

It seems that, apart from the score, the system also recognizes the colors in the chart and what they represent.

You can find the complete code in the notebook below.

https://colab.research.google.com/drive/1aUPywCH92XLNpdjkmXz3ff8H-QnT2JHZ?usp=sharing

Conclusion

The field of Artificial Intelligence is continuously evolving and heading in a promising direction. Developers today want to build their products as efficiently as possible, using the latest APIs, and parsing and contextualizing documents are fundamental components of any RAG application.

There is no single right or wrong RAG pipeline. Every case is different, and it is up to you to decide which approach is best for your use case.

I hope this article helps you build your next optimized RAG pipeline with the help of LlamaIndex.

Thank you for reading. If you have any questions or a keen interest in new AI technologies, follow me on LinkedIn:

https://www.linkedin.com/in/vansh-saluja-3112b9162/
