Tabular Data, RAG, & LLMs: Improve Results Through Data Table Prompting

Intel Tech · May 14, 2024

How to ingest small tabular data when working with LLMs.


By Eduardo Rojas Oviedo with Ezequiel Lanza

Say you’re a financial analyst working for an investment firm. Your job involves staying ahead of market trends and identifying potential investment opportunities for your clients, who are often curious about the world’s richest people and their sources of wealth. You might consider using a retrieval augmented generation (RAG) system to easily and quickly identify market trends, investment opportunities, and economic risks as well as answer questions like, “Which industry has the highest number of billionaires?” or “How does the gender distribution of billionaires compare across different regions?”

Your first step is to go to the source to get that information. However, as you do an initial inspection, you discover a possible obstacle. The document contains not only text but also TABLES!

Image 1: A table ranking the world’s billionaires for 2023, extracted from Wikipedia (https://en.wikipedia.org/wiki/The_World%27s_Billionaires)

In this post, we’ll demonstrate how to use a large language model (LLM) to consume tabular data in different formats.

LLMs and the Challenge of Structured Data

For us humans, it’s straightforward to connect the dots between the text and the table. But for LLMs, it’s like trying to solve a puzzle without all the pieces.

LLMs are designed to process text in a sequential manner, meaning they read and understand information one word or sentence at a time. This sequential processing is how they’ve been trained on vast amounts of text data. However, tables present information in a different format. They organize data into rows and columns, which creates a multidimensional structure.

Understanding this structure requires a different approach for LLMs than sequential text. LLMs need to recognize patterns across rows and columns, understand relationships between different data points, and interpret the meaning of headers and cell values. Because LLMs are primarily trained on sequential text, they might struggle to interpret and process tabular data. It’s like asking someone who’s used to reading novels to suddenly understand and interpret graphs or charts; they might find it challenging because it’s outside their usual mode of processing information.

Common Examples of Tabular Data

We encounter tabular data more often than we might think. Below are three scenarios, ranging from small to medium to large:

  1. Small: Every day, users of our RAG platform need to support their queries with documents that contain small embedded tables, as in the example shown in Image 1, where the document contains mostly text and some tables. An industry report on global sales is another good example of small tabular data: it features textual analysis alongside tables detailing sales figures, often broken down by region and manufacturer. We can’t remove that tabular data to make the document easier for our model to consume because it provides critical context.
  2. Medium: For some situations, you may need to analyze a greater amount of tabular data, such as spreadsheets, CSV files (comma-separated values), or data preformatted by our preferred reporting tools, such as Power BI. Say, for example, you’re working on a marketing team at a retail company, which is analyzing sales data from the past quarter to identify trends and opportunities for growth. This data would likely come in the form of multiple database files containing information on product sales, customer demographics, and regional sales performance.
  3. Large: A third scenario in which our users might wish to perform data analysis involves transaction databases and multidimensional datasets, such as OLAP cubes, which offer advantages over spreadsheets for analyzing large amounts of complex data. OLAP provides fast query performance, support for complex calculations, and the ability to slice and dice data across multiple dimensions. OLAP cubes are better suited to the complexity and volume of data typically found in transaction databases or other large datasets, which require an understanding of the data’s domain. For example, if you’re a retail chain analyzing data to enhance sales performance, you’ll need SQL for querying and OLAP cubes for analysis to garner insights from sales trends and customer segmentation to inform inventory management and marketing strategies. Tables at this scale are challenging to analyze and understand.

In this article, we’ll explore the scenario where tabular data is embedded within textual information. We’ll work with small tables containing a dozen rows and a few columns, all within the model’s context window capacity.
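Before going further, it’s worth verifying that a serialized table actually fits within that capacity. Here’s a minimal sketch using the tiktoken library (the 8,192-token limit is an assumption based on the classic GPT-4 window; adjust it for your deployment):

import tiktoken

def fits_context(serialized_table: str, model: str = "gpt-4", limit: int = 8192) -> bool:
    # Count the tokens the serialized table would consume in the prompt
    encoding = tiktoken.encoding_for_model(model)
    n_tokens = len(encoding.encode(serialized_table))
    print(f"Table uses {n_tokens} of {limit} tokens")
    return n_tokens < limit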

Let’s See It in Action

Our example will be based on Image 1, the world’s billionaires list, which we’ll feed into our future RAG framework to answer questions about that list. In just four steps, we’ll demonstrate how to use tabular data as context for our users’ questions and even verify whether it’s possible to make the model incur errors.

Prepare our environment and libraries

import os 
import sys
import pandas as pd
from typing import List

To facilitate the tabular display of the results, we’ll use the BeautifulTable library, which provides a visually attractive format in a terminal (see the beautifultable documentation).

!pip install beautifultable
from beautifultable import BeautifulTable
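As a quick taste of the API before we use it in earnest later on, a one-row table prints like this (the values here are purely illustrative):

demo = BeautifulTable()
demo.columns.header = ["Data Format", "Answer"]
demo.rows.append(["JSON", "Elon Musk's net worth is $180 billion."])  # illustrative values
print(demo)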

Since we will only be working with tables, let’s explore the capabilities of camelot-py, another useful Python library. Follow the camelot installation instructions for its dependencies, including Ghostscript, which is available from Ghostscript Downloads.

!pip install opencv-python
!pip install camelot-py
!pip install ghostscript

import camelot

Prepare our data

We will save the list (the world’s billionaires) as a PDF to facilitate our exploration.

file_path="./World_Billionaires_Wikipedia.pdf"
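If you’d rather script this step than print the page manually, one possible approach uses the pdfkit library (an assumption on our part: pdfkit and its wkhtmltopdf binary are installed; any other way of producing the PDF works just as well):

import pdfkit  # assumes the wkhtmltopdf binary is installed on the system

pdfkit.from_url(
    "https://en.wikipedia.org/wiki/The_World%27s_Billionaires",
    file_path,
)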

The code below performs basic cleaning of the data once camelot has extracted it as a Pandas dataframe. This routine processes the pages in a preselected list one by one and gives us a structure that is easy to manipulate.

Note: The below code was adapted by the author from Recursive Retriever + Query Engine Demo

# Use camelot to parse tables from the selected pages
def get_tables(path: str, pages: List[int]):
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        if table_list.n > 0:
            for tab in range(table_list.n):

                # Convert the extracted table into a dataframe
                table_df = table_list[tab].df

                # Promote the first row to column headers
                table_df = (
                    table_df.rename(columns=table_df.iloc[0])
                    .drop(table_df.index[0])
                    .reset_index(drop=True)
                )

                # Remove line breaks inside cell values
                table_df = table_df.apply(lambda x: x.str.replace('\n', ''))

                # Change column names to be valid as XML tags
                table_df.columns = [col.replace('\n', ' ').replace(' ', '') for col in table_df.columns]
                table_df.columns = [col.replace('(', '').replace(')', '') for col in table_df.columns]

    # Return the last table processed (our target page contains one table)
    return table_df

# extract data table from page number
df = get_tables(file_path, pages=[3])
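Before moving on, it’s worth eyeballing the result, and if anything looks off, camelot’s per-table parsing report can help diagnose the extraction (a quick optional check; accuracy is camelot’s own detection score):

print(df.head())

# Optional: inspect camelot's extraction quality metrics for the page
report = camelot.read_pdf(file_path, pages="3")[0].parsing_report
print(report)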

Now, let’s convert our tabular data from the dataframe into multiple formats, such as JSON, CSV, and Markdown, among others.

# prepare test set
eval_df = pd.DataFrame(columns=["Data Format", "Data raw"]) # , "Question", "Answer"

# Save the data in JSON format
data_json = df.to_json(orient='records')
eval_df.loc[len(eval_df)] = ["JSON", data_json]

# Save the data as a list of dictionaries
data_list_dict = df.to_dict(orient='records')
eval_df.loc[len(eval_df)] = ["DICT", data_list_dict]

# Save the data in CSV format
csv_data = df.to_csv(index=False)
eval_df.loc[len(eval_df)] = ["CSV", csv_data]

# Save the data in tab-separated format
tsv_data = df.to_csv(index=False, sep='\t')
eval_df.loc[len(eval_df)] = ["TSV (tab-separated)", tsv_data]

# Save the data in HTML format
html_data = df.to_html(index=False)
eval_df.loc[len(eval_df)] = ["HTML", html_data]

# Save the data in LaTeX format
latex_data = df.to_latex(index=False)
eval_df.loc[len(eval_df)] = ["LaTeX", latex_data]

# Save the data in Markdown format
markdown_data = df.to_markdown(index=False)
eval_df.loc[len(eval_df)] = ["Markdown", markdown_data]

# Save the data as a string
string_data = df.to_string(index=False)
eval_df.loc[len(eval_df)] = ["STRING", string_data]

# Save the data as a NumPy array
numpy_data = df.to_numpy()
eval_df.loc[len(eval_df)] = ["NumPy", numpy_data]

# Save the data in XML format
xml_data = df.to_xml(index=False)
eval_df.loc[len(eval_df)] = ["XML", xml_data]
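As a design note, the same test set can be built more compactly by mapping each format name to its serializer, which makes formats easy to add or drop later. Here’s a sketch equivalent to a subset of the code above (using a separate eval_df_alt so as not to clobber our dataset):

serializers = {
    "JSON": lambda d: d.to_json(orient='records'),
    "CSV": lambda d: d.to_csv(index=False),
    "Markdown": lambda d: d.to_markdown(index=False),
    "XML": lambda d: d.to_xml(index=False),
}
eval_df_alt = pd.DataFrame(
    [(name, serialize(df)) for name, serialize in serializers.items()],
    columns=["Data Format", "Data raw"],
)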

It’s time to explore our test data. We’ve configured a dataset where each row represents an output format of the dataframe, and the “Data raw” column holds the tabular data we’ll use as context for the generative model.

from pandas import option_context
with option_context('display.max_colwidth', 150):
    display(eval_df.head(10))

Output:

Image 2: A code output that showcases the raw data for each text format

Set up our model for validation

Let’s prepare a basic prompt that allows us to interact with the context data.

MESSAGE_SYSTEM_CONTENT = """You are a customer service agent that helps a customer with answering questions. 
Please answer the question based on the provided context below.
Make sure not to make any changes to the context, if possible, when preparing answers to provide accurate responses.
If the answer cannot be found in context, just politely say that you do not know, do not try to make up an answer."""

Before carrying out our tests with the tabular dataset, we will need to prepare our model’s connection settings (in this example, we’ll use AzureOpenAI). You’ll need to provide your credentials.

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=OAI_API_Key,
    api_version=OAI_API_Version,
    azure_endpoint=OAI_API_Base)

def response_test(question: str, context: str, model: str = "gpt-4"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": MESSAGE_SYSTEM_CONTENT,
            },
            {"role": "user", "content": question},
            {"role": "assistant", "content": context},
        ],
    )

    return response.choices[0].message.content
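A single smoke test confirms the wiring before we loop over the whole dataset (the question is illustrative, and data_json comes from the conversion step above):

# Smoke test: one question against the JSON-serialized table
answer = response_test("Which industry has the highest number of billionaires?", data_json)
print(answer)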

Since we’re working with a dataset where each row represents an individual unit of context information, we’ve implemented the following iteration routine to process one row after another and store the model’s interpretation of each one.

def run_question_test(query: str, eval_df: pd.DataFrame):

    questions = []
    answers = []

    for index, row in eval_df.iterrows():
        questions.append(query)
        response = response_test(query, str(row['Data raw']))
        answers.append(response)

    eval_df['Question'] = questions
    eval_df['Answer'] = answers

    return eval_df

def BeautifulTableformat(query: str, results: pd.DataFrame, MaxWidth: int = 250):
    table = BeautifulTable(maxwidth=MaxWidth, default_alignment=BeautifulTable.ALIGN_LEFT)
    table.columns.header = ["Data Format", "Query", "Answer"]
    for index, row in results.iterrows():
        table.rows.append([row['Data Format'], query, row['Answer']])

    return table

Let’s have FUN!

Now, we’ll connect the dots: we’ll process the dataset, obtaining an answer with each of the tabular data formats as context information, and then display the results in a tabular manner.

query = "What's the Elon Musk's net worth?"
result_df1 = run_question_test(query, eval_df.copy())
table = BeautifulTableformat(query, result_df1, 150)
print(table)

Output:

Image 3 depicts a table outlining the responses provided by a model for diverse input text formats to the inquiry, “What is Elon Musk’s net worth?”

We can see that the question, “What’s Elon Musk's net worth?” was answered consistently for each tabular data format produced during the Pandas dataframe conversion. Because each answer is a free-form semantic elaboration, the variation between responses is a challenge we must account for when generating final validation metrics.

We could also obtain more concise or elaborate responses if we modify our “MESSAGE_SYSTEM_CONTENT” variable.
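For example, a stricter variant that pins down answer length might look like the sketch below; the exact wording is a matter of prompt-engineering taste, not a recommendation from this experiment:

MESSAGE_SYSTEM_CONTENT_CONCISE = """You are a customer service agent that helps a customer with answering questions.
Answer only from the provided context, in a single short sentence, quoting figures exactly as they appear.
If the answer cannot be found in the context, just politely say that you do not know."""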

Let’s repeat the exercise once again, this time with a question that requires more analytical reasoning for our model.

query = "What's the sixth richest billionaire in 2023 net worth?"
result_df2 = run_question_test(query, eval_df.copy())
table = BeautifulTableformat(query, result_df2, 150)
print(table)

Output:

Image 4 displays a table presenting the responses provided by a model to various formats of input text for the question, “What is the net worth of the sixth wealthiest billionaire in 2023?”

As with the previous example, we have used each of the Pandas dataframe export formats as context for the query.

In this example, the question "What's the sixth richest billionaire in 2023 net worth?" proves that the model can respond to something more abstract, such as "the sixth richest billionaire," which involves greater analytical reasoning and tabular data calculation. Both challenges were resolved with excellent consistency.
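We can also double-check that answer against the dataframe itself. A minimal sketch, assuming the extracted table is the ranked 2023 list sorted by rank so that row index 5 holds the sixth entry (the cleaned column names depend on the extraction, so inspect them first):

print(df.columns)   # inspect the cleaned header names first
sixth = df.iloc[5]  # sixth row, assuming the table is sorted by rank
print(sixth)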

Relevancy and Distraction

Let’s play around with the model and check that our prompt and data context work as we expect.

query = "What's the Michael Jordan net worth?"
result_df3 = run_question_test(query, eval_df.copy())
table = BeautifulTableformat(query, result_df3, 150)
print(table)

Output:

Image 5 showcases a table illustrating the responses generated by a model for different input text formats in response to the question, “What is Michael Jordan’s net worth?”

With this test, we’ve posed a question that has no correct answer in the provided context. Our goal is to ensure that our model does not respond with a hallucination or a false positive (a false response that seems true). The question “What’s Michael Jordan’s net worth?” was handled consistently for each data format, as we expected: the model acknowledged it had no answer.
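A quick programmatic pass over the answers makes this check repeatable, flagging any format where the model did not clearly decline (a rough heuristic; the keyword list is an assumption about how the model words a refusal):

refusal_markers = ["do not know", "don't know", "cannot find", "not in the context"]
for _, row in result_df3.iterrows():
    declined = any(marker in row['Answer'].lower() for marker in refusal_markers)
    if not declined:
        print(f"Check format {row['Data Format']} for a possible hallucination: {row['Answer']}")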

Let’s provide another example, which would possibly mislead an unsuspecting user, by using a name that significantly resembles one existing in the tabular data.

query = "What's Michael Musk's net worth?"
result_df4 = run_question_test(query, eval_df.copy())
table = BeautifulTableformat(query, result_df4, 180)
print(table)
Output:

Image 6 presents a table delineating the responses generated by a model across various input text formats in answer to the question, “What is Michael Musk’s net worth?”

With the question “What’s Michael Musk's net worth?” where the surname “Musk” could lead us to misinterpret the question, the model nevertheless solved the challenge satisfactorily.

Conclusion

By ingesting small tabular data from textual documents, we’ve seen how an LLM can understand the context of a table even when we tried to trick it with questions containing wrong information. It’s evident that preserving structural integrity while extracting contextually embedded tables matters: these tables often contain critical contextual information that enhances comprehension of the surrounding text and yields more accurate results.

When focusing on small embedded tables, it’s essential to recognize their significance in providing contextual clues to your RAG framework. Leveraging libraries like Camelot for table extraction preserves these structures. Maintaining relevance without distraction still poses challenges, as our model testing demonstrated, but by preserving and supplying these tables we give models like GPT the essential context they need to generate accurate and contextually relevant responses within broader textual contexts.

Call to Action

If you want to explore more, a number of alternative libraries facilitate the extraction of tabular data, such as Tabula, pdfplumber, pdftables, and pdf-table-extract. In “Comparison with other PDF Table Extraction libraries and tools,” Vinayak Mehta developed a comparative performance evaluation of the listed libraries that you might find useful.
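As a taste of one alternative, here’s a rough pdfplumber equivalent of our extraction step (a sketch; pdfplumber’s pages are zero-indexed, and its table-detection settings usually need tuning per document):

import pdfplumber

with pdfplumber.open(file_path) as pdf:
    rows = pdf.pages[2].extract_table()  # page 3 of the PDF, zero-indexed
    if rows:
        print(rows[:3])  # header plus the first two data rows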

About the Authors

Eduardo Rojas Oviedo, Platform Engineer, Intel

Eduardo Rojas Oviedo is a dedicated RAG developer within Intel’s dynamic and innovative team. Specialized in cutting-edge developer tools for AI, Machine Learning, and NLP, he is passionate about leveraging technology to create impactful solutions. Eduardo’s expertise lies in building robust and innovative applications that push the boundaries of what’s possible in the realm of artificial intelligence. His commitment to sharing knowledge and advancing technology drives his ongoing pursuit of excellence in the field.

Ezequiel Lanza, Open Source AI Evangelist, Intel

Ezequiel Lanza is an open source AI evangelist on Intel’s Open Ecosystem team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X at @eze_lanza and LinkedIn at /eze_lanza

Follow us!

Medium, Podcast, Open.intel, X, LinkedIn
