PandasDataFrame and LLMs: Navigating the Data Exploration Odyssey

Sandeep Shah
6 min read · Feb 10, 2024


For the last few months I have been experimenting with RAG and exploring llama_index. My progress has been slow, and most of it has gone into optimizing prompts and vector searches. I experiment with Llama 2 models, and recently I started a basic exploration of CodeLlama. In this post I want to show how we can use LLMs inside a Python notebook, directly on our pandas DataFrames, and then talk with our data. Many blogs and videos have covered this already, so for some readers it may be a repeat.

After connecting with a single DataFrame, I want to work with more than one DataFrame, and I will show some of the challenges and the results I obtained. We will be using PandasQueryEngine from llama_index.

Dataset: I used ChatGPT to create a basic dataset of customer and order tables, as shown below.

import pandas as pd
import numpy as np

# Generate customer data
customer_data = {
    'Customer_ID': range(1, 51),
    'Name': ['Customer_' + str(i) for i in range(1, 51)],
    'City': ['City_' + str(np.random.randint(1, 11)) for _ in range(50)]
}
customers_df = pd.DataFrame(customer_data)

# Generate order data
order_data = {
    'Customer_ID': np.random.choice(range(1, 51), size=500),
    'Product_Name': np.random.choice(['Product_A', 'Product_B', 'Product_C', 'Product_D', 'Product_E'], size=500),
    'Order_Quantity': np.random.randint(1, 11, size=500),
    'Price_Per_Item': np.random.randint(10, 101, size=500),  # assuming a price range from 10 to 100
}

# Calculate total price for each order
order_data['Total_Price'] = order_data['Order_Quantity'] * order_data['Price_Per_Item']

orders_df = pd.DataFrame(order_data)

# Save dataframes to CSV files
customers_df.to_csv('customer_info.csv', index=False)
orders_df.to_csv('order_details.csv', index=False)
Preview of dataset

By default llama_index uses OpenAI, but I wanted to run CodeLlama locally. We can do that by defining a service context as shown below. The link to the complete code is at the end of the post, and there you can see how to load the model and other details.

# imports assume the older (pre-0.10) llama_index layout used at the time of writing
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import HuggingFaceEmbedding

embeddings = HuggingFaceEmbedding()
service_context = ServiceContext.from_defaults(
    llm=llm, chunk_size=1024, embed_model=embeddings,
    callback_manager=callback_manager)
set_global_service_context(service_context)
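
For reference, below is a minimal sketch of how a local GGUF model can be loaded through llama_index's LlamaCPP wrapper; the file path and parameters are illustrative, not the exact values from my notebook.

from llama_index.llms import LlamaCPP

# illustrative settings; adjust the path, context window and GPU offload for your setup
llm = LlamaCPP(
    model_path="./codellama-7b-instruct.Q8_0.gguf",
    temperature=0.1,
    max_new_tokens=512,
    context_window=3900,
    model_kwargs={"n_gpu_layers": -1},  # offload layers to GPU if available
    verbose=False,
)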

Next we will run a simple query on a single DataFrame. I am using orders_df, and you can see we pass it as an argument to PandasQueryEngine. Sometimes we have to babysit the model and tell it the exact steps or output we need; this depends largely on the model being used. For example, in the image below, if I don't write the second line, 'I want the product name as output', then it just gives the maximum price. I think after a few prompts one gets used to it and also learns that the generated code may fail. Let us do some plotting after the small sketch below.
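
A minimal sketch of such a single-DataFrame query, assuming the older llama_index import path I was using; the prompt wording is just an example.

from llama_index.query_engine import PandasQueryEngine

query_engine = PandasQueryEngine(df=orders_df, verbose=True)
response = query_engine.query(
    "Which order has the maximum total price? "
    "I want the product name as output."
)
print(response)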

Basic Plots on single dataframe — orders_df

Next, I tried more complex plots, and I had to do a few iterations to get what I wanted. Initially I was getting the bars for different products next to each other, so after giving more details I was able to stack the bars for each product one over the other. I tried percentages too: for each customer, out of the total products they bought, what percentage was Product_A, Product_B, and so on. I could get it right just once and I don't have an image for that.

Trying out complex plotting — you can see the generated query at top
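
For illustration, the kind of pandas expression I was eventually nudging the model towards looks roughly like this; it is my own reconstruction, not the engine's exact output.

import matplotlib.pyplot as plt

# quantity of each product per customer, stacked one product over the other
(orders_df.groupby(['Customer_ID', 'Product_Name'])['Order_Quantity']
 .sum()
 .unstack(fill_value=0)
 .plot(kind='bar', stacked=True, figsize=(12, 4)))
plt.ylabel('Total quantity ordered')
plt.show()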

Here comes my favourite part: using two DataFrames. The way I approached it was to write the details of both tables in the prompt every time, i.e. I had a fixed block of information that I appended at the start of each query. You can see the details below. Although I passed df=orders_df as an argument, I didn't actually use it. I faced a big problem here: whenever the generated query referred to customers_df, it threw an error saying it can't access the DataFrame. I read a bit and found out that llama_index's PandasQueryEngine evaluates the generated code with a restricted safe_eval, and the way everything is structured means it can't see the other DataFrames and variables in our workspace. So whenever the generated code failed, I took the generated query and ran it with eval() myself to get the answer.

instruction_str = (
    "1. PRINT ONLY THE EXPRESSION.\n"
    "2. Do not quote the expression.\n"
)

query_engine = PandasQueryEngine(df=orders_df, instruction_str=instruction_str,
                                 verbose=False)

basic_info = (
    "I have two pandas df and details are below. "
    "Generate python code for the query asked and give only the query as the output.\n"
    "-------------------------------------------\n"
    "dataframe 1\n"
    "orders_df and it has columns\n"
    "'Customer_ID': unique ID for each customer,\n"
    "'Product_Name': unique name or type of the product,\n"
    "'Order_Quantity': number of items of a particular product ordered,\n"
    "'Price_Per_Item': price of a single item,\n"
    "'Total_Price': Price_Per_Item * Order_Quantity.\n"
    "Dataframe 2\n"
    "customers_df and it has columns\n"
    "'Customer_ID': unique ID for each customer,\n"
    "'Name': name of the customer,\n"
    "'City': city that the customer lives in and it is also the city of the order.\n"
    "--------------------------------------------\n"
    "Generate python code for the following query:\n"
)
Error while accessing more than one dataframe and the method to get around it using eval()
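
As a rough sketch, the workaround looked like this, assuming the query_engine and basic_info defined above; the question and the generated expression are illustrative stand-ins.

question = "Total amount spent by customers in each city"
response = query_engine.query(basic_info + question)

# With verbose output the engine shows the pandas expression it generated.
# When that expression references customers_df the engine itself fails,
# so I copied the expression and evaluated it in the notebook scope instead:
generated = ("orders_df.merge(customers_df, on='Customer_ID')"
             ".groupby('City')['Total_Price'].sum()")
result = eval(generated)  # caution: only run code you have inspected
print(result)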

Now, be extra cautious: you can land in trouble using eval(), since you need to be sure about what you are running inside the function. That is one reason this approach is not recommended for production applications. Nevertheless, I do believe that if you are using it within your own system, where only you or your teammates use it and nothing is sent outside, then it should be safe and stable.

I will end with plots on the combined DataFrames. Below you can see how the bars were initially placed next to each other, and how I then had to write additional instructions to make them stack one over the other. You can definitely issue commands other than plots: ask it to create a new DataFrame and much more. The associated code file has some of these examples. Plots just make the blog visually attractive, and they also make it much easier to see and appreciate the differences arising from different prompts.

Plots after combining two tables.
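
The combined-table plots boil down to a merge followed by a grouped, stacked bar chart; the snippet below is my own reconstruction of that pattern, not the engine's exact output.

import matplotlib.pyplot as plt

# join orders with customer info, then plot total spend per city split by product
merged_df = orders_df.merge(customers_df, on='Customer_ID')
(merged_df.groupby(['City', 'Product_Name'])['Total_Price']
 .sum()
 .unstack(fill_value=0)
 .plot(kind='bar', stacked=True, figsize=(12, 4)))
plt.ylabel('Total spend')
plt.show()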

I hope this was useful to you. By the way, there is also something called PandasAI that does a similar task; I am yet to explore it. Please get back to me:
1. if you have found a way to pass more than one DataFrame, or local variables, to PandasQueryEngine more effectively and securely;
2. I would love to hear how you use LLMs to analyse data in a Python environment;
3. with any feedback, suggestions, or just to appreciate the post.
Looking forward to connecting with the coders!

Code can be found here: https://github.com/SandyShah/llama_index_experiments. It is not optimized, but it contains the different prompts I tried.
Model used: codellama-7b-instruct.Q8_0.gguf, which can be downloaded from https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF
