How to use AI Agents to Analyze and Process CSV Data: A Comprehensive Guide

Cubode Team
8 min readJun 27, 2024

--

Have you ever wondered how AI agents understand tabulated data, such as those in CSVs or Excel files? Have you tried loading a CSV to Chat GPT, and it automatically understands the file and can begin to process the information? Well, in this article, we’ll be building this from scratch.

Context

Ben here 👋, Cubode’s AI Engineer (well, I technically lead the product team, but I find myself doing many things!). We are building an AI agent in 30 days that, upon the upload of a data file, generates bespoke interactive charts to help visualize that data.

To do this, we needed to ensure the agent understands the data within the file, so it can generate the correct charts to use.

But for the agent that we are creating, we expect it to suggest the type of chart to use itself. Therefore to be able to do that, it first needs to understand what type of data is within the file, and for that, we need to create augmentation.

Here are the main steps we decided to break this down into:

  1. Create Context
  2. Augmentation of Prompt
  3. Code Generation
  4. Code Execution
This is the flow we decided to aim for to enable AI systems to ingest CSV data and aggregate it

This is what we did 👇

SetUp

First, we need to do some house keeping with some package installs and defining the LLM we will use.

%pip install langchain_openai
%pip install langchain_core
%pip install langchain_community
%pip install langchain_experimental
from langchain_openai.chat_models import ChatOpenAI

openai_key = os.environ.get("OPENAI_API")
gpt4o = ChatOpenAI(temperature=0.0, model="gpt-4o", openai_api_key=openai_key)

1. Context Creation

Before we can generate the code to process the raw data, we first need to create context around the dataset to augment our prompt. Creating context enables the language model to craft a more accurate response to a query. In this case, we extracted the following metadata from the dataset:

  1. Number of Columns: Dimensionality
  2. Schema: Column names
  3. Data Types: Data types for each column
  4. Sample: One sample row of the dataset

For this example, we’ll use a wine reviews dataset which can be found here 👈.

This is the code for the metadata extraction.

import pandas as pd
def extract_metadata(df):
metadata = {}

# Number of columns
metadata['Number of Columns'] = df.shape[1]

# Column names
metadata['Schema'] = df.columns.tolist()

# Data types of each column
metadata['Data Types'] = str(df.dtypes)

# Summary statistics
metadata['Sample'] = df.head(1).to_dict(orient="records")

return metadata
df = pd.read_csv("wine_reviews.csv")
metadata = extract_metadata(df)

The metadata creation looks something like this 👇.

{
'Schema': ['country',
'description',
'designation',
'points',
'price',
'province',
'region_1',
'region_2',
'variety',
'winery'
],
'Data Types': 'country object
description object
designation object
points int64
nprice float64
province object
region_1 object
region_2 object
variety object
winery object',
'Sample': [{ 'country': 'US',
'description': 'This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.',
'designation': "Martha's Vineyard",
'points': 96,
'price': 235.0,
'province': 'California',
'region_1': 'Napa Valley',
'region_2': 'Napa',
'variety': 'Cabernet Sauvignon',
'winery': 'Heitz'
}]
}

2. Augmentation of Prompt

To help the language model understand the context of the dataset, we needed to augment the prompt with the extracted metadata.

We chose to do this by creating prompt templates and using the metadata as variables. This allowed us to format the metadata (schema, data types, and sample) into the prompt template.

Here is an example of the prompt template we used 👇.

prompt_template = '''
Assistant is an AI model that takes in metadata from a dataset
and suggests charts to use to visualise that data.

New Input: Suggest 2 charts to visualise data from a dataset with the following metadata.


SCHEMA:

--------

{schema}

DATA TYPES:

--------

{data_types}

SAMPLE:

--------

{sample}

'''.format(schema = metadata["Schema"], data_types = metadata["Data Types"], sample=metadata["Sample"])

gpt40.invoke(prompt_template)
'Assistant: 
Based on the metadata, I would suggest the following two charts to visualize the data:
**Chart 1: Bar Chart - Top 10 Wineries by Average Points**
* X-axis: Winery names
* Y-axis: Average points
* This chart would help identify the top-performing wineries in terms of average points received. It would be interesting to see which wineries consistently produce high-scoring wines.
**Chart 2: Scatter Plot - Price vs. Points**
* X-axis: Price
* Y-axis: Points
* This chart would help visualize the relationship between the price of a wine and its quality (as measured by points). It could reveal interesting patterns, such as whether higher-priced wines tend to receive higher points or if there are any outliers that defy this trend.'

3. Example: Simple Agent Code Generation & Execution

Through the prompt augmentation with the extracted metadata from the wine reviews dataset, the LLM was able to suggest two charts to use to visualize some data. It correctly selected the chart types with the correct column types as well, as it suggested a Bar Chart to plot a category (Winery Name) against the average of a numerical column (Points).

But to actually create the plot, we needed to take the raw dataset and aggregate it, to find the average Points for each Winery.

For this, we iterated on an agentic workflow with a Python REPL tool. Sometimes, for complex calculations, rather than have an LLM generate the answer directly, it can be better to have the LLM generate code to calculate the answer, and then run that code to get the answer. This is what a REPL tool does.

So, we needed two things:

  1. Prompt the LLM to generate code to do the data aggregation
  2. Execute that code and return the aggregated data

Here’s how we did it 👇

Create Tools

First, we created a REPL instance. By setting the global parameters within the REPL instance, we were able to pass in the raw data to the instance. Reffering to the raw data a variable named ‘df’, the generated code would then reflect this naming convention when it is executed.

This is how we chose to pass a Pandas DataFrame to a Python REPL instance 💪.

from langchain_experimental.utilities import PythonREPL

# Create an instance of PythonREPL
repl = PythonREPL()

# Pass the dataframe into the globals dictionary of the PythonREPL instance
repl.globals['df'] = df

Then using LangChain, we can create the tools for the LLM.

from typing import Annotated
from langchain_core.tools import tool

@tool
def python_repl(
code: Annotated[str, "The python code to execute to generate your chart."]
):
"""Use this to execute python code. If you want to see the output of a value,
you should print it out with `print(...)`. This is visible to the user."""
try:
result = repl.run(code)
except BaseException as e:
return f"Failed to execute. Error: {repr(e)}"
result_str = f"Successfully executed:\n```python\n{code}\n```\nStdout: {result}"
return (
result_str + "\n\nIf you have completed all tasks, respond with FINAL ANSWER."
)

tools = [python_repl]

Create Prompts

Second step was to create the prompts for the agent. We first set the instructions (also can be known as the system prompt). This gives the agent initial context of its purpose.

Within this prompt, we included context around the variable df (which was reffered to in the set up of the REPL tool). Instructing the model that the variable df has already been readforces the model not to create its own version of the dataframe within the code (which we noticed it did if not instructed).

instructions_template = '''

You are an agent that writes and excutes python code

You have access to a Python abstract REPL, which you can use to execute the python code.

You must write the python code code assuming that the dataframe (stored as df) has already been read.

If you get an error, debug your code and try again.

You might know the answer without running any code, but you should still run the code to get the answer.

If it does not seem like you can write code to answer the question, just return "I don't know" as the answer.

Do not create example dataframes
'''

Then we used a ReAct base prompt template for the agent and used partial variables to format in the instructions above. We also format in the new input, which we will use to invoke the agent.

base_template = '''

{instructions_template}

TOOLS:

------

You have access to the following tools:

{tools}

To use a tool, please use the following format:

```

Thought: Do I need to use a tool? Yes

Action: the action to take, should be one of [{tool_names}]

Action Input: the input to the action

Observation: the result of the action

```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```

Thought: Do I need to use a tool? No

Final Answer: [your response here]

```

Begin!

Previous conversation history:

{chat_history}

New input: {input}

{agent_scratchpad}
'''
from langchain_core.prompts import PromptTemplate

base_prompt = PromptTemplate(template=base_template, input_variables=['agent_scratchpad', 'input', 'instructions', 'tool_names', 'tools'])
base_prompt = base_prompt.partial(instructions_template=instructions_template) #format the instructions

Create Agent

The agent can then be defined with the model, tools and base prompt. For an input, we took the output from 2. Augmentation of Prompt (see above) and prompted the agent to create a dataset for the Bar Chart. This is the prompt we chose to use for invoking the agent. 👇

new_input = '''
Create the data needed for the following chart

CHART

--------
Chart 1: Bar Chart - Top 10 Wineries by Average Points

X-axis: Winery names

Y-axis: Average points\n*

You're task is to write code to transform the raw data into the desired data structure in json to be included in the plot.

Only return the data.

Here is the dataset schema:

{schema}

'''.format(schema=metadata['Schema'])
from langchain.agents import AgentExecutor, create_react_agent

agent = create_react_agent(gpt4o, tools, base_prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
agent_out = agent_executor.invoke(
{
"input": new_input,
"chat_history": ""
}
)

This is what the agent produced. It uses the ReAct framework to decide it needs to write code first, then use the REPL tool to execute that code and return the final data structure for our plot.

> Entering new AgentExecutor chain...
```

Thought: Do I need to use a tool? Yes

Action: python_repl

Action Input:

```python
# Grouping the dataframe by 'winery' and calculating the average points
top_wineries = df.groupby('winery')['points'].mean().sort_values(ascending=False).head(10)

# Converting the result to a dictionary
top_wineries_dict = top_wineries.to_dict()

print(top_wineries_dict)

```
Successfully executed:

```python
# Grouping the dataframe by 'winery' and calculating the average points
top_wineries = df.groupby('winery')['points'].mean().sort_values(ascending=False).head(10)

# Converting the result to a dictionary
top_wineries_dict = top_wineries.to_dict()

print(top_wineries_dict)

```

```
Stdout: {'Macauley': 96.0, 'Heitz': 95.5, 'Bodega Carmen Rodríguez': 95.5, 'Maurodos': 95.0, 'Blue Farm': 95.0, 'Numanthia': 95.0, 'Château Lagrézette': 95.0, 'Patricia Green Cellars': 95.0, 'Ponzi': 95.0, 'Muga': 95.0}


If you have completed all tasks, respond with FINAL ANSWER.```

Final Answer: {'Macauley': 96.0, 'Heitz': 95.5, 'Bodega Carmen Rodríguez': 95.5, 'Maurodos': 95.0, 'Blue Farm': 95.0, 'Numanthia': 95.0, 'Château Lagrézette': 95.0, 'Patricia Green Cellars': 95.0, 'Ponzi': 95.0, 'Muga': 95.0}

```

> Finished chain.

The final data structure which shows the Top 10 Winerys by Average Points looks something like this. This data is now aggregated to be able to be plotted in a Bar Chart.

Final Answer: {'Macauley': 96.0, 
'Heitz': 95.5,
'Bodega Carmen Rodríguez': 95.5,
'Maurodos': 95.0, 'Blue Farm': 95.0,
'Numanthia': 95.0,
'Château Lagrézette': 95.0,
'Patricia Green Cellars': 95.0,
'Ponzi': 95.0,
'Muga': 95.0}

Recap

In this article I covered how we have build an end to end flow for an Agent to ingest CSV data, suggest a visualisation to use and then write the code and execture the code to processes and aggregate the data ready to be plotted 🙌.

If you have any questions let us know in the comments 👇.

Stay curious 🤝!

Here is me writing this article!

--

--

Cubode Team

We're the founders of Cubode; an early stage startup betting on Agentic workflows. We build deterministic tools for AI Agents, released no code & open source.