Crafting a Graph-Generating Data Analysis Chatbot with OpenAI Assistant API

Alozie Igbokwe
14 min read · Mar 23, 2024


Welcome to Part 3 of our series, where we dive into the fascinating world of creating a data analysis chatbot capable of generating images through code interpretation. This installment builds upon the foundational skills covered in Part 1 and Part 2, which focused on setting up retrieval functionality and the intricacies of function calling with OpenAI Assistant. If you haven’t explored those initial segments yet, I highly recommend reading them first to fully grasp the techniques that will be essential for today’s journey.

In this article, we’ll transition seamlessly into the realm of using code interpretation to empower our chatbot, enabling it to leverage OpenAI Assistant for sophisticated data analysis tasks. Prepare to unlock the potential of automating graphical data insights, all through the innovative application of OpenAI’s capabilities.

Part 1 : Intro to Building a Chatbot for Data Analysis and Visualization

As previously discussed, our objective is to develop a chatbot capable of analyzing statistical data from Excel spreadsheets and generating graphs in response to user queries.

This is achieved through the integration of OpenAI’s code interpretation capabilities. Code interpreters usually facilitate this process by employing the matplotlib library, which is renowned for its ability to create a wide range of static, interactive, and animated visualizations.

This is done in conjunction with pandas, a comprehensive tool for data manipulation and analysis.

Together, these libraries empower the chatbot to not only interpret the data but also to construct and display graphs that accurately represent the user’s requests.

Below is an example of the kind of code the code interpreter generated and executed to produce graphs from the data I gave it:

import json
import matplotlib.pyplot as plt

# Load the data from the uploaded file
with open('/mnt/data/file-GOl7W1MgPiqsNZy1k3AKwxNB', 'r') as file:
    data = json.load(file)

# Extract the deal owners and the corresponding deal amounts
# Note: Deal amounts are strings with commas, so they need to be converted to float
deal_owners = [deal['Deal Owner'] for deal in data]
deal_amounts = [float(deal['Deal Amount'].replace(',', '')) for deal in data]

# Combine deal owners and amounts into a dictionary to sum amounts for each owner
deal_amounts_by_owner = {}
for owner, amount in zip(deal_owners, deal_amounts):
    deal_amounts_by_owner[owner] = deal_amounts_by_owner.get(owner, 0) + amount

# Sort the owners and amounts for plotting
sorted_owners = sorted(deal_amounts_by_owner.keys())
sorted_amounts = [deal_amounts_by_owner[owner] for owner in sorted_owners]

# Plot the bar graph
plt.figure(figsize=(10, 6))
plt.bar(sorted_owners, sorted_amounts, color='skyblue')
plt.xlabel('Deal Owner', fontsize=14)
plt.ylabel('Total Deal Amount ($)', fontsize=14)
plt.title('Total Deal Amount by Owner', fontsize=16)
plt.xticks(rotation=45)
plt.tight_layout()

# Save the plot to a file
plt.savefig('/mnt/data/deal_amounts_by_owner.png')
plt.show()
…..

Part 2 : Breakdown of What the Data We Feed into OpenAI Looks Like

Now let’s delve into the methodology for harnessing the Assistant API to generate graphs:

Initially, it’s imperative to supply OpenAI with the data file it will analyze. My first attempt used an Excel file, but I ran into challenges with OpenAI directly accessing and interpreting the data for graph generation. It could still create graphs from the Excel file, but there were intermittent latency issues, especially when using the API.

The core of the problem seems to stem from the data being translated into a format that is difficult for the code interpreter/AI assistant to process effectively. Here is an example of the challenging format encountered when the AI assistant attempts to read directly from an Excel file using the code interpreter:

- Unnamed: 0    Unnamed: 1              Unnamed: 2  Unnamed: 3 Unnamed: 4  \
0 NaN Sales Sheet NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN Deal Orders Company Status Contact
4 NaN Deal 0001 Supercharge Car Parts Proposal David
5 NaN Deal 002 Marketing Gen Evaulation Patrick
6 NaN Deal 003 Consulting D Evaulation Walter
7 NaN Deal 004 Real Esate Empire Qualifed Myron
8 NaN Deal 005 Skin Care Ult Win Kim
9 NaN Deal 006 Botpress Negotation Carl
10 NaN Deal 007 R Toy Lose Tamn
    Unnamed: 5                                         Unnamed: 6  \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 Deal Owner Features client wants implemented/Version Used
4 Alozie Version 1: Basic Version(Mostly no code implem...
5 Ryan Vesion 4: Enterprise Verion (hook up with var...
6 Cole Version 1: Basic Version(Mostly no code implem...
7 Ryan Version 2: Advanced Version(would require cust...
8 Cole Version 3: Complex Version (GPT Vision pytorc...
9 Alozie Version 3: ComplexVersion(Need to build a hig...
10 Ryan Version 3: Complex Version(Need to build high ...
Unnamed: 7 Unnamed: 8 \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 Deal Amount Expenses(on the company's side)
4 8,000 3000
5 30,000 16000
6 3,000 200
7 17,000 4000
8 28,350 18300
9 22,000 8000
10 26,000 8500
Unnamed: 9 Unnamed: 10 \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 Probablity of the Deal(in %) Weighted forecast
4 82 6560
5 55 8800
6 65 1950
7 93 15800
8 100 18300
9 78 17160
10 0 0
Unnamed: 11 Unnamed: 12 \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 Referral Potential(sclae from 1-10 target close date
4 7 Nov 28th 2023
5 9 Dec 15th 2023
6 3 Nov 31th 2023
7 6 Dec 8th
8 5 Nov 22th 2023
9 7 Nov 29th 2023
10 40 Nov 22th 2023
Unnamed: 13 Unnamed: 14
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 Projected time it took to close Last Interaction
4 3 weeks Nov 23th 2023
5 5 weeks Nov 19th 2023
6 2 weeks Nov 22th 2023
7 3.5 weeks Nov 18th 2023
8 5 weeks Nov 22th 2023
9 4.5 weeks Nov 21th 2023
10 3 weeks Nov 22th 2023

This phenomenon occurs due to how pandas, a Python data manipulation library, interprets Excel files.

Specifically because:

  • NaN values: These indicate the absence of data within the dataset. Excel treats any cell without content as empty, and when pandas imports the file, it represents those empty cells as NaN to denote missing values.
  • Unnamed columns: These arise when an Excel sheet features columns without header names. In such cases, pandas automatically assigns a label of ‘Unnamed: X’ to these columns, where ‘X’ is the column’s zero-based index. This can happen under two main conditions:

  • The Excel sheet has empty cells in the header row (the first row by default, or the row specified by the ‘header’ parameter of the pd.read_excel() function).
  • The dataset includes one or more initial rows that pandas interprets as part of the header, due to either the sheet’s formatting or the value passed to the ‘header’ parameter.
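To make both failure modes concrete, here is a minimal, self-contained sketch. It uses an in-memory CSV (illustrative data, not my actual spreadsheet) rather than a real Excel file, but pandas applies the same header logic in both cases:

```python
import io

import pandas as pd

# A sheet whose first header cell is blank, mimicking an Excel layout
# with a decorative empty column before the real data.
raw = (
    ",Deal Orders,Company\n"
    ",Deal 0001,Supercharge Car Parts\n"
    ",Deal 002,Marketing Gen\n"
)

df = pd.read_csv(io.StringIO(raw))

# The blank header cell becomes an auto-generated 'Unnamed: 0' column...
print(df.columns.tolist())  # ['Unnamed: 0', 'Deal Orders', 'Company']

# ...and its empty cells are imported as NaN (missing values).
print(df['Unnamed: 0'].isna().all())  # True
```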

Solutions like defining headers explicitly can improve the readability and transformation of the data, and OpenAI’s code interpreter was able to work this out on its own.

Even so, I went ahead and preprocessed the Excel file before ingestion by the AI assistant, both to ensure data integrity and clarity and to improve latency.

In hindsight, I should also have experimented with other file types, such as CSV, to see whether that changed anything.

Part 3 : Cleaning up Statistical Data for AI Consumption

To overcome these challenges, I added code to transform the data from the Excel spreadsheet into a cleaner JSON format and write it to a text file. This makes the data more accessible for processing by the AI Assistant and easier to ingest in general.

Loading the Sales Data from Excel:

excel_file = 'salessheet.xlsx'
data = pd.read_excel(excel_file, sheet_name='Sales Data', header=3)

The process begins by loading sales data from the ‘salessheet.xlsx’ file, targeting the ‘Sales Data’ sheet. The header=3 parameter tells pandas to use the fourth row (pandas is zero-indexed) as the header, accommodating any preliminary headers or metadata.

Converting DataFrame to a JSON String:

json_str = data.to_json(orient='records')

After loading into a pandas DataFrame, the data is converted into a JSON string representation with to_json(orient='records'). This organizes the data as a list of records, making it easily interpretable.

Parsing JSON String to Python Object:

data_list = json.loads(json_str)

This JSON string is then parsed back into a Python object — a list of dictionaries — where each dictionary represents a row from the Excel file, and keys are the column names.

Creating a Dynamic Mapping for Unnamed Columns:

header_mapping = data_list[0]
key_mapping = {
    f'Unnamed: {i}': header_mapping[f'Unnamed: {i}']
    for i in range(len(header_mapping))
    if header_mapping[f'Unnamed: {i}'] is not None
}

A crucial step involves creating a dynamic mapping for any unnamed columns, such as those labeled by pandas as ‘Unnamed: X’. This mapping uses the first row of the data, assumed to be the header row, to replace ‘Unnamed’ labels with meaningful names.

Cleaning the Data and Applying Dynamic Mapping:

cleaned_data = []
for item in data_list[1:]:  # Skip the first row as it's used for mapping
    cleaned_item = {}
    for key, value in item.items():
        new_key = key_mapping.get(key, key)
        if new_key is not None and value is not None:  # Exclude keys and values that are None
            cleaned_item[new_key.replace('Unnamed: ', '')] = value
    cleaned_data.append(cleaned_item)

The data is cleaned, applying the dynamic mapping to each row (excluding the first row used for mapping). This process ensures the JSON output contains only relevant and correctly labeled data.

Writing the Cleaned JSON Data to a Text File:

json_data = json.dumps(cleaned_data, indent=4)
print(json_data)

# Optionally, write this list of dictionaries to a text file
with open('output.txt', 'w') as file:
    file.write(json_data)

Finally, the cleaned and structured JSON data is optionally written to an ‘output.txt’ file. This process enhances data usability by transforming it into a format understandable by JSON-compatible systems and performing essential data cleaning and re-mapping.
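Putting the steps above together, the cleaning stage can be sketched as a single reusable function. This is my own consolidation of the snippets, written to operate on an already-loaded DataFrame so it can be exercised without an Excel file:

```python
import json

import pandas as pd

def clean_records(df: pd.DataFrame) -> list:
    """Apply the cleaning steps above to an already-loaded DataFrame."""
    # Round-trip through JSON so every row becomes a plain dict
    records = json.loads(df.to_json(orient="records"))

    # The first record holds the real column names for the 'Unnamed: X' keys
    header_mapping = records[0]
    key_mapping = {k: v for k, v in header_mapping.items() if v is not None}

    cleaned = []
    for item in records[1:]:  # skip the header row used for the mapping
        cleaned_item = {}
        for key, value in item.items():
            new_key = key_mapping.get(key, key)
            if new_key is not None and value is not None:
                cleaned_item[new_key] = value
        cleaned.append(cleaned_item)
    return cleaned

# Illustrative frame mimicking pandas' auto-generated column labels
df = pd.DataFrame(
    [["Deal Orders", "Company"], ["Deal 0001", "Supercharge Car Parts"]],
    columns=["Unnamed: 1", "Unnamed: 2"],
)
print(clean_records(df))
# [{'Deal Orders': 'Deal 0001', 'Company': 'Supercharge Car Parts'}]
```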

Here is how my data looks now:

[
    {
        "Deal Orders": "Deal 0001",
        "Company": "Supercharge Car Parts ",
        "Status ": "Proposal",
        "Contact": "David ",
        "Deal Owner": "Alozie",
        "Features client wants implemented/Version Used": "Version 1: Basic Version(Mostly no code implementation)",
        "Deal Amount": "8,000",
        "Expenses(on the company's side)": 3000,
        "Probability of the Deal(in %)": 82,
        "Weighted forecast": 6560,
        "Referral Potential(scale from 1-10)": 7,
        "target close date": "Nov 28th 2023",
        "Projected time it took to close": "3 weeks",
        "Last Interaction": "Nov 23th 2023"
    },
    {
        "Deal Orders": "Deal 002",
        "Company": "Marketing Gen",
        "Status ": "Evaluation",
        "Contact": "Patrick",
        "Deal Owner": "Ryan",
        "Features client wants implemented/Version Used": "Vesion 4: Enterprise Verion (hook up with various apis web scrape a bunch of data and find high level databases with info needed)",
        "Deal Amount": "30,000",
        "Expenses(on the company's side)": 16000,
        "Probability of the Deal(in %)": 55,
        "Weighted forecast": 8800,
        "Referral Potential(scale from 1-10)": 9,
        "target close date": "Dec 15th 2023",
        "Projected time it took to close": "5 weeks",
        "Last Interaction": "Nov 19th 2023"
    },
    {
        "Deal Orders": "Deal 003",
        "Company": "Consulting D ",
        "Status ": "Evaluation",
        "Contact": "Walter ",
        "Deal Owner": "Cole",
        "Features client wants implemented/Version Used": "Version 1: Basic Version(Mostly no code implementation)",
        "Deal Amount": "3,000",
        "Expenses(on the company's side)": 200,
        "Probability of the Deal(in %)": 65,
        "Weighted forecast": 1950,
        "Referral Potential(scale from 1-10)": 3,
        "target close date": "Nov 31th 2023",
        "Projected time it took to close": "2 weeks",
        "Last Interaction": "Nov 22th 2023"
    },
    {
        "Deal Orders": "Deal 004",
        "Company": "Real Estate Empire ",
        "Status ": "Qualified",
        "Contact": "Myron",
        "Deal Owner": "Ryan ",
        "Features client wants implemented/Version Used": "Version 2: Advanced Version(would require custom code to integrate with api",
        "Deal Amount": "17,000",
        "Expenses(on the company's side)": 4000,
        "Probability of the Deal(in %)": 93,
        "Weighted forecast": 15800,
        "Referral Potential(scale from 1-10)": 6,
        "target close date": "Dec 8th",
        "Projected time it took to close": "3.5 weeks",
        "Last Interaction": "Nov 18th 2023"
    },
    {
        "Deal Orders": "Deal 005",
        "Company": "Skin Care Ult",
        "Status ": "Win",
        "Contact": "Kim ",
        "Deal Owner": "Cole",
        "Features client wants implemented/Version Used": " Version 3: Complex Version (GPT Vision pytorch implementation)",
        "Deal Amount": "28,350",
        "Expenses(on the company's side)": 18300,
        "Probability of the Deal(in %)": 100,
        "Weighted forecast": 18300,
        "Referral Potential(scale from 1-10)": 5,
        "target close date": "Nov 22th 2023",
        "Projected time it took to close": "5 weeks",
        "Last Interaction": "Nov 22th 2023"
    },
    {
        "Deal Orders": "Deal 006",
        "Company": "Botpress",
        "Status ": "Negotiation",
        "Contact": "Carl",
        "Deal Owner": "Alozie",
        "Features client wants implemented/Version Used": "Version 3: ComplexVersion(Need to build a high level RAG Pipeline)",
        "Deal Amount": "22,000 ",
        "Expenses(on the company's side)": 8000,
        "Probability of the Deal(in %)": 78,
        "Weighted forecast": 17160,
        "Referral Potential(scale from 1-10)": 7,
        "target close date": "Nov 29th 2023",
        "Projected time it took to close": "4.5 weeks",
        "Last Interaction": "Nov 21th 2023"
    },
    {
        "Deal Orders": "Deal 007",
        "Company": "R Toy",
        "Status ": "Lose ",
        "Contact": "Tamn",
        "Deal Owner": "Ryan",
        "Features client wants implemented/Version Used": "Version 3: Complex Version(Need to build high level RAG Pipeline",
        "Deal Amount": "26,000",
        "Expenses(on the company's side)": 8500,
        "Probability of the Deal(in %)": 0,
        "Weighted forecast": 0,
        "Referral Potential(scale from 1-10)": 40,
        "target close date": "Nov 22th 2023",
        "Projected time it took to close": "3 weeks",
        "Last Interaction": "Nov 22th 2023"
    }
]

Part 4 : Creating the File Object and the Assistant

Then we continue the process by creating a file object that points to our data source, in this case a text file named “output.txt”. This is accomplished through the following code snippet:

file = client.files.create(
    file=open("output.txt", "rb"),
    purpose='assistants'
)

Following the creation of the file object, we proceed to establish the assistant itself. This is done by calling the API to create an assistant with a specific set of instructions tailored for analyzing sales data:

assistant = client.beta.assistants.create(
    name="Data Analyst Sales Assistant",
    instructions="To effectively analyze the sales data, follow these steps: Firstly, read the JSON data from the text file using Python's json module, ensuring you convert the content into a structured Python format (dictionary or list). Next, familiarize yourself with the data structure, identifying key metrics and details such as 'Deal Orders', 'Company', 'Status', 'Contact', and financial figures. Utilize this understanding to generate comprehensive summaries, identify trends, and provide insights, including total sales, average deal sizes, and status distributions. Additionally, create accurate and user-friendly visual representations of the data to convey these insights clearly. Efficiently handle queries related to the data, and ensure all information provided is accurate and up-to-date.",
    tools=[{"type": "code_interpreter"}, {"type": "retrieval"}],
    model="gpt-4-1106-preview",
    file_ids=[file.id]
)

This assistant is equipped with specific functionalities, notably the ability to read and analyze data from the provided text file and the capacity to use retrieval features. Instructions to the assistant emphasize the importance of:

  1. Reading the JSON File: Utilizing Python’s json module to parse the data, converting it into a structured Python format ready for analysis.
  2. Understanding the Data Structure: Recognizing the arrangement of data to accurately extract and analyze pertinent details.
  3. Data Analysis and Insights: Generating detailed summaries and visualizations to highlight trends and derive meaningful insights from the data.
  4. Visualization of Data: Creating clear and informative charts to illustrate sales trends and patterns.
  5. Query Handling: Addressing a variety of queries with efficiency and precision.
  6. Maintaining Data Accuracy: Ensuring the reliability of the data presented.

The rationale behind these instructions is:

  • Reading and Understanding the JSON File: JSON’s lightweight and human-readable format makes it an ideal choice for data interchange. By employing Python’s json module, the assistant knows exactly how to read and interpret the JSON file efficiently with code, translating the data into a structured Python dictionary or list for easy manipulation.
  • Analyzing and Visualizing Data: Given the complexity of JSON files, which can encompass intricate data structures, it is crucial for the assistant to comprehend the specific configuration of the sales data. This understanding enables the assistant to accurately process and analyze the data, whether summarizing key metrics, identifying trends, or comparing different datasets.

Part 5 : Running the Thread and Retrieving the Image

In this segment, we discuss the process of extracting and saving the visual content from a conversation thread using the Assistant API. This method is particularly useful for preserving and outputting the generated images created through code interpretation.
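Before we can read any messages, the thread needs to exist and a run needs to have completed. As a reminder from the earlier parts of this series, that setup looks roughly like this (a sketch only; it assumes a configured client, the assistant created above, and a live API key, so it won’t run standalone):

```python
import time

# Create a conversation thread and post the user's request to it
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="create me a bar graph measuring the expenses of each deal in our data file",
)

# Start a run with our data-analyst assistant and poll until it finishes
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
```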

Firstly, we retrieve the list of messages from a specific conversation thread:

messages = client.beta.threads.messages.list(
    thread_id=thread.id
)

To systematically store these messages, we serialize the message data into JSON format and save it to a file, “messages1.json”. This step not only archives the conversation but also provides an accessible format for future analysis:

with open("messages1.json", "w") as f:
    messages_json = messages.model_dump()
    json.dump(messages_json, f, indent=4)

Following this, we iterate through each message to identify and process textual and visual elements separately:

print("Displaying the conversation:")
for msg in messages.data:
    role = msg.role
    for content_item in msg.content:
        if content_item.type == "text":
            text_content = content_item.text.value
            print(f"{role.capitalize()}: {text_content}")
        elif content_item.type == "image_file":
            image_file_id = content_item.image_file.file_id
            print(f"Image File ID: {image_file_id}")

            # Retrieve the image content using the client's method
            image_data = client.files.content(image_file_id)
            image_data_bytes = image_data.read()

            # Save the image data to a file
            with open(f"./image_{image_file_id}.png", "wb") as file:
                file.write(image_data_bytes)
            print(f"Image {image_file_id} saved as image_{image_file_id}.png")

The process checks each item within a message for its content type. If the content type is text, it prints the text.

For image files, it utilizes the image_file.file_id to retrieve the image data from OpenAI's servers using the client.files.content method. This method, requiring the file_id, fetches the file's content, which is then read into a bytes object.

The image data, now in binary form, is saved locally with a unique filename derived from its file_id, ensuring easy identification and access.

Part 6 : The Output

Here is the final text and image output my Assistant gave me:

Displaying the conversation:
Image File ID: file-friBMjhBb5B7KlLrXAhqr9QZ
Image file-friBMjhBb5B7KlLrXAhqr9QZ saved as image_file-friBMjhBb5B7KlLrXAhqr9QZ.png
Assistant: The bar graph above measures the expenses on the company's side for each deal, as listed in the data you provided. Each horizontal bar corresponds to a different deal, and the length of the bar represents the amount of expenses associated with that deal. If you need any further analyses or modifications to the graph, feel free to let me know!
Image file-friBMjhBb5B7KlLrXAhqr9QZ saved as image_file-friBMjhBb5B7KlLrXAhqr9QZ.png
Assistant: The DataFrame has been successfully created with the JSON data, and we can see that it includes a column labeled `Expenses(on the company's side)` which seems to be the one containing the expense information for each deal. The column `Deal Orders` appears to contain identifiers for each deal.

With this information, I can now create a bar graph that measures the expenses of each deal. Let's proceed with the visualization.
Image file-friBMjhBb5B7KlLrXAhqr9QZ saved as image_file-friBMjhBb5B7KlLrXAhqr9QZ.png
Assistant: The JSON data has been successfully loaded into a DataFrame, but it looks like I forgot to actually display the DataFrame structure. Let's take a look at the first few rows of the DataFrame to understand its contents and identify the columns that contain the expense information for each deal.
Image file-friBMjhBb5B7KlLrXAhqr9QZ saved as image_file-friBMjhBb5B7KlLrXAhqr9QZ.png
Assistant: It looks like the file contains JSON, a common data exchange format that is quite different from CSV or Excel files. To proceed with creating the bar graph, I will need to parse this JSON data into a format that we can use for analysis. Let's begin by loading the JSON data into a pandas DataFrame, which will make it easier to work with. Then we can identify the expenses associated with each deal and create the bar graph. I'll go ahead and do that now.
Image file-friBMjhBb5B7KlLrXAhqr9QZ saved as image_file-friBMjhBb5B7KlLrXAhqr9QZ.png
Assistant: There seems to be an issue with reading the file; it's neither a standard CSV nor an Excel file. The error suggests the structure of the file may not be aligned with typical delimited data formats, or it could be a different file type or format altogether.

Given the information from the error message, we might need to inspect the file manually to determine its format. Let's first try to read the file as a plain text file to explore its structure.
Image file-friBMjhBb5B7KlLrXAhqr9QZ saved as image_file-friBMjhBb5B7KlLrXAhqr9QZ.png
Assistant: Before I can create a bar graph to show the expenses of each deal, I will need to examine the contents of your data file to understand the structure and find the information needed for the visualization. I'll start by opening the file and taking a look at its contents.
Image file-friBMjhBb5B7KlLrXAhqr9QZ saved as image_file-friBMjhBb5B7KlLrXAhqr9QZ.png
User: create me a bar graph measuring the expensives of each deal in our data file
Image file-friBMjhBb5B7KlLrXAhqr9QZ saved as image_file-friBMjhBb5B7KlLrXAhqr9QZ.png

Additional Resources:

If you want to connect with me here is my LinkedIn — Alozie Igbokwe LinkedIn Profile


Alozie Igbokwe

I write articles on AI Agents. Here is my YT Channel if you want to see walkthroughs on AI Agents you can build for your business: https://shorturl.at/5s1tN