OpenAI Assistants and Data Science

Jason Merwin
8 min read · Nov 27, 2023


Introduction

One of ChatGPT’s most significant features is its capability to write and execute code based on natural language prompts. This transforms ChatGPT from a conversational agent into a powerful tool for data analysis. OpenAI recently released an API that enables developers to generate custom GPTs, referred to as Assistants, that can be enhanced by attaching specialized functions, instructions, and data. For Data Scientists, this has the obvious potential to significantly increase productivity, but perhaps more importantly, it has the potential to change how we do data science. In this article, I demonstrate an introductory project with the Assistants API, showing how to programmatically create, instruct, exchange data with, and coordinate multiple Assistants for data science tasks.

Project Overview

The Assistants API provided by OpenAI allows developers to build custom AI assistants capable of performing a variety of tasks (OpenAI). Multiple tools can be attached to an Assistant, such as the Code Interpreter, Knowledge Retrieval, and even custom functions defined by the user (more articles on that coming soon).

The goal of this project was to use the API to create Assistants and test their ability to carry out moderately complex Data Science tasks like data cleaning, feature engineering, feature selection, and training a machine learning model. The dataset used in this study is the Wisconsin Breast Cancer Dataset, downloaded from the UCI Data Repository. It is a set of 699 samples compiled by Dr. William H. Wolberg for use in developing algorithms for cancer diagnosis and prognosis. The classification target is the tumor diagnosis of malignant or benign, with 9 features per sample.
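
If you want to reproduce the setup, the raw data can be pulled straight from the UCI repository and saved as the tumor.csv file used below. The snippet is only a minimal sketch: the column names come from the repository’s documentation, and the exact preprocessing behind my tumor.csv is not shown in this article, so treat the details as assumptions.

# A possible way to pull the raw data and save it locally as tumor.csv
import pandas as pd

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")
columns = [
    "Sample_code_number", "Clump_Thickness", "Uniformity_of_Cell_Size",
    "Uniformity_of_Cell_Shape", "Marginal_Adhesion", "Single_Epithelial_Cell_Size",
    "Bare_Nuclei", "Bland_Chromatin", "Normal_Nucleoli", "Mitoses", "Class"
]

# The raw file has no header row and marks missing values with '?'
df = pd.read_csv(url, header=None, names=columns, na_values="?")

# Drop the sample ID so only the 9 features and the Class target remain
df = df.drop(columns=["Sample_code_number"])
df.to_csv("tumor.csv", index=False)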

Methodology

Setting up an account with OpenAI is pretty straightforward. It requires the usual account setup steps: an email address, a password, and then account verification. Once the account is set up you can request an API key, which is required for authenticating API requests. The final step is to complete the billing information; free tiers are available but allow only limited usage. The costs associated with a paid tier are negligible for individual use; this entire project has so far cost me less than $5. The complete code for this project is available here.

First, install the openai module into your environment. I used JupyterLab notebooks, and a simple pip install worked fine. Begin by importing the required modules, defining your API key, and instantiating the OpenAI client.

# imports used throughout this project
import io
import time
import openai
import pandas as pd

# set the API key and instantiate the OpenAI client
OPENAI_API_KEY = 'your_API_key'
client = openai.OpenAI(api_key=OPENAI_API_KEY)

Next, we want to create the assistant, give it a data set as a csv file, and define the instructions and actions the assistant should carry out with the provided data set.

# load and check the file for the engineer
asst_file = 'tumor.csv'
df = pd.read_csv(asst_file)

# create the assistant instructions
mls = '''
You are a data engineer who will work with data in a csv file in your files.
When the user asks you to perform your actions, use the csv file to read the data into a pandas dataframe.
The data set is to be used for a classification model.
Execute each of the steps listed below in your ACTIONS section. The user will identify the target variable.

ACTIONS:
1. Read the file data into a pandas DataFrame.
2. Summarize each feature and the target variable in the data set and prepare the results as Table_1.
3. Check for missing values and impute the column mean for any missing values.
4. Create two new feature interaction columns for each unique pair of variables, using multiplication for one interaction column and division for the other.
5. Run a logistic regression to predict the target variable with LASSO to select features. Use a lambda value of 1.
6. Prepare the LASSO coefficient values as Table_2.
7. Prepare a final data set that only contains features with non-zero LASSO coefficients and the target variable as Table_3.
8. Provide a summary paragraph explaining the preparation of the data set.
9. Prepare Table_1, Table_2, and Table_3 as csv files for download by the user.

DO NOT:
1. Do not return any images.
'''

# upload the csv file with purpose "assistants"
response = client.files.create(
    file=open(asst_file, "rb"),
    purpose="assistants"
)
print(response)
file_id = response.id

# create the assistant with the Code Interpreter tool and the uploaded file
my_assistant = client.beta.assistants.create(
    instructions=mls,
    name="engine_1",
    tools=[{"type": "code_interpreter"}],
    model="gpt-4-1106-preview",  # gpt-4
    file_ids=[file_id]
)

# get the file id back from the assistant
fileId = my_assistant.file_ids[0]

Next, we need to create a Thread with the assistant. A Thread represents a conversation and can be referenced by ID to create and preserve context across multiple messages. Add a Message containing the instructions to the Thread, and finally run the Assistant.

message_string = "Please execute your ACTIONS on the data stored in the csv file " + fileId + ". The Target variable is Class"
print(message_string)

# Create a Thread
thread = client.beta.threads.create()

# Add a Message to the Thread
message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=message_string
)

# Run the Assistant
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=my_assistant.id
    # instructions="Overwrite hard-coded instructions here"
)

The assistant will require a few seconds to generate a response. I used a simple while loop that waits 60 seconds before checking whether the run has completed.

while True:
    sec = 60
    # Wait before polling for the run status
    time.sleep(sec)
    # Retrieve the run status
    run_status = client.beta.threads.runs.retrieve(
        thread_id=thread.id,
        run_id=run.id
    )
    print(f'{sec} seconds later...')
    # If the run is completed, get the messages
    if run_status.status == 'completed':
        messages = client.beta.threads.messages.list(
            thread_id=thread.id
        )
        # Loop through messages and print content based on role
        for msg in messages.data:
            role = msg.role
            try:
                content = msg.content[0].text.value
                print(f"{role.capitalize()}: {content}")
            except AttributeError:
                # This executes if the content block has no .text attribute
                print(f"{role.capitalize()}: [Non-text content, possibly an image or other file type]")
        break
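
If you would rather not spin indefinitely, the same polling idea can be made a little more defensive by also checking the terminal run statuses (failed, cancelled, expired) that the Runs API can report. The snippet below is a sketch along those lines, reusing the client, thread, and run objects created above.

# Optional: a more defensive polling sketch that also stops on failed runs
terminal_states = {"completed", "failed", "cancelled", "expired"}

while True:
    time.sleep(10)  # poll more often than once per minute
    run_status = client.beta.threads.runs.retrieve(
        thread_id=thread.id,
        run_id=run.id
    )
    print(f"run status: {run_status.status}")
    if run_status.status in terminal_states:
        break

if run_status.status != "completed":
    raise RuntimeError(f"Run ended with status {run_status.status}")

messages = client.beta.threads.messages.list(thread_id=thread.id)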

Once you receive the response, you will need to extract the file IDs so that you can retrieve their contents and read them back into csv files. In some cases a single response may return multiple files. To accommodate this possibility, I wrote a pair of functions that read each file by file ID, download its contents, and save it locally. You can then read each file back into a data frame to inspect the results.

def read_and_save_file(first_file_id, file_name):
    # its contents are binary, so read it and then make it a file-like object
    file_data = client.files.content(first_file_id)
    file_data_bytes = file_data.read()
    file_like_object = io.BytesIO(file_data_bytes)
    # now read as csv to create df
    returned_data = pd.read_csv(file_like_object)
    returned_data.to_csv(file_name, index=False)
    return returned_data
# file = read_and_save_file(first_file_id, "analyst_output.csv")

def files_from_messages(messages, asst_name):
    first_thread_message = messages.data[0]  # Accessing the first ThreadMessage
    message_ids = first_thread_message.file_ids
    print(message_ids)
    # Loop through each file ID and save the file with a sequential name
    for i, file_id in enumerate(message_ids):
        file_name = f"{asst_name}_output_{i+1}.csv"  # Generate a sequential file name
        read_and_save_file(file_id, file_name)
        print(f'saved {file_name}')

# extract the file IDs from the response and retrieve the content
asst_name = 'engineer'
files_from_messages(messages, asst_name)

df1 = pd.read_csv('engineer_output_1.csv')
display(df1)
Data engineer assistant output

As you can see, the Assistant returned the engineered data set, following the instructions to build feature interaction pairs, and then used a logistic regression with LASSO regularization to reduce the features to only those with non-zero coefficients. This reduced the total feature count to 38. Notice that some of the original features were eliminated by this step and are only retained as part of the feature interactions engineered by the Assistant.
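
If you want to sanity-check the selection rather than take the Assistant’s word for it, you can repeat the L1-penalized logistic regression locally on the returned table. The snippet below is only a rough sketch: it assumes the engineered data set was saved as engineer_output_1.csv with the target column named Class, and it uses scikit-learn’s C parameter, which is the inverse of the lambda given in the instructions.

# Rough local check of the LASSO selection (not the Assistant's generated code)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

check = pd.read_csv('engineer_output_1.csv')  # assumed to hold the engineered features plus Class
X = check.drop(columns=['Class'])
y = check['Class']

# L1-penalized logistic regression; C = 1 / lambda, so C=1.0 matches a lambda of 1
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
lasso.fit(StandardScaler().fit_transform(X), y)

nonzero = X.columns[lasso.coef_[0] != 0]
print(f"{len(nonzero)} features with non-zero coefficients")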

From here you will want to clean up the session by deleting the Assistant; otherwise they will accumulate with each assistants.create() call. Luckily, OpenAI provides a web-based interface for Assistants called the Playground, where you can check how many Assistants are languishing there because you forgot to erase them. A link is provided in the references section below.

# Clean up the assistant
response = client.beta.assistants.delete(my_assistant.id)
print(response)
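
If you have already lost track of earlier experiments, you can also do this housekeeping programmatically instead of through the Playground. A small sketch, assuming the same client object created earlier:

# List any leftover Assistants and delete them (review the printed names before deleting)
leftover = client.beta.assistants.list(limit=100)
for asst in leftover.data:
    print(asst.id, asst.name)
    client.beta.assistants.delete(asst.id)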

Now that we have our prepared data set, we need to repeat the process and create a modeling Assistant to load the data table returned by the engineer Assistant. The instructions for this second Assistant are to load the csv data into a pandas data frame, split it into training and testing sets (75:25), train an “Extra Trees” ensemble with 2,000 trees, use the testing data to measure the model’s accuracy, precision, and recall, and finally generate a confusion matrix from the test data predictions. The code for these steps is essentially the same as for the first Assistant and is available in the GitHub Repository here.
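
For reference, the modeling step the second Assistant is asked to perform corresponds roughly to the scikit-learn code below. This is a local sketch, not the Assistant’s generated code, and it assumes the engineered data set lives in engineer_output_1.csv with the target column named Class and malignant coded as 4, following the original UCI labels.

# Local sketch of the modeling Assistant's task (see assumptions above)
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

data = pd.read_csv('engineer_output_1.csv')
X = data.drop(columns=['Class'])
y = data['Class']

# 75:25 train/test split, as in the Assistant's instructions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Extra Trees ensemble with 2,000 trees
model = ExtraTreesClassifier(n_estimators=2000, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds, pos_label=4))  # malignant assumed coded as 4
print("recall:   ", recall_score(y_test, preds, pos_label=4))
print(confusion_matrix(y_test, preds))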

Results

Model Performance Metrics and Confusion Matrix

The performance of the classification model was on par with typical scores from data scientists using the same data set. The Assistant’s model scored above 0.97 on both accuracy and precision. The UCI Data Set Repository reports a variety of machine learning models with an average accuracy of 0.965 and an average precision of 0.955.

Conclusions / Discussion

The bottom line. The goal of this project was to see how the OpenAI Assistants would handle a simple data set and multi-step instructions for data engineering and modeling, and to measure the performance of the trained model. It appears that the Assistant is quite capable of converting descriptive instructions into code, executing that code, and training a respectable classification model. The model’s performance was on par with what you would expect from a Data Scientist, but the work was completed in less than 5 minutes for a cost of about 25 cents.

Some limitations worth mentioning. While the results are impressive, there were a few limitations encountered while working on this project. Compute power is limited. I received a few responses explaining that the requested actions would exceed the computational limits for the Assistant, so I had to scale back some of my instructions and choose a modeling approach that wasn’t too computationally demanding. This is why I went with Extra Trees for the model, as it is computationally lightweight. Also, I chose a data set that is a softball test for this introduction. More difficult data sets and more complicated sets of instructions will certainly reveal additional weaknesses. It is also certain that all of the limitations I’m discussing here are temporary.

Should we be worried? As the capabilities of AI increase, particularly its ability to write and execute code, the concern that it might replace human programmers and analysts is understandable. Am I concerned? Of course. But the more I use AI in data science, the less I worry about the future it will bring. Our jobs as data scientists will no doubt be impacted by it, but I don’t think it will necessarily be replacing anyone. Looking at the awesome power of AI and seeing only a way to cut costs misses the point of the technology and the potential it offers. In my opinion, the successful use of AI in data science will be seen in those who view it as a tool to rethink and reinvent how we do data science. Faster, more accurate, more comprehensive models are an initial step, but ultimately we should be thinking about and exploring approaches that wouldn’t be feasible without AI. We shouldn’t use old maps to navigate these new roads.

References

https://chat.openai.com/ — ChatGPT

https://archive.ics.uci.edu/ — UCI Data Set Repository

https://platform.openai.com/playground — The GPT Assistant platform
