TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

End to End ML with GPT-3.5

Learn how to use GPT-3.5 to do the heavy lifting for data acquisition, preprocessing, model training, and deployment

Alex Adam

Published in TDS Archive · 14 min read · May 24, 2023

A lot of repetitive boilerplate code exists in the model development phase of any machine learning application. Popular libraries such as PyTorch Lightning have been created to standardize the operations performed when training/evaluating neural networks, leading to much cleaner code. However, boilerplate extends far beyond training loops. Even the data acquisition phase of machine learning projects is full of steps that are necessary but time consuming. One way to deal with this challenge would be to create a library similar to PyTorch Lightning for the entire model development process. It would have to be general enough to work with a variety of model types beyond neural networks, and capable of integrating a variety of data sources.

Code examples for extracting data, preprocessing, model training, and deployment are readily available on the internet, though gathering them and integrating them into a project takes time. Since such code is on the internet, chances are it has been trained on by a large language model (LLM) and can be rearranged in a variety of useful ways through natural language commands. The goal of this post is to show how easy it is to automate many of the steps common to ML projects by using the GPT-3.5 API from OpenAI. I’ll show some failure cases along the way, and how to tune prompts to fix bugs when possible. Starting from scratch, without even so much as a dataset, we’ll end up with a model that is ready to be deployed on AWS SageMaker. If you’re following along, make sure to set up the OpenAI API as follows:

import openai
openai.api_key = "YOUR KEY HERE"

Also, the following utility function is helpful for calling the GPT-3.5 API:

def get_api_result(prompt):
    request = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[{"role": "user", "content": prompt}]
    )

    result = request['choices'][0]['message']['content']

    print(result)

Extract, transform, load (ETL)

ETL Illustration by me + Midjourney

This section is simplified since it only considers a single data source, but can in principle be extended to situations where data comes from multiple sources (csv files, databases, images, etc.). The first step is to extract some data. For the sake of simplicity, I’ll use the Income Prediction¹ dataset where the goal is to predict if an individual earns more/less than $50k per year based on their education, job position, industry, etc. The function below will be used to generate the code that downloads our data. Note how the prompt template is designed to bias the API to generate Python code:

def extract(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)

Here we ask GPT-3.5 to:

Retrieve the adult income prediction dataset from openml using the sklearn fetch_openml function. Make sure to retrieve the data as a single dataframe which includes the target in a column named “target”. Name the resulting dataframe “df”.
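In code, this amounts to a single call to the extract() helper defined above, with the request passed in as a string:

extract("""Retrieve the adult income prediction dataset from openml using the sklearn fetch_openml function. Make sure to retrieve the data as a single dataframe which includes the target in a column named "target". Name the resulting dataframe "df".""")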

which gives us the output:

from sklearn.datasets import fetch_openml

adult_income = fetch_openml(name='adult', version=2)
df = pd.DataFrame(adult_income.data, columns=adult_income.feature_names)
df['target'] = adult_income.target

This code is free of bugs and gives us exactly what we want. Had I used a simpler prompt by removing mentions of openml and the function to use for retrieval, we would get:

import pandas as pd

df = pd.read_csv("adult_income_prediction.csv")
df.columns = ["age", "workclass", "fnlwgt", "education", "education-num",
"marital-status", "occupation", "relationship", "race",
"sex", "capital-gain", "capital-loss", "hours-per-week",
"native-country", "target"]

This assumes that the data is locally available. What’s interesting about this result is that it has the correct column names as a list, even though we did not include them in the API call! These names are nevertheless all over the web, particularly in this Medium post, except for the target column, which is added by GPT. The next step is to transform the data into a format that is usable by machine learning models. We’ll have to use a more advanced prompt template here since GPT needs to be made aware of the column names and their types:

def transform(columns, column_types, prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code using the Pandas to preprocess the DataFrame. The DataFrame columns are {} and their corresponding dtypes are {}.\nInput: {}\nPython code:""".format(columns, column_types, prompt)

    get_api_result(prompt_template)

I then make the following request:

Preprocess the dataframe by converting all categorical columns to their one-hot encoded equivalents, and normalizing numerical columns. Drop rows which have an NA or NaN value in any column. Drop rows that have numeric column outliers as determined by their z score. A numeric column outlier is a value that is outside of the 1 to 99 inter-quantile range. The numerical columns should be normalized using StandardScaler from sklearn. The values in the target column should be converted to 0 or 1 and should be of type int.
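The transform() helper is then called with df’s columns and dtypes filling the first two slots of the template, along these lines (the request string is the one spelled out above, truncated here):

transform(
    list(df.columns),
    list(df.dtypes),
    """Preprocess the dataframe by converting all categorical columns to their one-hot encoded equivalents, and normalizing numerical columns. ..."""  # full request as above
)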

We now get:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the DataFrame
df = pd.read_csv('data.csv')

# Convert categorical columns to one-hot encoded equivalents
df = pd.get_dummies(df, columns=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])

# Normalize numerical columns
scaler = StandardScaler()
numerical_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Drop rows with NA or NaN values
df.dropna(inplace=True)

# Drop rows with numeric column outliers
z_scores = df[numerical_cols].apply(lambda x: (x - x.mean()) / x.std())
df = df[(z_scores > -3).all(axis=1) & (z_scores < 3).all(axis=1)]

# Convert target column to 0 or 1 and change type to int
df['target'] = df['target'].apply(lambda x: 1 if x == '>50K' else 0).astype(int)

which is almost perfect for our purposes if we exclude the part that tries to load the data from a csv file. It’s quite difficult to convince GPT to exclude this, even if we explicitly tell it to assume that df exists and should not be loaded. Lastly, we need to load the data into a local database. This is overkill for such a simple use case, but it’s a good habit to develop.

def load(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)

I use the following instruction:

Connect to an sqlite database named “data”. Use pandas to insert data from a DataFrame named “df” into a table named “income”. Do not include the index column. Commit the changes before closing the connection.

in order to get:

import sqlite3
import pandas as pd

# Create a connection to the database
connection = sqlite3.connect("data.db")

# Define the DataFrame
df = pd.DataFrame()

# Define the table name
table_name = "income"

# Insert the data into the table
df.to_sql(table_name, connection, if_exists="replace", index=False)

# Commit the changes to the database
connection.commit()

# Close the connection
connection.close()

which is perfect, minus the unnecessary creation of df. With this processed data in hand, we are ready to train some models.
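As an aside, if the training step runs as a separate script, the processed table can be read back out of sqlite first; a minimal sketch, assuming the database and table names used above:

import sqlite3
import pandas as pd

# Reload the processed data written by the load step
connection = sqlite3.connect("data.db")
df = pd.read_sql("SELECT * FROM income", connection)
connection.close()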

Model Training

Illustration of a loss function by me + Midjourney

Resources permitting, it is a good idea to try out a few different model types to identify the one with the right level of complexity for the given task. Therefore, we ask GPT-3.5 to try out a few different models. First, let’s set up the generic prompt template for model training:

def train(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Focus on using scikit-learn when applicable. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)

and the prompt we’ll be using is:

Train a variety of classification models to predict the “target” column using all other columns. Do so using 5-fold cross validation to choose the best model and corresponding set of hyperparameters, and return the best overall model and corresponding hyperparameter settings. Choose the best model based on accuracy. Assume a dataframe named “df” exists which is to be used for training. Log the entire process using MLFlow. Start logging with mlflow before training any models so only a single run is stored. Make sure that the model is logged using the sklearn module of mlflow. Make sure that only the best overall model is logged, but log metrics for all model types. The mean value of the following metrics on all cross validation folds should be logged: accuracy, AUC, F1 score

Let’s have a look at the generated output and this time go deeper into why some of the specific instructions had to be provided.

import pandas as pd
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import mlflow
import mlflow.sklearn

# load the dataframe
df = pd.read_csv('path/to/dataframe.csv')

# define the features and target columns
features = df.columns.drop('target')
target = 'target'

# set up mlflow
mlflow.set_tracking_uri('localhost:5000') # or set up your own server
mlflow.set_experiment("my_experiment")

# set up cross-validation
cv = 5

# define the models and corresponding hyperparameters to be tuned
models = [
    {
        'name': 'Logistic Regression',
        'estimator': LogisticRegression(),
        'hyperparameters': {
            'penalty': ['l1', 'l2'],
            'C': [0.1, 1, 10]
        }
    },
    {
        'name': 'Decision Tree',
        'estimator': DecisionTreeClassifier(),
        'hyperparameters': {
            'max_depth': [2, 4, 6, 8]
        }
    },
    {
        'name': 'Random Forest',
        'estimator': RandomForestClassifier(),
        'hyperparameters': {
            'n_estimators': [10, 25, 50, 100],
            'max_depth': [None, 2, 4, 6, 8]
        }
    }
]


# iterate through the models and corresponding hyperparameters to train and tune them
with mlflow.start_run():
    best_model = None
    best_metrics = {}

    for model in models:
        print(f"Training {model['name']} model...")

        # set up grid search for hyperparameter tuning
        gs = GridSearchCV(model['estimator'], model['hyperparameters'], cv=cv, verbose=2, n_jobs=-1, scoring='accuracy')
        gs.fit(df[features], df[target])

        # log metrics for all model types
        metrics = cross_validate(gs.best_estimator_, df[features], df[target], cv=cv, n_jobs=-1,
                                 scoring=["accuracy", "roc_auc", "f1"])
        for metric, values in metrics.items():
            mlflow.log_metric(f"{model['name']}_{metric}", values.mean())

        # check if this is the best model based on accuracy
        if best_model is None or gs.best_score_ > best_metrics['accuracy']:
            best_model = gs.best_estimator_
            best_metrics = {'accuracy': gs.best_score_,
                            'AUC': roc_auc_score(df[target], best_model.predict_proba(df[features])[:, 1]),
                            'F1': f1_score(df[target], best_model.predict(df[features]))}

    # register the best model in mlflow
    mlflow.sklearn.log_model(sk_model=best_model, artifact_path="best_model")

    # log the best model's metrics
    for metric, value in best_metrics.items():
        mlflow.log_metric(f"best_model_{metric}", value)

If we remove the loading of df and the # set up mlflow section, we end up with exactly what is desired: a loop over 3 different model types, performing a grid search using 5-fold cross-validation to identify the best hyperparameters for the given model type, while keeping track of metrics. Without specifying “choose the best model based on accuracy”, the generated code uses scoring=["accuracy", "roc_auc", "f1"] for the grid search, which will not work since there is ambiguity as to how to select the best model according to multiple metrics. Without “make sure that the model is logged using the sklearn module of mlflow”, we sometimes end up with mlflow.log_model(), which is wrong. Also, “make sure that only the best overall model is logged” is necessary to avoid storing all models. Overall, this output is acceptable, but it’s unstable, and running it multiple times is likely to introduce different bugs. In order to have everything ready for the serving step, it is useful to add the model signature when saving the model. This signature is basically a collection of feature names and their corresponding types. It is a pain to get GPT-3.5 to add this, so some manual labor has to be done by first adding the import:

from mlflow.models.signature import infer_signature

and then modifying the line of code which logs the model via:

mlflow.sklearn.log_model(sk_model=best_model, artifact_path="best_model", signature=infer_signature(df[features], best_model.predict(df[features])))
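To double-check that the signature was captured, the logged model can be loaded back from the run and used for a quick prediction; a minimal sketch, assuming the run ID is copied from the MLflow UI:

import mlflow.sklearn

# Load the best model back from the MLflow run and score a few rows
run_id = "<run id here>"
loaded_model = mlflow.sklearn.load_model(f"runs:/{run_id}/best_model")
print(loaded_model.predict(df[features].head()))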

Model Serving

Illustration of deployment by me + Midjourney

Since we used MLflow to log the best model, we have a couple of options to serve the model. The simplest option is to host the model locally. Let’s first design the general prompt template for model serving:

def serve_model(model_path, prompt):
    prompt_template = """You are a ChatGPT language model that can generate shell code for deploying models using MLFlow. Please provide a natural language input text, and I will generate the corresponding command to deploy the model. The model is located in the file {}.\nInput: {}\nShell command:""".format(model_path, prompt)

    get_api_result(prompt_template)

and the prompt will be:

Serve the model using port number 1111, and use the local environment manager

By calling serve_model("<model path here>", question) we get:

mlflow models serve -m <model path here> -p 1111 --no-conda

Once we run this command in the shell, we are ready to make predictions by sending data encoded as JSON to the model. We’ll first generate the command to send data to the model, and then create the JSON payload to be inserted into the command.

def send_request(prompt):
    prompt_template = """You are a ChatGPT language model that can generate code for sending data to deployed MLFlow models. Please provide a natural language input text, and I will generate the corresponding command. \nInput: {}\nCommand:""".format(prompt)

    get_api_result(prompt_template)

The following request will be inserted into the prompt template in send_request():

Use the “curl” command to send data “<data here>” to an mlflow model hosted at port 1111 on localhost. Make sure that the content type is “application/json”.

The output generated by GPT-3.5 is:

curl -X POST -H "Content-Type: application/json" -d '<data here>' http://localhost:1111/invocations

It is preferable to have the URL immediately after curl instead of being at the very end of the command, i.e.

curl http://localhost:1111/invocations -X POST -H "Content-Type: application/json" -d '<data here>'

Getting GPT-3.5 to do this is not easy. Both of the following requests fail to do so:

Use the “curl” command to send data “<data here>” to an mlflow model hosted at port 1111 on localhost. Place the URL immediately after “curl”. Make sure that the content type is “application/json”.

Use the “curl” command, with the URL placed before any argument, to send data “<data here>” to an mlflow model hosted at port 1111 on localhost. Make sure that the content type is “application/json”.

Maybe it’s possible to get the desired output if we have GPT-3.5 modify an existing command rather than generate one from scratch. Here is the generic template for modifying commands:

def modify_request(prompt):
    prompt_template = """You are a ChatGPT language model that can modify commands for sending data using "curl". Please provide a natural language instruction, corresponding command, and I will generate the modified command. \nInput: {}\nCommand:""".format(prompt)

    get_api_result(prompt_template)

We will call this function as follows:

code = """curl -X POST -H "Content-Type: application/json" -d '<data here>' http://localhost:1111/invocations"""
prompt = """Please modify the following by placing the url before the "-X POST" argument:\n{}""".format(code)
modify_request(prompt)

which finally gives us:

curl http://localhost:1111/invocations -X POST -H "Content-Type: application/json" -d '<data here>'

Now time to create the payload:

def create_payload(prompt):
    prompt_template = """You are a ChatGPT language model that can generate code for sending data to deployed MLFlow models. Please provide a natural language input text, and I will generate the corresponding command. \nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)

The prompt for this part needed quite a bit of tuning to get the desired output format:

Convert the DataFrame “df” to json format that can be received by a deployed MLFlow model. Wrap the resulting json in an object called “dataframe_split”. The resulting string should not have newlines, and it should not escape quotes. Also, “dataframe_split” should be surrounded by double quotes instead of single quotes. Do not include the “target” column. Use the split “orient” argument

Without the explicit instruction to avoid newlines and escaping quotes, a call to json.dumps() is made, which does not produce the format that the MLflow endpoint expects. The generated code is:

json_data = df.drop("target", axis=1).to_json(orient="split", double_precision=15)
wrapped_data = f'{{"dataframe_split":{json_data}}}'

Before replacing <data here> in the curl request with the value of wrapped_data, we probably want to send only a few rows of data for prediction, otherwise the resulting payload is too large. So we modify the above to be:

json_data = df[:5].drop("target", axis=1).to_json(orient="split", double_precision=15)
wrapped_data = f'{{"dataframe_split":{json_data}}}'

Invoking the model gives:

{"predictions": [0, 0, 0, 1, 0]}

whereas the actual targets are [0, 0, 1, 1, 0].
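For completeness, the same request can be sent from Python rather than the shell, for example with the requests library; a minimal sketch, assuming the model is still being served locally on port 1111 and reusing wrapped_data from above:

import requests

# POST the JSON payload to the locally served MLflow model
response = requests.post(
    "http://localhost:1111/invocations",
    headers={"Content-Type": "application/json"},
    data=wrapped_data,
)
print(response.json())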

There we have it. At the beginning of this post, we didn’t even have access to a dataset, yet we’ve managed to end up with a deployed model that was selected to be the best through cross-validation. Importantly, GPT-3.5 did all the heavy lifting, and only required minimal assistance along the way. I did, however, have to specify particular libraries to use and methods to call, but this was mainly required to resolve ambiguities. Had I specified “Log the entire process” instead of “Log the entire process using MLFlow”, GPT-3.5 would have had too many libraries to choose from, and the resulting model format might not have been useful for serving with MLflow. Thus, some knowledge of the tools used to perform the various steps in the ML pipeline is required to have success using GPT-3.5, but it is minimal compared to the knowledge required to code from scratch.

Another option for serving the model is to host it as a SageMaker endpoint on AWS. Despite how easy this may look on the MLflow website, I assure you that, as with many examples on the web involving AWS, things will go wrong. First of all, Docker must be installed in order to generate the Docker image using the command:

mlflow sagemaker build-and-push-container

Second, the Python library boto3 used to communicate with AWS also requires installation. Beyond this, permissions must be properly set up so that the SageMaker, ECR, and S3 services can communicate with each other on behalf of your account. Here are the commands I ended up having to use:

mlflow deployments run-local -t sagemaker -m <model path> --name income_classifier
mlflow deployments create -t sagemaker --name income_classifier -m model/ --config image_url=<docker image url> --config bucket=mlflow-serving --config region_name=us-east-1

along with some manual tinkering behind the scenes to get the S3 bucket to be in the correct region.
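Once the endpoint is up, it can be invoked directly with boto3; a minimal sketch, assuming the endpoint name and region from the commands above and reusing the wrapped_data payload:

import boto3

# Call the deployed SageMaker endpoint with the JSON payload
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
response = runtime.invoke_endpoint(
    EndpointName="income_classifier",
    ContentType="application/json",
    Body=wrapped_data,
)
print(response["Body"].read().decode())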

With the help of GPT-3.5 we went through the ML pipeline in a (mostly) painless way, though the last mile was a bit trickier. Note how I didn’t use GPT-3.5 to generate the commands for serving the model on AWS. It works poorly for this use case, and creates made-up argument names. I can only speculate that switching to the GPT-4 API would help resolve some of the above bugs, and lead to an even easier model development experience.

While the ML pipeline can be fully automated using LLMs, it isn’t yet safe to have a non-expert be responsible for the process. The bugs in the above code were easily identified because the Python interpreter would throw errors, but there are more subtle bugs that can be harmful. For example, the elimination of outlier values in the preprocessing code could be wrong, leading to too many or too few samples being discarded. In the worst case, it could inadvertently drop entire subgroups of people, exacerbating potential fairness issues.

Additionally, the grid search over hyperparameters could have been done over a poorly chosen range, leading to overfitting or underfitting depending on the range. This would be quite tricky to identify for someone with little ML experience as the code otherwise seems correct, but an understanding of how regularization works in these models is required. Thus, it isn’t yet appropriate to have an unspecialized software engineer stand in for an ML engineer, but that time is fast approaching.

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0)
