Beyond the hype: GPT-4 vs traditional machine learning for imputing missing data

Emily Bicks
Data science at Nesta
6 min read · Oct 17, 2023

A comparison of GPT-4 vs. k-nearest neighbours regression for filling in missing nutritional data

Introduction

The Healthy Life mission at Nesta seeks to halve the number of people with obesity in the UK over the ten years to 2030. We're working towards that goal by trying to improve our food environment, i.e. the types of food that are readily available, accessible and affordable to us in daily life.

To accomplish our aim of helping people eat more healthily, which we're measuring via calorie consumption, we need to know the calorie density (kcal/100g) of the food people are buying.

To get a picture of the current food consumption landscape, we've purchased a large dataset with information about the items of food and drink people in the UK bought in 2021. However, the dataset only contained nutritional information for a small proportion of the products and was missing details for around 16,000 items.

This article describes our efforts to fill in the missing calorie data for these purchases, as well as code snippets to help you use these methods to fill in your own missing data. We explored two approaches:

  1. Traditional Supervised Learning: use the product descriptions that we have to train a k-nearest neighbours (KNN) regression model
  2. Ask GPT-4: ask GPT-4 to fill in the missing data by providing it with the product description

What we have

The datasets we worked with were:

  1. Product descriptions and calorie density (kcal/100g) for 5,541 products from the dataset we’ve purchased
  2. An additional 22,868 examples of descriptions and calorie density (kcal/100g) from the MenuTracker dataset, web scraped from UK menus by Cambridge University and made publicly available (more information can be found in this paper and this GitHub repository)

For example:

  • Breakfast Roll with Ketchup — 230.2 kcal/100g
  • Apple & Grape Fruit Bag — 57.5 kcal/100g
  • McChicken Sandwich — 226.2 kcal/100g

Our goal was to use this data to estimate the calorie density of the ~16k products purchased in the UK in 2021 for which we had no nutritional information.

Method 1: Traditional Supervised Machine Learning

For this method, we embedded product descriptions using a BERT sentence transformer and used those embeddings to train a KNN regression model on 80% of the labelled data we'd acquired, plus the extra ~23K examples provided by Cambridge. The model was then tested on the remaining 20% of our labelled data; this hold-out set was kept out of the training process entirely and used to evaluate both methods.
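As a rough sketch of that split (assuming the 5,541 labelled products live in a pandas DataFrame called labelled_data and the Cambridge examples in menutracker_data, both with product_descriptions and calories columns; the frame names here are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

# hold out 20% of our own labelled data for testing
training_data, test_data = train_test_split(
    labelled_data, test_size=0.2, random_state=42
)

# the Cambridge examples only ever join the training pool
training_data = pd.concat([training_data, menutracker_data], ignore_index=True)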

How to do it

Step 1: Embed descriptions using a BERT sentence transformer

from sentence_transformers import SentenceTransformer

# instantiate transformer model
bert_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
bert_model.max_seq_length = 512

# embed train and test data
training_embeddings = bert_model.encode(list(training_data["product_descriptions"]))
test_embeddings = bert_model.encode(list(test_data["product_descriptions"]))

Step 2: Use a grid search to find the optimal model configuration, optimising for the root mean square error

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# define parameter grid
param_grid = {
    "n_neighbors": [15, 25, 50, 75],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

# fit the grid search with 5-fold cross-validation, optimising for RMSE
grid_clf = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    scoring="neg_root_mean_squared_error",
).fit(training_embeddings, list(training_data["calories"]))

# identify the optimal parameter combination
best_params = grid_clf.best_params_
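Once fitted, the grid search also exposes the best cross-validated score, which is worth a sanity check before refitting (scikit-learn reports error metrics negated, so we flip the sign back):

# best cross-validated RMSE (the scoring was negated, so negate it back)
cv_rmse = -grid_clf.best_score_
print(f"best params: {best_params}, cross-validated RMSE: {cv_rmse:.1f} kcal/100g")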

Step 3: Fit the model on the full training set using the optimal parameters

model = KNeighborsRegressor(
    n_neighbors=best_params["n_neighbors"],
    weights=best_params["weights"],
    p=best_params["p"],
).fit(training_embeddings, list(training_data["calories"]))
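As an aside, since best_params is just a dictionary of keyword arguments, the same fit can be written as KNeighborsRegressor(**best_params).fit(...). And because GridSearchCV refits the winning configuration on the full training set by default (refit=True), grid_clf.best_estimator_ is already an equivalent fitted model.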

Step 4: Test the model on the hold-out set which has not been used in the training process

from sklearn.metrics import mean_absolute_error, mean_squared_error

# predict on test set
predictions = model.predict(test_embeddings)

# calculate performance metrics, comparing predicted to actual values
mae = mean_absolute_error(list(test_data["calories"]), predictions)
# squared=False makes this the root mean squared error (RMSE), not the MSE
rmse = mean_squared_error(list(test_data["calories"]), predictions, squared=False)

Results

The KNN model was able to predict calorie density with a root mean square error of 71.6 kcal/100g.

The predicted vs. actual plot in Figure 1 (left) shows that the predictions and actuals were linearly correlated, signifying that the model learnt a meaningful relationship.

The residuals plot on the right shows that there was some correlation between prediction errors (predicted - actual) and actual values, meaning that the model was over-predicting for products with a low calorie density and under-predicting for products with a higher calorie density.

In theory, this implies that there may be additional information about these products (e.g. ingredients) that the model could leverage to improve predictions.

Figure 1: left — predicted vs. actual calories of the KNN regression model; right — residuals of the KNN regression model
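Plots like these take only a few lines with matplotlib; here is a minimal sketch using the predictions and actuals from Step 4 (the styling is ours, not the exact code behind Figure 1):

import matplotlib.pyplot as plt
import numpy as np

actuals = np.array(test_data["calories"])
residuals = predictions - actuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# left: predicted vs. actual calorie density
ax1.scatter(actuals, predictions, alpha=0.3)
ax1.set_xlabel("actual (kcal/100g)")
ax1.set_ylabel("predicted (kcal/100g)")

# right: residuals (predicted - actual) against actual values
ax2.scatter(actuals, residuals, alpha=0.3)
ax2.axhline(0, color="grey", linestyle="--")
ax2.set_xlabel("actual (kcal/100g)")
ax2.set_ylabel("residual (kcal/100g)")

plt.tight_layout()
plt.show()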

Method 2: Ask GPT-4

For this method, we used the langchain chat models API to ask GPT-4:

“Based on the calories in similar products, provide an estimate of the calories in 100g of a [PRODUCT DESCRIPTION] purchased in the United Kingdom for consumption outside of the home. Please only provide a single number with no words or units.”

We made predictions on the same test set as the previous method so they could be compared directly. We needed to specify “no words or units” in the prompt so the response could be converted to a float and compared with the actual values.

How to do it

Step 1: Create an OpenAI account and generate an API key (pricing varies by model). It is recommended to store your API key in a .env file so it can be loaded as an environment variable
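For example, with the python-dotenv package and a .env file containing a line like OPENAI_API_KEY=<your key>:

from dotenv import load_dotenv

# read variables (including OPENAI_API_KEY) from a local .env file
# into the process environment
load_dotenv()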

Step 2: Instantiate your model and load in your API key

import os

from langchain.chat_models import ChatOpenAI

# load API key from the environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# instantiate the llm using GPT-4
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-4")

Step 3: Define the prompt

prompt_string = """
Based on the calories in similar products, provide an estimate of the calories
in 100g of a {} purchased in the United Kingdom for consumption outside of the
home. Please only provide a single number with no words or units.
"""

Step 4: Use the prompt to make predictions for all products in the test set

predictions = [
    float(llm.predict(prompt_string.format(desc)))
    for desc in test_data["product_descriptions"]
]
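Even with the "no words or units" instruction, GPT-4 occasionally replies with stray text, and the bare float() call will then raise a ValueError. A slightly more defensive sketch (extract_number is our own helper, not part of langchain):

import re

def extract_number(response: str) -> float:
    """Pull the first number out of the model's reply; NaN if there isn't one."""
    match = re.search(r"\d+(?:\.\d+)?", response)
    return float(match.group()) if match else float("nan")

predictions = [
    extract_number(llm.predict(prompt_string.format(desc)))
    for desc in test_data["product_descriptions"]
]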

Step 5: Evaluate the model using the same metrics as the previous method

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate performance metrics, comparing predicted to actual values
mae = mean_absolute_error(list(test_data["calories"]), predictions)
# squared=False makes this the RMSE, as in the previous method
rmse = mean_squared_error(list(test_data["calories"]), predictions, squared=False)

Results

GPT-4 was able to predict calorie density with a root mean square error of 96.4 kcal/100g.

Figure 2 highlights that the correlation between predicted and actual values was weaker than with the KNN model, and the same issue persisted: overestimating lower-calorie products and underestimating higher-calorie products.

Further work could be done to fine-tune the model using the training set and evaluate whether that improves the predictive power.

Figure 2: left — predicted vs. actual calories of GPT-4; right — residuals of the GPT-4 predictions

What we learnt

In our specific use-case, traditional supervised machine learning predicted calorie density more accurately than GPT-4 out of the box.

A similar methodology could be used to compare LLMs with traditional ML approaches on other prediction tasks, but one caveat with this evaluation: OpenAI does not publish details about the training data in the technical report for GPT-4. Unless your test set is your own proprietary data and isn't available on the web, it's therefore impossible to be certain it wasn't part of GPT-4's pre-training. If the test data was in fact part of the training set, it would invalidate the experiment.

It's also important to note that GPT-4 is non-deterministic, so it will not necessarily generate the same output given the same input. If you use GPT-4 to fill in missing data, the resulting dataset will therefore vary with each run, and any pipelines built on top of it would not be reproducible from the raw data. So if reproducibility is important, GPT-4 wouldn't be advisable.
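One partial mitigation, if you do go down this route: the chat models API exposes a temperature parameter, and setting it to 0 makes the sampling as close to deterministic as the API allows, although identical outputs still aren't guaranteed:

# temperature=0 minimises sampling randomness, but the API still
# does not guarantee identical outputs across runs
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-4", temperature=0)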

While there are probably many situations where GPT-4 is well suited to filling in missing data, ours wasn't one of them. It's always important to go beyond the hype, because sometimes the tried-and-true methods are more effective than the latest and greatest!
