Automating Review Evaluation with OpenAI’s GPT-3.5

Mina Mehdinia
10 min read · Sep 6, 2023



Evaluating reviews manually is time-consuming and often lacks a consistent, objective perspective. With the advancements in machine learning and the impressive capabilities of models like OpenAI’s GPT models, automating such evaluations is no longer a distant dream but a reality.

In today’s post, I’ll walk you through how to use OpenAI’s GPT-3.5 to evaluate and score product reviews, all while integrating this process into a Python environment.

Setting up the environment:

Before diving deep, it’s essential to set up the environment by installing the required packages. If you’re working on Google Colab, these commands will come in handy:

!pip install openai
!pip install tiktoken

Loading necessary modules and API Key:

After installation, you can load the essential libraries. A critical step here is to load your OpenAI API key securely. The code below takes a modular approach by importing the key from a Python file (config):

from google.colab import drive
import os
from pathlib import Path
import openai
import requests
import uuid
import pandas as pd
import tiktoken
import sys
import time  # used later to pace API calls between requests
import gzip
from config import OPENAI_KEY

Obtaining OpenAI API Key:

To use OpenAI’s GPT-3.5, you must sign up on the OpenAI platform. Once signed up, you can access your API key from the dashboard. Always keep your API key confidential; for better security, avoid hardcoding it directly into scripts. In this guide, the key is imported from a Python file: create a config.py file containing OPENAI_KEY = “YOUR OPENAI KEY”, make it importable (see the sketch below), and then run the check that follows.
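If you keep config.py on Google Drive (the drive import above suggests a Colab-plus-Drive setup), one way to make it importable is the sketch below; the folder path is just a placeholder, so adjust it to wherever you saved config.py:

from google.colab import drive
import sys

# Mount Google Drive and add the folder containing config.py to the import path.
# '/content/drive/MyDrive/openai_project' is a hypothetical location.
drive.mount('/content/drive')
sys.path.append('/content/drive/MyDrive/openai_project')

from config import OPENAI_KEY  # config.py contains: OPENAI_KEY = "YOUR OPENAI KEY"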

if OPENAI_KEY:
    print("OpenAI Key loaded successfully!")
else:
    print("Failed to load OpenAI Key!")

openai.api_key = OPENAI_KEY

Crafting the Perfect Prompt:

One of the critical aspects of using GPT-3.5 effectively is crafting a well-defined prompt. A good prompt provides clear context, specifies the desired output, and helps guide the model’s response in the right direction. In essence, the prompt is your way of communicating your intent to the model.
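As a tiny illustration of those ingredients (a made-up example, not the prompt used later in this post), a prompt can state the task, pin down the output format, and clearly delimit the input:

# Hypothetical example: task, output format, and delimited input are all explicit.
review = "The fabric feels thin, but the fit is great and shipping was fast."
prompt = (
    "You will be given a product review delimited by triple backticks. "
    "Classify its sentiment and reply with exactly one word: Positive or Negative.\n"
    f"```{review}```"
)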

For those keen on diving deeper into the art of prompt engineering, Andrew Ng’s ChatGPT Prompt Engineering for Developers course is a goldmine of information.

Evaluating Reviews with OpenAI:

OpenAI’s API provides chat completions, which are perfect for our use case. Here’s how the code sets it up:

def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0  # degree of randomness of the model's output
    )
    return response.choices[0].message["content"]

The function get_completion sends a prompt to the GPT-3.5 model and returns its response.

model: the name of the model you want to use (e.g., gpt-3.5-turbo, gpt-4, gpt-3.5-turbo-0613, gpt-3.5-turbo-16k-0613).

messages — in the code above, a list of message objects, where each object has two required fields:

  • role: the role of the messenger (either system, user, or assistant)
  • content: the content of the message (e.g., Write me a beautiful poem)

temperature: the degree of randomness of the model’s output. Setting it to 0 makes the responses as deterministic as possible, which is what we want for consistent scoring.
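As a quick sanity check, you can call the function with any short prompt (the text below is just a made-up example):

# Hypothetical test call to confirm the API key and helper work end to end.
sample_prompt = "Summarize this review in five words: 'The shoes fell apart after a week.'"
print(get_completion(sample_prompt))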

Counting Tokens with Tiktoken:

If you’re cautious about the number of tokens you’re sending to the API (since you’re billed per token), tiktoken is a useful tool. When you submit your request, the API transforms the messages into a sequence of tokens. For more information, see OpenAI’s documentation on counting tokens with tiktoken.

The number of tokens used affects:

  • the cost of the request
  • the time it takes to generate the response
  • when the reply gets cut off from hitting the maximum token limit (4,096 for gpt-3.5-turbo or 8,192 for gpt-4)

You can use the following function to count the number of tokens in a piece of text.

def count_tokens(text: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    token_count = len(encoding.encode(text))
    return token_count
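For gpt-3.5-turbo, the matching encoding is cl100k_base; tiktoken can also look the encoding up by model name. A quick example of both:

sample_text = "Please evaluate the following review, providing a score from 0 to 4."
print(count_tokens(sample_text, "cl100k_base"))

# Equivalent lookup by model name instead of encoding name.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(encoding.encode(sample_text)))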

Loading and Preprocessing Reviews:

The provided code reads a dataset of Amazon fashion reviews stored in a JSON gzip file:

# file_path should point to the gzip-compressed JSON file of reviews (e.g., a file on your mounted Drive)
# Open and read the gzip-compressed JSON file, and load the first 100,000 rows into a pandas DataFrame
with gzip.open(file_path) as f:
    df = pd.read_json(f, lines=True, nrows=100000)

# Keep only the few columns we need
df = df[['overall', 'verified', 'reviewTime', 'reviewText']]

# Drop the non-verified reviews, and reset the DataFrame's index after filtering rows
df = df.drop(df[df['verified'] == False].index).reset_index(drop=True)

# Remove rows with missing values
df = df.dropna(subset=['reviewText', 'overall'])

# Adjust the 'overall' column values to convert the 1-5 star rating system to a 0-4 scale
df['overall'] = df['overall'] - 1

# Extract the first 1,000 rows for testing and reset their index
df_test = df.iloc[:1000, :].reset_index(drop=True)

After loading, the code keeps only the essential columns, drops unverified reviews, removes missing values, shifts the 1–5 star ratings to a 0–4 scale, and carves out the first 1,000 rows as a test set.
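Because the weighted metrics used later depend on how balanced the star ratings are, a quick look at the label distribution can be worthwhile (an optional check, not part of the original code):

# Distribution of the adjusted 0-4 ratings in the test slice.
print(df_test['overall'].value_counts().sort_index())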

Automating the Evaluation:

To evaluate reviews, we loop through each one and prompt GPT-3.5 for a score:

responses = []
for text in df_test['reviewText']:
    prompt = f"Please evaluate the following review, providing a score from 0 to 4. \
Note that 0 denotes a poor review while 4 signifies an excellent one. \
Your response should simply be a single integer between 0 and 4. \
If the review references any photographs or similar elements, please disregard them. \
Here's the review: \"{text}\""
    response = get_completion(prompt)
    responses.append(response)
    time.sleep(10)

This process sends each review text to the model and waits for a response. The sleep time ensures you don’t overload the API with requests.
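If a long run trips over rate limits or transient API errors, a simple retry wrapper can keep it alive. This is a rough sketch, not part of the original code, and the retry count and wait time are arbitrary:

import time

def get_completion_with_retry(prompt, retries=3, wait=20):
    # Try a few times with a fixed pause between attempts; you could narrow the
    # except clause to the OpenAI library's rate-limit error if you prefer.
    for attempt in range(retries):
        try:
            return get_completion(prompt)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(wait)
    return None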

A note about the backslash

  • We are using a backslash \ to make the text fit on the screen without inserting newline '\n' characters.
  • GPT-3.5 isn’t really affected by whether you insert newline characters or not. But when working with LLMs in general, you may want to consider whether newline characters in your prompt affect the model’s performance.

Evaluating the Model:

Now that we have automated the review evaluation process using OpenAI’s GPT-3.5, it’s crucial to evaluate how well the model performed. This step ensures that the model’s predictions are in line with what we expect; if not, we can go back to the drawing board and improve our approach.

For this, we can employ various evaluation metrics to gauge the model’s performance.
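The metrics below assume a result_df DataFrame with a 'True' column (the original rating) and a 'Predicted' column (the model’s reply). The post doesn’t show how it was built; one plausible way to assemble it from the responses list collected above is:

# Coerce the model's raw replies to numbers; anything non-numeric becomes NaN and is dropped.
predicted = pd.to_numeric(pd.Series(responses), errors='coerce')

result_df = pd.DataFrame({
    'True': df_test['overall'].iloc[:len(predicted)].values,
    'Predicted': predicted
}).dropna()
result_df['Predicted'] = result_df['Predicted'].astype(int)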

# Compute evaluation metrics to assess model performance
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error)

accuracy = accuracy_score(result_df['True'], result_df['Predicted'])
print("Accuracy:", accuracy)

precision = precision_score(result_df['True'], result_df['Predicted'], average='weighted')
print("Precision:", precision)

recall = recall_score(result_df['True'], result_df['Predicted'], average='weighted')
print("Recall:", recall)

f1 = f1_score(result_df['True'], result_df['Predicted'], average='weighted')
print("F1 score:", f1)

mae = mean_absolute_error(result_df['True'], result_df['Predicted'])
print("MAE:", mae)

After running our model on the dataset, we’ve obtained the following results:

  • Accuracy: 0.531
  • Precision: 0.5937
  • Recall: 0.531
  • F1 Score: 0.5365
  • Mean Absolute Error (MAE): 0.557

When dealing with multi-class classification metrics, the interpretation of precision, recall, and F1 score is slightly different than in a binary classification setting. The average parameter in the metrics functions specifies how to compute these metrics for multiclass problems:

Weighted Average: This takes into account the true underlying distribution of the classes. Each class’s metric is weighted by the number of true instances of that class. This is particularly useful when there’s an imbalance in class distribution.
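To make “weighted” concrete, here is a tiny, self-contained illustration (toy labels, unrelated to the review data) showing that the weighted score is just the per-class scores averaged by class support:

import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([0, 0, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 2, 2, 1])

# Per-class precision and each class's share of the true labels.
per_class = precision_score(y_true, y_pred, average=None, labels=[0, 1, 2], zero_division=0)
support = np.array([(y_true == c).sum() for c in [0, 1, 2]])

weighted_by_hand = (per_class * support / support.sum()).sum()
print(weighted_by_hand)
print(precision_score(y_true, y_pred, average='weighted', zero_division=0))  # same value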

  1. Accuracy: It measures how many of the model’s predictions are correct. In this case, about 53.1% of the predictions were correct. This is a foundational metric, and it suggests there’s room for improvement.
  2. Precision: Precision evaluates how many of the model’s positive predictions were correct. The weighted precision gives a score that accounts for class imbalance. Here, the model’s predictions for each class were correct about 59.37% of the time, with each class weighted by its number of true instances.
  3. Recall: Recall measures how many of the actual positive cases we caught. The weighted recall, like weighted precision, adjusts for the class distribution. The model identified 53.1% of the actual instances, considering the class distribution.
  4. F1 Score: This is a balance between precision and recall, and it is particularly useful when classes are unevenly distributed. Here it means our model has a balanced performance between precision and recall, with a score of 53.65%, considering the class distribution.
  5. Mean Absolute Error (MAE): Commonly used in regression analysis, MAE quantifies how close predictions are to the eventual outcomes; a lower MAE indicates better prediction accuracy. On average, there’s an error magnitude of 0.557 between the predicted and actual values, which suggests the model’s predictions are just over half a rating point away from the true scores.

By computing these metrics, we get a holistic view of our model’s performance. This is essential not only to gauge the current model’s efficiency but also to make iterations and improve upon it.

Understanding Zero-shot, One-shot, and Few-shot inference:

When interacting with GPT-3, you might come across terms like zero-shot, one-shot, and few-shot inference. These terms refer to how the model is prompted on the task:

  • Zero-shot inference: The model is not provided any example but is directly asked to perform the task.
  • One-shot inference: The model is provided a single example to infer the task.
  • Few-shot inference: The model is given a few examples before being asked to perform the task.

In our review evaluation example, the prompt provided can be considered a zero-shot approach.
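For contrast, a one-shot version of the same prompt (a sketch only; it isn’t run in this post) would include exactly one worked example before the review to be scored:

one_shot_prompt = f"""
Your task is to evaluate the following review, providing a score from 0 to 4, \
where 0 denotes a poor review and 4 an excellent one. \
Reply with a single integer between 0 and 4.

<review>: Velcro backs are no good. They don't stay on. Look old after one use.
0

<review>: \"\"\"{df_test['reviewText'][25]}\"\"\"
"""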

Few-shot Inference: A Hands-on Example

Having understood the basic principles of zero-shot, one-shot, and few-shot inference, it’s time to see few-shot in action. The goal is to determine whether providing GPT-3.5 with multiple examples helps the model better understand and execute the desired task.

In the earlier example, our prompt was a zero-shot approach: we provided context and an instruction, but no explicit examples. For few-shot inference, by contrast, we’ll enrich the prompt with a set of examples to guide the model. This can be thought of as a mini training phase, even though GPT-3.5 doesn’t actually “train” on these examples; rather, it uses them as context to anchor its responses.

Here’s how it’s done:

prompt = f"""
Your task is to evaluate the following review, providing a score from 0 to 4. \
Note that 0 denotes a poor review while 4 signifies an excellent one. \
Your response should simply be a single integer between 0 and 4. \
If the review references any photographs or similar elements, please disregard them. \

<review>: Velcro backs are no good. They don't stay on. Look old after one use.
0

<review>: Really cheap looking for the price and the thing in the middle fell off after one day.
1

<review>: Super cute but way too big for infants!
2

<review>: Great costume to wear once or twice, but the stitching doesn't hold up to extended use. Very pleased with it.
3

<review>: Bought this for my sons blessing outfit. He looked super handsome in it. I pinned a little bow tie to add some color and it was perfect. The color is actually super white. Material is soft and comfy. I ordered a 3 month size and he wore it at 6 weeks. Was worried it would be too big but it fit perfectly.
4

<review>: \"\"\"{df_test['reviewText'][25]}\"\"\"
"""

response = get_completion(prompt)
print(response)

The structure of this prompt is explicit. It provides GPT-3.5 with five example reviews and their respective scores, setting the stage for the model to evaluate an unseen review. By providing these samples, we’re showing the model the kind of evaluations we are looking for.
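To produce the few-shot results reported below, the same kind of loop as in the zero-shot run can wrap this prompt. Here is a sketch; build_few_shot_prompt is a hypothetical helper, shortened to two of the five worked examples for brevity (extend it with the rest as needed):

def build_few_shot_prompt(text):
    # Hypothetical helper: same structure as the prompt above, abbreviated for brevity.
    return f"""
Your task is to evaluate the following review, providing a score from 0 to 4. \
Note that 0 denotes a poor review while 4 signifies an excellent one. \
Your response should simply be a single integer between 0 and 4. \
If the review references any photographs or similar elements, please disregard them.

<review>: Velcro backs are no good. They don't stay on. Look old after one use.
0

<review>: Super cute but way too big for infants!
2

<review>: \"\"\"{text}\"\"\"
"""

few_shot_responses = []
for text in df_test['reviewText']:
    few_shot_responses.append(get_completion(build_few_shot_prompt(text)))
    time.sleep(10)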

The benefits of using few-shot inference include:

  • Contextual grounding: Providing examples offers context, which might help the model better understand nuanced instructions or domain-specific tasks.
  • Flexibility: You can tailor the examples to closely match the nature of the task you want GPT-3.5 to perform, potentially improving its accuracy.
  • Reduced ambiguity: With examples, the chances of the model misinterpreting your instructions decrease.

However, it’s essential to note that few-shot inference may not always yield better results than zero-shot or one-shot. The choice of approach largely depends on the nature of the task and the specificity of the desired output.

Few-shot Inference Results: An Analysis

Having experimented with few-shot inference, it’s time to review the performance metrics and contrast them with our initial zero-shot approach.

Here are the evaluation metrics obtained from the few-shot inference:

  • Accuracy: 0.344
  • Precision: 0.633
  • Recall: 0.344
  • F1 Score: 0.308
  • Mean Absolute Error (MAE): 0.804

At a glance, the few-shot inference didn’t fare as well as the zero-shot approach on most metrics. The accuracy is notably lower, indicating that the model’s predictions didn’t align as closely with the ground truth as they did in the zero-shot run. It’s interesting, however, that precision increased, suggesting that when the model committed to a given score, that score was more often correct, even though fewer predictions were right overall.

The drop in the F1 score implies the balance between precision and recall wasn’t achieved as effectively as before. Furthermore, the Mean Absolute Error (MAE) indicates that, on average, our predictions were off by approximately 0.804 points.

What can we infer from this?

Few-shot inference provides the model with a series of examples to use as a context. In some cases, this might help the model understand nuanced tasks better. However, the nature of the examples provided plays a significant role. The reviews in our few-shot examples spanned a range of scores, from poor to excellent, yet the model might have given more weight to certain examples over others, or perhaps the unseen reviews had nuances not captured by our examples.

It’s also worth noting that GPT-3.5 doesn’t “learn” in the traditional sense from these examples; instead, it uses them to contextualize its responses. Thus, while it’s provided with a few prior examples, its vast training data still heavily influences its outputs. A better approach might be fine-tuning, which is now available through OpenAI.

The contrast in results between the zero-shot and few-shot approaches in our experiment underscores the importance of iterative experimentation with GPT-3.5. Depending on the nature of the task, the data at hand, and the desired output, you may find varying success with different prompting strategies.

While few-shot didn’t outperform zero-shot in our specific case, it remains a valuable tool in the GPT-3.5 arsenal. It showcases the flexibility and adaptability of the model, giving users diverse options for tailoring and optimizing their interactions.

Conclusion:

Thanks to tools like OpenAI’s GPT-3.5, we can now easily automate tasks like checking reviews. The given code is just a starting point, and we can always make it better. In the world of technology, every step we take, whether big or small, helps us learn and improve. By experimenting and pushing the limits of tools like GPT-3.5, we’re finding better ways to use them. So, keep exploring and happy coding!
