Practical Considerations for Developing Applications with ChatGPT on Large-scale Data Sets

Albert Pujol Torras
Published in Hotel Tech Stories · 11 min read · Jun 20, 2023
Image generated with BlueWillow.

This post, a sequel to a previous discussion on the influence of language on ChatGPT’s performance in text classification tasks, seeks to provide practical insights into developing applications with Large Language Models (LLMs) for handling large volumes of data.

At The Hotels Network, we offer a dynamic content customization platform to over 19,000 hotels worldwide, improving user experience and website performance. Understanding our clients’ web pages and user behavior is crucial, and the multilingual capabilities of Large Language Models (LLMs) are key.

Managing vast volumes of web content and daily user visits necessitates specialized LLM strategies. In this post, we delve into some of those strategies, outlining key aspects of application development and deployment. We also discuss three methodologies for handling large data volumes: training machine learning models on embeddings to reproduce ChatGPT completions, employing information retrieval with embeddings to form prompt context, and responding to natural language inquiries on tabular data using ChatGPT-generated code.

Functional Considerations

In this section, we delve into several aspects that we believe are relevant when designing an application with ChatGPT API. These factors, encompassing both technical and non-technical considerations, heavily influence the feasibility and viability of different approaches when addressing data-intensive problems.

1.-Cost: The cost of ChatGPT API calls is determined by the number of tokens (words or word fragments) in both the input question and the generated response. As of June 16, 2023, these costs vary significantly (as shown in the following charts), especially between text completion requests and embedding generation requests.

Cost per 1k prompt tokens for different ChatGPT completion and embedding models
Cost per 1k generated tokens for different OpenAI completion models

The cost of text completion models may make them unsuitable for certain large-scale applications, especially when compared with the cost-efficiency of embedding-based processing. Since the cost depends heavily on the number of generated tokens, prompts should explicitly discourage unnecessary characters, symbols, and irrelevant comments in the output, which also makes automated post-processing more efficient.
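
As a rough illustration of how these numbers add up, the following sketch estimates the cost of a batch of completion requests. The per-1k-token prices are illustrative placeholders, not official figures; substitute the current values from OpenAI's pricing page for your model.

# Rough cost estimator for a batch of completion requests.
# The per-1k-token prices below are assumed example values (USD),
# not official figures: replace them with the prices for your model.
PROMPT_PRICE_PER_1K = 0.0015
COMPLETION_PRICE_PER_1K = 0.002

def estimate_batch_cost(n_requests, avg_prompt_tokens, avg_completion_tokens):
    prompt_cost = n_requests * avg_prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
    completion_cost = n_requests * avg_completion_tokens / 1000 * COMPLETION_PRICE_PER_1K
    return prompt_cost + completion_cost

# Example: classifying 1 million texts with ~300 prompt tokens and ~5 output tokens each
print(f"Estimated cost: ${estimate_batch_cost(1_000_000, 300, 5):,.2f}")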

2.- Latency: API response times significantly impact user experience and application performance. The response time of the ChatGPT API text completion is proportional to the number of tokens generated in the response. This relationship is depicted in the following graph, which illustrates the measured latency for various text completion experiments. The graph demonstrates the time taken to generate the response based on the number of words generated, considering different ChatGPT models.

Latency for different API requests depending on model and the number of words generated in the response

According to the empirical results presented in the following graph and tables, GPT-3.5 shows slightly faster performance than the other alternatives considered.

Mean latency per word generated for ChatGPT models.

For this reason, when designing the prompt for a text completion system it is advisable to explicitly ask the model to avoid adding unnecessary comments or text. Similarly, when specifying the output format required for automatic processing, we should avoid unnecessary symbols and punctuation marks, since they become tokens and slow down the response time of our system.
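
To get a feel for this relationship with your own prompts, a minimal timing sketch such as the one below can be useful. It assumes the legacy openai Python SDK used throughout this post and an OPENAI_KEY environment variable; the two example prompts are purely illustrative.

import os
import time
import openai

openai.api_key = os.getenv('OPENAI_KEY')

def timed_completion(prompt, model="gpt-3.5-turbo"):
    """Return the response text, the number of generated tokens and the latency in seconds."""
    start = time.time()
    result = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    latency = time.time() - start
    generated_tokens = result['usage']['completion_tokens']
    text = result['choices'][0]['message']['content']
    return text, generated_tokens, latency

# Compare a free-form prompt against one that constrains the output length
for prompt in ["Describe the benefits of a sea-view room.",
               "Describe the benefits of a sea-view room in at most five words."]:
    _, n_tokens, secs = timed_completion(prompt)
    print(f"{n_tokens} tokens generated in {secs:.2f}s "
          f"({secs / max(n_tokens, 1):.3f}s per token)")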

3.-Request Limitations: Another important consideration when handling large volumes of data is the rate at which we can make API requests before being blocked by OpenAI servers. The OpenAI API has recently relaxed the limits on concurrent requests, as summarized in the following table based on values provided by OpenAI.

Limitations on the number of requests and tokens per minute when utilizing OpenAI completion models as of June 2023.

Limitations in latency and request rates often render large-scale text completion with GPT impractical, especially in circumstances that necessitate swift response times or extensive data processing.
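
When a batch job does hit these limits, a simple retry-with-backoff wrapper keeps the pipeline running. The sketch below is one minimal way to do it with the legacy openai SDK; the retry count and delays are arbitrary assumptions.

import time
import openai

def completion_with_backoff(messages, model="gpt-3.5-turbo",
                            max_retries=6, base_delay=2.0):
    """Retry a chat completion with exponential backoff when the API rejects
    the request (e.g. rate limits or transient server errors)."""
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(model=model, messages=messages, temperature=0)
        except openai.error.OpenAIError as e:
            wait = base_delay * (2 ** attempt)
            print(f"Request failed ({e}); retrying in {wait:.0f}s...")
            time.sleep(wait)
    raise RuntimeError("Exceeded maximum number of retries")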

4.-Data Privacy: Data privacy is a factor that requires careful consideration. We must ask whether we are willing or able to send our data, or our clients' data, to third-party services. Solutions based on embeddings can alleviate these privacy concerns as long as we generate the embeddings locally, eliminating the need to send data to other companies. This approach provides both efficiency and data security.

5.-Prompt strategy: Writing an effective prompt is crucial. A well-constructed prompt is designed to yield results aligned with our goals. There’s a wealth of information available online about prompt generation strategies. Key aspects to consider in prompt creation include:

  • Clearly define the context and the goal.
  • Explicitly instruct the model to adhere to the provided context, to prevent deviation from the facts reported within that context.
  • Outline the desired response format to ensure two main outcomes: first, the response is directly usable by automatic methods, or requires minimal additional processing; second, the response includes only the information we need, in the shortest possible length, i.e. with the minimum number of generated tokens. As shown above, unnecessary tokens increase both cost and latency.

In any case, it’s a sound idea to gauge the outcomes of various prompt options. There can be substantial differences in the results, and the optimal prompt technique may be contingent on the model in use.

6.-Model consistency: The development of machine learning algorithms often faces a dilemma between improving the algorithm's accuracy and maintaining the consistency of the model's output. Although optimizing responses can be beneficial, output consistency is crucial to keep users' trust in the model's outputs and to avoid breaking processes that depend on them. This problem also arises with OpenAI, as "improvements" to the models can change their responses, forcing users to adjust their interactions. Here too, training models for simple tasks, using the embeddings of open models as inputs and the outputs obtained from GPT as targets, isolates us from OpenAI model adjustments.

Comparison of GPT-3.5-Turbo before and after June 13 on multilingual text content classification.

The preceding graph shows the performance of the gpt-3.5-turbo model in two versions, before and after June 13, 2023. It reveals a significant decrease in accuracy on text content categorization experiments similar in format to those conducted in the previous article, which measured model performance on text classification across different languages. Although the experiment was not designed to compare models, but rather to compare the effect of prompts and languages within the same model, the results clearly indicate a degradation in accuracy when using the exact same prompt and model across the two time periods, whenever the text to be classified is not written in English.

Strategies for Scalable Usage of ChatGPT

Prior sections have highlighted the challenges of working with completion GPT models at scale, particularly in terms of cost, latency, and request limitations. This section introduces three distinct strategies to navigate these problems when dealing with large datasets. We will demonstrate each strategy through practical examples of Python implementation:

1.-Model training to reproduce ChatGPT completions from embeddings: ChatGPT can be employed to solve a complex task by generating responses for a specific subset of text samples, thereby producing a training dataset composed of the ChatGPT requests and their replies.

Utilizing this dataset, a model can be trained to map the embeddings of the ChatGPT request texts to the corresponding desired responses generated by ChatGPT. This strategy becomes particularly effective when the responses to be completed can be expressed in categorical, numerical, or binary terms.

Such an approach facilitates the substitution of original texts with their corresponding embeddings. These embeddings can be computed using OpenAI models, or even other models free for commercial use, offering a cost-effective alternative to conventional text completion methods.

2.-Context prompt creation: This strategy employs embeddings to identify and index content that is similar or related to the query that needs to be answered. By leveraging LangChain, vector databases such as Pinecone or Nuclia, and embedding models, it builds a context from a vast amount of text data. This context, consisting of similar or related texts, is then appended to the original query. The enriched query, now carrying relevant context, is forwarded to ChatGPT for processing.
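
As a library-agnostic illustration of the idea (rather than how LangChain or any specific vector database implements it), the following sketch retrieves the most similar documents by cosine similarity over precomputed embeddings and prepends them to the prompt; the document list, their embeddings, and the query embedding are assumed to exist already.

import numpy as np

def top_k_context(query_embedding, doc_embeddings, documents, k=3):
    """Return the k documents whose embeddings are most similar to the query
    (cosine similarity), to be prepended to the prompt as context."""
    doc_matrix = np.asarray(doc_embeddings)
    query = np.asarray(query_embedding)
    scores = doc_matrix @ query / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query) + 1e-9)
    best = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in best]

def build_contextual_prompt(question, context_docs):
    """Assemble the enriched query sent to ChatGPT."""
    context = "\n\n".join(context_docs)
    return (f"Answer the question using only the context below. "
            f"If the answer is not in the context, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")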

3.-Using ChatGPT to generate data query code: This strategy provides access to voluminous tabular data through automatically generated query functions over pandas data frames; PandasAI is an implementation of this strategy.
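
A much-simplified sketch of this idea is shown below. It is not how PandasAI works internally, just the bare mechanism: ask the model for a single pandas expression and evaluate it. Executing model-generated code is risky, so in practice it should be sandboxed and validated.

import openai
import pandas as pd

def answer_tabular_question(df: pd.DataFrame, question: str, model: str = "gpt-3.5-turbo"):
    """Ask the model for a single pandas expression answering the question, then evaluate it.
    Illustrative only: evaluating generated code should be sandboxed in production."""
    prompt = (f"Given a pandas DataFrame `df` with columns {list(df.columns)}, "
              f"write a single Python expression using `df` that answers: {question}. "
              "Return only the expression, with no comments or explanations.")
    result = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Real responses may need more cleanup (e.g. stripping code fences)
    expression = result['choices'][0]['message']['content'].strip().strip('`')
    return eval(expression, {"df": df, "pd": pd})

# Example usage on a toy table of hotel bookings
df = pd.DataFrame({"hotel": ["A", "A", "B"], "revenue": [120, 80, 200]})
print(answer_tabular_question(df, "What is the total revenue per hotel?"))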

We will dedicate the rest of this article to exploring in detail the first strategy, with the intention to discuss the following strategies in an upcoming post.

Training Classifiers on Embeddings to Mimic GPT Chat Behavior

To illustrate the first strategy described, let’s consider this problem: Our goal is to rank hotel recommendations based on the positivity of the image they project. We will ask ChatGPT to compare two reviews and indicate which one presents a more positive image of the hotel. Once we’ve obtained these individual comparisons, we can establish a positivity rating. This rating could be derived from the number of times a hotel is deemed more positive when compared with others, or through more sophisticated metrics such as the Elo rating.
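
For illustration, a minimal scoring step based on win ratios could look like the sketch below. The comparison format matches the "1" / "2" / "TIE" answers we request from ChatGPT later in this post, and an Elo-style update would be a drop-in alternative.

from collections import defaultdict

def win_ratio_scores(comparisons):
    """Derive a positivity score per review from pairwise comparison results.

    `comparisons` is a list of (review_id_1, review_id_2, winner) tuples,
    where winner is "1", "2" or "TIE". The score is simply the fraction of
    comparisons a review wins.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for id_1, id_2, winner in comparisons:
        games[id_1] += 1
        games[id_2] += 1
        if winner == "1":
            wins[id_1] += 1
        elif winner == "2":
            wins[id_2] += 1
    return {review_id: wins[review_id] / games[review_id] for review_id in games}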

However, due to system latency and costs, running this process with ChatGPT over the full dataset isn't feasible. Instead, we can use embeddings and classifier training to replicate the behavior of GPT. First, we select a set of recommendations, form pairs, and ask ChatGPT which member of each pair projects a more positive connotation. Based on these responses, we train classifiers that take the embeddings of the text pairs as inputs and mimic the behavior of ChatGPT.

The prompt used to determine which of a pair of reviews is more positive could look something like this:

import os
import openai

openai.organization = "my_organization"
openai.api_key = os.getenv('OPENAI_KEY')


def compare_review_sentiments(message1: str, message2: str, model: str = "gpt-3.5-turbo") -> str:
    prompt = f'Based on the following two hotel reviews:\n\n1: "{message1}"\n2: "{message2}"' \
             '\n\nPlease determine which review portrays a more positive impression to potential customers.' \
             ' Choose between "1", "2", or "TIE" if both reviews seem equally positive or if it is ' \
             'unclear which is superior. Your response should only include "1", "2", or "TIE", ' \
             'without any additional comments. The most positive review is:'

    # Create a chat completion request with the selected model (gpt-3.5-turbo by default)
    result = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert in marketing with specialized experience in hotels."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        max_tokens=3000
    )

    # Extract and return the response
    return result['choices'][0]['message']['content'].strip()

Then, given a list of reviews, we could generate a dataset containing random pairs of reviews:

import numpy as np

reviews = ['review 1', 'review 2', ..., 'review n']  # list of review texts
n_reviews = len(reviews)

# For each review, pick a random different review to compare it with
r1 = list(range(n_reviews))
r2 = [np.random.choice([y for y in range(n_reviews) if y != x]) for x in r1]

We then use ChatGPT to determine, for each pair, which review is more positive.

import time

n_reviews = len(reviews)

forward = []
backward = []
i = 0
while i < n_reviews:
    try:
        # Compare sentiments from review 1 to review 2, and vice versa
        f = compare_review_sentiments(reviews[r1[i]], reviews[r2[i]])
        b = compare_review_sentiments(reviews[r2[i]], reviews[r1[i]])
        forward.append(f)
        backward.append(b)
        i += 1
    except Exception as e:
        # Sleep when the OpenAI API raises an error due to the requests-per-minute limit
        print(f'Processing pair {i} resulted in an error: {e}. Entering sleep mode...')
        time.sleep(20)

We filter for pairs of reviews with a clear sentiment difference: specifically, those where the forward and backward assessments are opposite and neither has been classified as a tie. In our experiments, 78% of the pairs were clearly classified. We will use this dataset to train and test the models.
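
A minimal sketch of this filtering step, building on the forward and backward lists from the previous snippet, could look like this (label 0 means the first review of the pair is more positive, label 1 the second):

# Keep only pairs where the two orderings agree and neither answer is a tie
pairs, labels = [], []
for i in range(len(forward)):
    if forward[i] == "1" and backward[i] == "2":
        label = 0   # the first review of the pair is more positive
    elif forward[i] == "2" and backward[i] == "1":
        label = 1   # the second review of the pair is more positive
    else:
        continue    # tie or contradictory answers: discard the pair
    pairs.append((r1[i], r2[i]))
    labels.append(label)

print(f"{len(pairs) / len(forward):.0%} of the pairs were kept")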

We will evaluate this approximation of the process of determining which of two reviews is more positive by computing embeddings with OpenAI's API using the 'text-embedding-ada-002' model, and comparing them with embeddings generated by the 'all-mpnet-base-v2', 'stsb-xlm-r-multilingual', 'all-MiniLM-L6-v2', and 'bert-base-multilingual-uncased-sentiment' models, all of which are distributed under Apache 2.0 or MIT licenses and can be installed on our own machines.

The code to generate the embeddings with OpenAI's text-embedding-ada-002 model would look like this:


MODEL = "text-embedding-ada-002"

def generate_embeddings(inputs):
    res = openai.Embedding.create(
        input=inputs, engine=MODEL
    )
    return [x["embedding"] for x in res['data']]


ada_embeddings_result = []

# Process reviews in chunks of 100
for i in range(0, len(reviews), 100):
    # Get next chunk of reviews
    chunk = reviews[i:i + 100]

    # Generate embeddings for chunk
    embeddings = generate_embeddings(chunk)

    # Append embeddings to result list
    ada_embeddings_result.extend(embeddings)

The review embeddings for the remaining models can be generated with the following code:

import torch
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F

LOCAL_MODEL_CACHE_PATH = "/my_hugin_face_local_model_cache_folder"

def _mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


class TextModel():

    def __init__(self, model_name, device='cuda:0'):
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name,
                                                       cache_dir=LOCAL_MODEL_CACHE_PATH)
        self.model = AutoModel.from_pretrained(self.model_name,
                                               cache_dir=LOCAL_MODEL_CACHE_PATH).to(device)
        self.device = device

    def encode_text(self, texts):
        encoded_input = self.tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to(self.device)
        with torch.no_grad():
            model_output = self.model(**encoded_input)

        # Mean-pool the token embeddings into a single sentence embedding
        sentence_embeddings = _mean_pooling(model_output, encoded_input['attention_mask'])

        # L2-normalize and move back to CPU
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        sentence_embeddings = sentence_embeddings.to('cpu')

        return sentence_embeddings.numpy()


embedding_model = TextModel(model_name)

embeddings_result = []
# Process reviews in chunks of 25
for i in range(0, len(reviews), 25):
    # Get next chunk of reviews
    chunk = reviews[i:i + 25]

    # Generate embeddings for chunk
    embeddings = embedding_model.encode_text(chunk)

    # Append embeddings to result list
    embeddings_result.extend(embeddings)

Models are downloaded from Hugging Face the first time they are used and stored in "/my_hugin_face_local_model_cache_folder" on your local disk. The complete name of the model has to be provided to the TextModel constructor; the complete model names are: "sentence-transformers/all-mpnet-base-v2", "sentence-transformers/stsb-xlm-r-multilingual", "sentence-transformers/all-MiniLM-L6-v2", and "nlptown/bert-base-multilingual-uncased-sentiment".
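
For example, assuming there is enough memory to load each model in turn, embeddings for all four models could be generated by reusing the TextModel class above:

# Loop over the four models mentioned above; each call downloads the model
# on first use and caches it in LOCAL_MODEL_CACHE_PATH.
model_names = [
    "sentence-transformers/all-mpnet-base-v2",
    "sentence-transformers/stsb-xlm-r-multilingual",
    "sentence-transformers/all-MiniLM-L6-v2",
    "nlptown/bert-base-multilingual-uncased-sentiment",
]
embeddings_per_model = {
    name: TextModel(name).encode_text(reviews[:25])  # first 25 reviews as a quick check
    for name in model_names
}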

Once we obtain the embeddings, we can proceed to train and evaluate our models. To illustrate this process, we conducted experiments using a smaller dataset of 1,000 reviews. We trained neural networks with different topologies on a dataset whose dimensionality was reduced to 100 components via Principal Component Analysis (PCA), and we measured the accuracy of the entire process using 10-fold cross-validation. The results for each set of embeddings are presented in the subsequent chart.

Comparison results on ChatGPT-3.5-turbo completion approximation using the embeddings of different models

The results from this reduced dataset suggest that we can closely approximate the outcomes achieved with ChatGPT completion in this task using embeddings. Additionally, the ‘bert-base-multilingual-uncased-sentiment’ model, fine-tuned for sentiment classification, performs comparably to the OpenAI model. A key advantage of the ‘bert-base-multilingual-uncased-sentiment’ model is that it can be run locally, thereby circumventing data privacy concerns and request limitations associated with the OpenAI model.
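
For concreteness, here is a minimal sketch of how this training and evaluation step could be reproduced with scikit-learn. The post does not specify the exact network topologies or whether PCA is fitted per fold, so the choices below (a small MLPClassifier, PCA on the individual review embeddings, concatenated pair features) are assumptions; the sketch reuses embeddings_result from the embedding snippets and pairs/labels from the filtering sketch above.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

embeddings = np.asarray(embeddings_result)

# Reduce each review embedding to 100 dimensions with PCA
# (for a stricter evaluation, PCA should be refitted inside each fold)
reduced = PCA(n_components=100).fit_transform(embeddings)

# One training sample per pair: the concatenation of both review embeddings
X = np.array([np.concatenate([reduced[a], reduced[b]]) for a, b in pairs])
y = np.array(labels)

# A small feed-forward network evaluated with 10-fold cross-validation
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")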

Conclusions

ChatGPT’s versatility allows for the automation of a variety of complex tasks. However, when we use it in production systems with large data volumes, we encounter issues related to latency, API request limits, costs, consistency across model updates, and privacy.

These problems can be mitigated or even eliminated by using ChatGPT to label datasets. These labeled datasets can then be used to train models that, using embeddings, replicate the behavior of ChatGPT.
