How We Leverage Cosine Similarity for Fine-Tuning Dataset Estimation

by Denis Kosenkov · May 7, 2024

At Agoda, we handle approximately 50,000 emails from suppliers and customers daily, and this number is continuously increasing. Efficiently classifying response emails from suppliers, hotels, or customers is crucial for businesses focused on streamlining their operations and enhancing customer satisfaction. Utilizing AI, particularly GPT models, has transformed how we classify these emails into specific classes. However, the challenge often lies in preparing an adequate and complete dataset for fine-tuning these models to achieve high accuracy.


To address this, the Customer Experience Group (CEG) Automation team at Agoda developed an innovative approach that cuts down the time-consuming dataset creation process for GPT model fine-tuning. By introducing cosine similarity, we can estimate the minimum dataset size required for each class, thus expediting the dataset preparation phase. This blog post details the methodology we applied, the outcomes of the experiment, and the pivotal role of cosine similarity in optimizing the fine-tuning process of GPT models for email classification.

Key Challenges in Dataset Preparation for GPT Models

Fine-tuning is widely used in Agoda’s email automation tasks because it allows us to “teach” GPT our domain knowledge and decrease prompt size at the same time.

The primary challenge in fine-tuning GPT models lies in dataset preparation. Collecting and labeling a large corpus of emails is time-consuming and requires substantial human effort. Therefore, it is crucial to understand the minimum amount of data we should prepare to still benefit from fine-tuning.

Key Terms Explained

To make this article easier to read, let’s first define the key terms used throughout.

Fine-tuning

Fine-tuning is the process of taking a GPT model and further training it on a smaller, targeted data set. Fine-tuning aims to maintain the original capabilities of a pre-trained model while adapting it to suit more specialized use cases. You can think of it as embedding many examples into the GPT itself without putting them into your prompt.
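To make this concrete, here is a minimal sketch of what one labeled training example could look like when preparing a fine-tuning dataset for a chat model. The file name and email text are invented for this sketch; the intent labels follow the classes discussed later in this post.

```python
import json

# A sketch of one labeled email serialized for chat-model fine-tuning.
# The email text below is invented for illustration.
example = {
    "messages": [
        {"role": "system",
         "content": "Classify the email into: Waiver Approved, Waiver Denied, or Uncertain."},
        {"role": "user",
         "content": "Hello, we confirm the cancellation fee will be waived for this booking."},
        {"role": "assistant", "content": "Waiver Approved"},
    ]
}

# Fine-tuning datasets are commonly stored as JSONL: one example per line.
with open("waiver_finetune.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```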

Embeddings

GPT is a large language model typically built using a transformer architecture. Transformers are a type of neural network well-suited for natural language processing tasks. Neural networks cannot work directly with words and sentences; they can only perform operations on numbers. Therefore, we need a way to transform our text input into an array of numbers so that GPT can do its work. That is where embeddings come into play.

There are multiple ways to calculate embeddings for text, from classic techniques such as TF-IDF, Word2Vec, and GloVe to modern model-based ones. More advanced embeddings not only convert text to numbers but also capture semantic meaning in a high-dimensional space.

Cosine Similarity

Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space, serving as an indicator of similarity between these vectors.

Cosine similarities on a 2-D graph
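In code, cosine similarity is simply the dot product of two vectors divided by the product of their norms. A minimal NumPy sketch:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors; ranges from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy 2-D vectors pointing in roughly the same direction score close to 1.
print(cosine_similarity(np.array([1.0, 0.2]), np.array([0.9, 0.3])))  # ~0.99
```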

Developing the Solution

After estimating the time it would take to prepare a relatively large, manually cleaned dataset, we realized it was crucial to reduce the amount of data we use as much as possible.

Initially, we noticed that it is easier to classify clearly different texts like “yes” and “no” than relatively close sentences like “blue bicycle” and “cyan bicycle”. But how can we measure the similarity between two texts? Numerous algorithms exist for converting texts into comparable numbers or vectors.

For example, TF-IDF is better suited to data retrieval, Word2Vec helps with semantic analysis, and GloVe can be used to find word analogies. However, none of them fully answers the question we care about: how similar do the inputs look to GPT? Further research led us to OpenAI’s embedding API, which generates embeddings from input text. These embeddings capture the context and semantics of the texts in a way that is as close to GPT’s own representation as we can get, which allows us to use them as the input for comparison.

So, what is the most effective method for comparing these embeddings? Since they are arrays of floats, we can treat them as vectors and use cosine similarity as the metric (a short sketch follows the list below). Cosine similarity fits our needs for the following reasons:

  • It considers the directions of the vectors.
  • Its value varies from -1 to 1, which helps us easily split the well-known range into buckets.
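Here is a rough sketch of how two emails could be embedded and compared this way, assuming the OpenAI Python SDK (v1+). The embedding model name and the example emails are illustrative assumptions, not necessarily what we use in production.

```python
import numpy as np
from openai import OpenAI  # assumes the openai Python SDK (v1+) is installed

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(text: str) -> np.ndarray:
    # The model name is an illustrative choice; any OpenAI embedding model
    # returns a list of floats we can treat as a vector.
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically close emails should score noticeably higher than unrelated ones.
approved = embed("We are happy to waive the cancellation fee for this booking.")
denied = embed("Unfortunately, the cancellation fee cannot be waived.")
print(cosine_similarity(approved, denied))
```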

Experiment

The use case chosen for the experiment is the classification of responses to cancellation fee waiver requests. The distribution of intents in the dataset that passed QA was the following:

Table 1: Data distribution between classes

“Uncertain” means any other minor intent.

Our focus is to increase precision on the classes we are automating: “Waiver Approved” and “Waiver Denied”. If GPT is unsure, we ask it to answer “Uncertain”. As the first step, we created a vector for each intent by averaging all email vectors of that intent, which gives the “ideal” representation of the intent for GPT. As you can see, some intents are closer to each other, and some are a little further apart.

Intent vectors in 3D
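A minimal sketch of that averaging step follows. In practice, each list would hold the embeddings of QA-approved emails for that intent (for example, produced by the embed() helper sketched earlier); the 2-D vectors here are toy stand-ins so the code stays runnable.

```python
import numpy as np

def intent_centroid(email_embeddings: list[np.ndarray]) -> np.ndarray:
    """Average all email vectors of one intent to get its 'ideal' representation."""
    return np.mean(np.stack(email_embeddings), axis=0)

# Toy 2-D vectors standing in for real embeddings of labeled emails.
labeled_embeddings = {
    "Waiver Approved": [np.array([0.9, 0.1]), np.array([0.8, 0.2])],
    "Waiver Denied":   [np.array([0.1, 0.9]), np.array([0.2, 0.8])],
}
centroids = {intent: intent_centroid(vecs) for intent, vecs in labeled_embeddings.items()}
```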

The plot gives us a descriptive picture, but we need exact numbers to move forward and estimate the amount of data required. To get them, we calculate the cosine similarity for each pair of intent vectors and obtain a full table.

Table 2: Cosine similarities between classes

The next question is how to use these numbers to estimate the t-shirt size of data required for a class. Our main goal is to ensure that GPT will have enough examples to “learn” the difference between classes. It therefore makes sense to take the largest number in each column: it is the similarity with the class that GPT finds closest.
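As a sketch, given the per-intent centroids from the previous snippet, we can build the pairwise similarity table and pick each class’s highest cross-class similarity in a few lines:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_other_class(centroids: dict[str, np.ndarray]) -> dict[str, float]:
    """For each intent, return its highest cosine similarity with any *other* intent."""
    return {
        name: max(
            cosine_similarity(vec, other)
            for other_name, other in centroids.items()
            if other_name != name
        )
        for name, vec in centroids.items()
    }

# Feeding in the toy centroids from the previous sketch gives, per class,
# the similarity to whichever class GPT would find closest to it.
```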

The last thing we should do is split the cosine similarity range into t-shirt buckets. This is empirical work, and we came up with the following approach:

Table 3: t-shirt sizing based on cosine similarity
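The exact thresholds from Table 3 are not reproduced in text here, so the cut-offs in this sketch are placeholders; it only illustrates the shape of the mapping from a class’s maximum cross-class similarity to a t-shirt bucket.

```python
def tshirt_size(max_cross_similarity: float) -> str:
    """Map a class's highest cross-class cosine similarity to a t-shirt bucket.

    The cut-off values below are illustrative placeholders, not the real
    thresholds from Table 3: the closer a class sits to its nearest neighbor,
    the more examples it gets.
    """
    if max_cross_similarity < 0.5:
        return "S"
    if max_cross_similarity < 0.8:
        return "M"
    return "L"
```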

Once all preparations were completed, we ran four fine-tunings with the following data distributions. We chose 30 items for the Flat dataset because that is the minimum number of items available for every class (please refer to Table 1).

Table 4: “M-size” is the number of items we set for an M-size bucket. The sizes of the other t-shirt buckets are derived from the multipliers in Table 3.

Outcome of Our Experiment

We used a dataset of 5,000 items collected from professional agent decisions to evaluate the fine-tuned models and got the following results:

Table 5: Fine-tuned model evaluation results

Here:

  • processing rate: the share of incoming data that our prompt will automate.
  • coverage: the share of “Waiver Approved” and “Waiver Denied” items that we correctly detect.
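For clarity, here is a rough sketch of how these two metrics could be computed over the evaluation set, assuming each item records the model’s prediction and the agent’s label; the field names are invented for this sketch.

```python
def evaluate(items: list[dict]) -> tuple[float, float]:
    """Each item is assumed to look like {"predicted": <class>, "actual": <class>}."""
    automated = {"Waiver Approved", "Waiver Denied"}

    # processing rate: share of incoming items the model answers with an automatable class
    processed = [i for i in items if i["predicted"] in automated]
    processing_rate = len(processed) / len(items)

    # coverage: share of actual "Waiver Approved" / "Waiver Denied" items detected correctly
    relevant = [i for i in items if i["actual"] in automated]
    correct = sum(1 for i in relevant if i["predicted"] == i["actual"])
    coverage = correct / len(relevant) if relevant else 0.0

    return processing_rate, coverage
```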

Conclusion

Our findings from the experiments demonstrate a significant reduction of up to 30% in dataset requirements. This means we can painlessly reduce the QA effort spent on fine-tuning dataset preparation by up to 30% (it may take some extra time to find enough examples of certain classes) and cut overall QA effort by at least 15% while maintaining the same level of quality.

This efficiency is gained without sacrificing model accuracy, which is crucial when automating classification tasks in business operations built around rapid digital communication. Moreover, the t-shirt sizing strategy for bucketing cosine similarity scores illustrates an empirical, scalable method that can be applied across different datasets and scenarios.
