How to Use LLMs to Build Better Clustering Models

Justin Swansburg
9 min read · May 13, 2023

--

A quick guide on a new way to leverage large language models to improve the separation across your clusters

Overview

With all the hype around large language models (LLMs), I started brainstorming ways to incorporate them into my everyday data science workflows. I figured that if OpenAI's ChatGPT was the fastest service ever to reach a million users, there must be a bunch of clever ways to leverage it.

In the past I’ve written about applying these models to supervised problems, but the rest of this post is going to discuss how to apply them to unsupervised models. In particular, clustering models.

Introduction to Clustering

Clustering is an unsupervised machine learning technique that aims to group similar data points together based on their features. Finding relevant clusters can be helpful for all sorts of problems such as customer segmentation, anomaly detection, and text classification to name just a few. Despite their widespread use, however, traditional clustering techniques still present challenges.

The primary challenge I want to address in this article is choosing how to encode or transform your input features. Generally speaking, you need to transform every feature to the same scale; otherwise, your clustering model will weight some features disproportionately more than others.

Need an example? Imagine we had two columns, one reporting length in centimeters and the other in inches. Without first standardizing these measures, our model will see larger numeric differences between lengths measured in centimeters than between the same lengths measured in inches (for similarly sized objects), even though the physical lengths are identical.
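
Here's a minimal sketch of that effect with made-up lengths: the raw distance between the same two objects depends entirely on the unit, while standardizing each column first puts both units on equal footing.

import numpy as np
from sklearn.preprocessing import StandardScaler

# The same three objects, measured in centimeters and then in inches
lengths_cm = np.array([[100.0], [150.0], [200.0]])
lengths_in = lengths_cm / 2.54

# Raw distance between the first two objects differs purely because of the unit
print(abs(lengths_cm[0] - lengths_cm[1]))  # [50.]
print(abs(lengths_in[0] - lengths_in[1]))  # [19.68...]

# After standardizing each column, the distances are identical
scaled_cm = StandardScaler().fit_transform(lengths_cm)
scaled_in = StandardScaler().fit_transform(lengths_in)
print(abs(scaled_cm[0] - scaled_cm[1]))  # [1.2247...]
print(abs(scaled_in[0] - scaled_in[1]))  # [1.2247...]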

Let's use a feature that consists of various different colors as another example. A common approach is to one-hot encode this feature into n (or n-1) additional columns, where n is the number of unique colors. While this works, it ignores any potential relationship between the colors.

Why is this a problem? Suppose one of the features in your dataset takes the following colors: red, maroon, crimson, scarlet, and green. If we were to one-hot encode this column, we'd get a dataframe that looks something like this:
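
That table is easy to recreate; here's a quick sketch with pandas on a toy color column:

import pandas as pd

colors = pd.DataFrame({"color": ["red", "maroon", "crimson", "scarlet", "green"]})
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)
print(one_hot)
#    color_crimson  color_green  color_maroon  color_red  color_scarlet
# 0              0            0             0          1              0
# 1              0            0             1          0              0
# 2              1            0             0          0              0
# 3              0            0             0          0              1
# 4              0            1             0          0              0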

In Euclidean space, each of these newly encoded rows is equally far from every other. In case you don't believe me, let's quickly prove it.

import numpy as np

def euclidean_distance(vector1, vector2):
    if len(vector1) != len(vector2):
        raise ValueError("Vectors must have the same length.")

    # Sum the squared element-wise differences, then take the square root
    squared_differences = [(a - b) ** 2 for a, b in zip(vector1, vector2)]
    distance = np.sqrt(sum(squared_differences))
    return distance

This function calculates the distance between any two vectors of the same length. We can test it by computing the distance between red and maroon and comparing that to the distance between red and green. With any luck, they should be exactly the same:
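
Using one-hot vectors for red, maroon, and green (one column per color; the ordering doesn't affect the distances):

red    = [1, 0, 0, 0, 0]
maroon = [0, 1, 0, 0, 0]
green  = [0, 0, 0, 0, 1]

print(round(euclidean_distance(red, maroon), 2))  # 1.41
print(round(euclidean_distance(red, green), 2))   # 1.41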

Boom! 1.41 and 1.41 — equally distant, just as promised.

Can we do better?

Sure, red and maroon are two different colors, but for the sake of our clustering algorithm do we really want the difference between them to be just as large as the difference between red and green? Probably not.

So how do we go about addressing this shortcoming?

If you read the title of this post, I'm sure you can guess where this is heading… we're going to incorporate LLMs! Rather than one-hot encode or standardize our input features, we're going to create a single text string for each row in our dataset and run it through an LLM to get back an embedding.

For this example, I'm going to use the sentence-transformers library from Hugging Face and a dataset I synthetically created around job applications.

Let's start with sentence transformers. This LLM works similarly to BERT, except that it's specially trained to output embeddings at the sentence level rather than the word or token level. These sentence-level embeddings do a better job of capturing meaning and are far quicker to compute.

import pandas as pd
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

def compile_text(x):
    # Concatenate every feature for one applicant into a single descriptive string
    text = (
        f"Age: {x['Age']} Gender: {x['Gender'].lower()} Role: {x['Role']} "
        f"Hiring Department: {x['HiringDepartment']} "
        f"Travel Preference: {x['TravelPreference']} Extracurriculars: {x['ExtraCurriculars']} "
        f"Distance From Home: {x['DistanceFromHome']} "
        f"Internships: {x['Internships']} Education Level: {x['EducationLevel']} Education Field: {x['EducationField']} "
        f"Summary: {x['Summary']}"
    )
    return text

def output_embedding(txt):
    # Encode the text and return the 384-dimensional embedding as a one-row dataframe
    embd = model.encode(txt)
    return pd.DataFrame(embd.reshape(-1, 384))

def preprocess_text(x):
    txt = compile_text(x)
    embd = output_embedding(txt)
    return embd

# Embed every applicant and stack the resulting rows into a single dataframe
embeddings_df = pd.concat(
    [preprocess_text(row) for _, row in df.iterrows()], ignore_index=True
)

Now for the dataset. Our dataset includes information about job applicants such as the hiring department, role, age, and education level, among other features. Here’s a snapshot:

The goal is to segment all of our job applicants into distinct, well-separated clusters.

Let’s take a look at how we apply our sentence embedding to each of our job applicants. The first step is to create a single text field by concatenating all of our features into a string.

Age: 28.
Gender: male.
Role: Research Scientist.
Hiring Department: Research & Development.
Travel Preference: Travel_Frequently.
Extracurriculars: nan.
Distance From Home: 4.
Internships: 9.
Education Level: 3.
Education Field: Engineering.
Summary: As you can see, I am very dedicated and I am ready to start at your firm immediately.

Once we get back the rows of newly converted text, we can call the SBERT LLM and retrieve our embeddings. Here, I've used the pandas dataframe styler functionality to highlight low and high values to make the table a bit easier to scan:
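
If you want to recreate that view, one minimal option (assuming the embeddings are sitting in the embeddings_df dataframe we built above) is the pandas background gradient styler:

# Shade each cell from low to high values so patterns are easier to spot (renders in a notebook)
embeddings_df.head(10).style.background_gradient(cmap="RdYlGn", axis=None)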

So far, I’ve only explained why we may not want to use more traditional encoding steps. I haven’t yet explained why we may prefer using embeddings instead.

Rather than make a theoretical case, let me share another concrete example. Much like we explored encoding colors earlier, let’s take a look at roles. I want to test how similar roles are depending on our encoding strategy. Rather than use euclidean distance, I’m going to use cosine similarities. What’s the difference?

  • Euclidean distance measures the geometric distance between two points, while cosine similarity measures the angle (orientation) between the vectors.
  • Euclidean distance is sensitive to the magnitudes of the vectors, whereas cosine similarity is not.
  • Euclidean distance ranges from 0 (identical vectors) to infinity, while cosine similarity ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction).

Let’s pick two potential roles: sales representative and sales executive.

The cosine similarity for a sales representative and a sales executive using our one-hot encoding technique is 0.5, meaning they are somewhat related. This makes sense since they’re both sales roles.

The cosine similarity using our new embedding approach is 0.82. They are far more closely related. This makes even more sense, since a sales representative and a sales executive are extremely similar roles in practice.
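
Here's a rough sketch of where numbers like these come from. The 0.5 falls out if the role text is tokenized and one-hot encoded at the word level (the two roles share only the word "sales"), and the embedding similarity comes from running both role strings through the same SBERT model; the exact embedding value depends on the model checkpoint, so treat 0.82 as indicative rather than exact.

import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Word-level one-hot vectors over the vocabulary [sales, representative, executive]
rep_onehot  = np.array([1.0, 1.0, 0.0])
exec_onehot = np.array([1.0, 0.0, 1.0])
onehot_sim = rep_onehot @ exec_onehot / (np.linalg.norm(rep_onehot) * np.linalg.norm(exec_onehot))
print(round(onehot_sim, 2))  # 0.5

# Embedding-based similarity between the two role strings
model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
emb_sim = cos_sim(model.encode("Sales Representative"), model.encode("Sales Executive"))
print(float(emb_sim))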

Comparison Time

Now that we've worked through some of the why, let's run a test and see if our theory that passing embeddings to a clustering algorithm will improve our results actually holds true.

To start, let's build out a standard clustering pipeline on our hiring dataset. Since we have categorical, numeric, and free-text features, we're going to need to pre-process and standardize them before we run our clustering algorithm. Instead of manually building out a scikit-learn pipeline (or asking ChatGPT to write one for me), I'm going to save some time and leverage DataRobot's automated ML platform.
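
For reference, a hand-rolled version of that pipeline in scikit-learn might look something like the sketch below (column names are assumed from the dataset above, TF-IDF stands in for the free-text handling, and missing-value imputation is skipped for brevity); it isn't the exact DataRobot blueprint, just the same idea:

from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["Age", "DistanceFromHome", "Internships", "EducationLevel"]
categorical_cols = ["Gender", "Role", "HiringDepartment",
                    "TravelPreference", "ExtraCurriculars", "EducationField"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("txt", TfidfVectorizer(), "Summary"),  # text columns are passed by name, not as a list
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("kmeans", KMeans(n_clusters=3, random_state=42)),
])

cluster_labels = pipeline.fit_predict(df)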

Here’s what the final pipeline looks like:

You can see that I chose a simple K-Means algorithm with 3 clusters for the purposes of this experiment. Since we're comparing different pre-processing techniques, the clustering method and number of clusters don't really matter. What does matter, however, is how the model's predicted cluster labels change depending on which pre-processing strategy we used.

Let's take a look at feature impact to get a better sense of which of our applicants' features are driving our model's segmentation.

Original clustering approach:

We can see that the applicant’s summary is the single most important feature in our clustering model, followed closely by the hiring department and whether the applicant prefers to travel.

To get a slightly better understanding of our 3 clusters, we can output high level summary statistics. In our case, the following table outputs the average value per cluster for each of our numeric features and the most frequent value per cluster for each of our non-numeric features:
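
A table like that is straightforward to build with a groupby, assuming the predicted labels have been added back to the dataframe as a cluster column:

# Average value per cluster for numeric features...
numeric_summary = df.groupby("cluster").mean(numeric_only=True)

# ...and the most frequent value per cluster for everything else
categorical_summary = (
    df.groupby("cluster")
      .agg(lambda col: col.mode().iloc[0])
      .drop(columns=numeric_summary.columns)
)

summary = numeric_summary.join(categorical_summary)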

It doesn't look like we have the best separation across clusters here, does it? For some reason, the most frequent hiring department and role is the same across all of our clusters. Even worse, the Research & Development hiring department doesn't even match the Sales Executive role.

The following table shows the frequency of each unique hiring department and role pairing. We can see that there aren’t any examples of open job postings for sales executives in the R&D department:
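
That pairing table is a one-liner with pandas (same assumed column names as before):

# Count of applicants for every hiring department / role combination
pd.crosstab(df["HiringDepartment"], df["Role"])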

New clustering approach:

I have a feeling we can do better than this! Let’s test out running our embedding approach through a clustering pipeline to see if we can get more intuitive groupings.

The only difference between this pipeline and the last one is that we only have to deal with numeric features, since our embeddings are strictly numerical. So our final pipeline will look like this:

Unfortunately, we can’t just jump to calculating feature impact like we did last time. We’d have many hundreds of unintelligible features with varying importances that we wouldn’t be able to make sense of. So what do we do?

The answer is we can get a bit clever. Let’s train yet another model (this time a supervised three-class classification model) that uses our original feature set to predict the class labels that our embedding model produced. This way we can reproduce our feature impact chart in an apples-to-apples fashion.
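
Here's a minimal sketch of that surrogate-model idea in scikit-learn (the actual chart below comes from DataRobot; the random forest and permutation importance here are just stand-ins, and X_original and embedding_labels are assumed to hold the numerically encoded original features and the cluster labels from the embedding-based model):

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_original, embedding_labels, test_size=0.2, random_state=42
)

# Supervised three-class model that predicts the embedding-based cluster labels
surrogate = RandomForestClassifier(n_estimators=200, random_state=42)
surrogate.fit(X_train, y_train)

# Permutation importance gives an apples-to-apples feature impact ranking
impact = permutation_importance(surrogate, X_test, y_test, n_repeats=10, random_state=42)
for name, score in sorted(zip(X_original.columns, impact.importances_mean),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")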

Here are the results:

You can probably tell the feature ordering is a bit different. Great! We’re hoping our embedding approach found a new way to assign each applicant to our three clusters.

Let’s take a look at a table with the same summary statistics we used above:

So. Much. Better. Even just at a glance our clusters look far more distinct now. In fact, we can see that our embedding approach naturally grouped more sales executives applying for sales roles into cluster 2 and more research scientists applying for R&D roles into clusters 1 and 3. Makes sense to me!

Final Test

There's no clear-cut method to test which of these approaches is best. At the end of the day, these models need to be interpreted with a heavy dose of subjectivity. There is one final test we can run, though, to see whether our embedding approach truly produced more separated clusters.

We can run dimensionality reduction and visualize the clusters across the first two principal components. This way we can color each data point based on its cluster label and view the distribution on a coordinate plane.
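
A minimal sketch of that plot with scikit-learn and matplotlib, assuming X is the (dense) preprocessed feature matrix fed to the clustering model and labels its predicted cluster assignments:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the features onto the first two principal components
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=20)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Cluster assignments projected onto the first two principal components")
plt.show()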

First up is our original model, which uses our standard encoding techniques:

We're only looking at the first two components, but even still, there's a hefty amount of overlap here. This cluster separation isn't giving me a real warm and fuzzy feeling.

Next up, our new approach that leverages sentence-level embeddings:

Immediately you can see much better separation. There is far less overlap, especially between cluster 2 and clusters 1 and 3, which is a beautiful thing.

Mission accomplished.

That’s the end of this post. Give this technique a shot and let me know how it goes. Follow me on Medium and LinkedIn for more helpful data science tips and tricks. Thanks for reading!


Justin Swansburg

Data scientist and AI/ML leader | VP, Applied AI @ DataRobot.