Product Clustering: A Text Clustering Approach

In my previous article, I steered the wheel of my ship toward the “Recommender Systems” islands, but crashed on the rocks of my interaction matrix’s performance.

So now, one by one, over the articles to follow, I will pick up each of the parts that did not perform and try to fix them (spoiler alert: everything gets fixed in the end).

Let us take a quick look at where we left off.

“Painspotting”: Finding the issue

As described in the first part of this series, one of the issues we faced was the following: the interaction matrix from our previous article was too large to handle. This was due to the volume of data we gathered, which contained duplicate information, since the same or very similar products can be found in different shops.

In other words, if you sell headphones and three of your competitors also sell the same brand of headphones, there will be duplicates in your matrix, slowing you down for no reason.

This is why the purpose of this article is to implement cross-shop identification of the same or very similar products.

“Going through the motions”: Merging information

We will use product information (namely Product Code, Product Title, Product URL and Product Price), as provided by our data set.

Now, every shop uses its own in-house system to track its products, so the same product carries a different Product Code in every shop; codes cannot be matched across shops.

Worse, we can’t rely on Product Prices either, since prices vary from one shop to another.

This leaves us with two options: Product Title and the Product URL.

The Product URL can be a good source of information if one can build a web scraper to pull the data from the product page, but because every site structures its HTML differently, no single scraper can work for every website.

This leaves us with just one option: the whole clustering has to be done using Product Titles alone.

“Doing the homework”: Text Pre-processing

What is Pre-processing?

Text pre-processing refers to all adjustments that a text must undergo before it is fed to the algorithm.

Follow These Text Clustering Pre-processing Steps

The pre-processing steps we apply to the data are:

  • First, we identify the brands and remove them from the title, so that we are left with the product name alone.
  • Then, we remove words which describe colors in order to reduce the noise in the data, since, at this point, we do not want to split products by color. For instance, we want “Black Converse All Star shoes 10” and “White Converse All Star shoes 10.5” to end up in the same group.
  • Afterwards, we remove numbers and units of measurement (if any) from the title, because we want very similar products, such as “Cola 330ml” and “Cola 500ml”, to fall into the same group.
  • Next, we stem the words, that is, remove each word’s suffix in order to find a common root, and we remove stopwords altogether.
  • Finally, in order to feed the title data into an algorithm, we convert the titles to vectors. To achieve that, we use 2 different vectorizers: CountVectorizer in binary mode, which creates a {0,1} indicator vector, and the tf-idf vectorizer, which weights each word by how often it appears in a title, discounted by how common it is across all titles. We use both to find the one that works better for us; a minimal sketch of all these steps follows this list.
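
Here is a minimal sketch of these steps, assuming scikit-learn and NLTK; the brand, color, and unit lists are illustrative placeholders, not the real ones from our data set:

```python
import re

from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

BRANDS = {"converse", "nike"}   # placeholder brand list
COLORS = {"black", "white"}     # placeholder color list
UNITS = {"ml", "kg", "cm"}      # placeholder units of measurement

stemmer = SnowballStemmer("english")

def preprocess(title: str) -> str:
    # Keep alphabetic tokens only, which also drops numbers like "330"
    tokens = re.findall(r"[a-z]+", title.lower())
    tokens = [t for t in tokens if t not in BRANDS | COLORS | UNITS]
    return " ".join(stemmer.stem(t) for t in tokens)

titles = ["Black Converse All Star shoes 10", "Cola 330ml", "Cola 500ml"]
cleaned = [preprocess(t) for t in titles]  # ['all star shoe', 'cola', 'cola']

# Binary bag-of-words: 1 if the word appears in the title, 0 otherwise
binary_vectors = CountVectorizer(binary=True, stop_words="english").fit_transform(cleaned)
# tf-idf: weights each word by its frequency, discounted by how common it is
tfidf_vectors = TfidfVectorizer(stop_words="english").fit_transform(cleaned)
```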

Moving on: Text Clustering

What is Text Clustering?

Text Clustering is the process through which one can generate groups within unlabeled data. In most clustering techniques, the number of groups is predefined by the user, but in our case the number of cluster groups has to change dynamically.

We can have clusters that contain one single product and clusters that contain 10 or more; the number depends on how many similar products we can find.

Our needs narrow our options in the world of clustering down to DBSCAN. DBSCAN is a density-based algorithm that relies on how close the vectors are to each other in order to form groups.
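
For reference, a minimal sketch of this DBSCAN attempt with scikit-learn on tf-idf vectors; the eps and min_samples values and the sample titles are illustrative, not tuned:

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["cola", "cola light", "capri shoe", "capri shoe red"]
X = TfidfVectorizer().fit_transform(titles)

# No predefined number of clusters: min_samples=1 lets a lone product
# form its own cluster, and eps controls the neighborhood radius.
labels = DBSCAN(eps=0.5, min_samples=1, metric="cosine").fit_predict(X)
print(labels)  # [0 0 1 1]: one cluster per product pair
```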

[Figure: DBSCAN result groups]

I know what you’re thinking: Does the Coyote ever catch Road Runner?

Not that? Ok, how about:

Why does DBSCAN fail to cluster the data correctly?

The title of a product is a very short sentence (1–5 words). However, the vectors we create are very large, because every unique word in our data enters the vocabulary, and the vocabulary size becomes the length of every vector. With only a handful of non-zero entries in such long, sparse vectors, the distances DBSCAN relies on stop carrying useful information.
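
A tiny demonstration with made-up titles; in a real catalogue, the vocabulary, and with it the vector length, can run into the tens of thousands:

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = ["nike capri shoes", "cola light", "converse all star", "samsung tv"]
X = CountVectorizer().fit_transform(titles)
print(X.shape)   # (4, 10): each vector is as long as the whole vocabulary
print(X[0].nnz)  # 3: only 3 non-zero entries for a 3-word title
```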

Dimensionality-reduction techniques such as PCA and SVD won’t help resolve this issue, seeing as every column of the document-term matrix represents a word: when you remove columns, you remove words, and with them the only features that distinguish a lot of products.

Since the ready-made solutions don’t work properly, we decided to build a custom clustering process aimed at solving our problem.

Breaking in your sneakers: Training a vectorizer

When you train a vectorizer, it learns the words that the given sentence contains.

For example, upon being given “Nike Capri Shoes”, the vectorizer learns only these 3 words. This means that when you transform all the other products, their vectors will be filled with 0, except for the products whose titles contain one or more of those words.
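
In scikit-learn terms, the idea looks roughly like this (a sketch; the extra titles are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Train on a single title: the vocabulary is just its 3 words
vectorizer = CountVectorizer(binary=True).fit(["nike capri shoes"])
print(vectorizer.get_feature_names_out())  # ['capri' 'nike' 'shoes']

others = ["nike capri shoes red", "adidas samba shoes", "cola light"]
print(vectorizer.transform(others).toarray())
# [[1 1 1]   -> shares all three words
#  [0 0 1]   -> shares only "shoes"
#  [0 0 0]]  -> shares nothing
```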

To find how similar 2 vectors are, we turn their Euclidean distance into a similarity score: the smaller the distance, the higher the similarity. To put 2 products in the same group, their similarity has to be higher than our threshold. The groups that we generate are called categories.

Think of our data as one big bucket of products. Categories are useful because they create smaller buckets which contain related data that we can process.

Now we create subcategories for every bucket by running the same process again with a higher (stricter) threshold. Subcategories are the final groups we will use.
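
A minimal sketch of this two-pass grouping, assuming scikit-learn; the distance-to-similarity mapping and both threshold values are illustrative assumptions, since the text only fixes the general idea:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

def group_titles(titles, threshold):
    """Greedy pass: each ungrouped title trains a vectorizer, and every
    title similar enough to the seed joins its group."""
    groups, remaining = [], list(titles)
    while remaining:
        seed = remaining.pop(0)
        vec = CountVectorizer(binary=True).fit([seed])
        seed_v = vec.transform([seed]).toarray()
        members, rest = [seed], []
        for title in remaining:
            d = euclidean_distances(seed_v, vec.transform([title]).toarray())[0, 0]
            sim = 1.0 / (1.0 + d)  # assumed distance-to-similarity mapping
            (members if sim >= threshold else rest).append(title)
        groups.append(members)
        remaining = rest
    return groups

titles = ["cola", "cola light", "capri shoe", "capri shoe red"]
categories = group_titles(titles, threshold=0.6)
# Re-run inside every bucket with a higher (stricter) threshold
subcategories = [group_titles(cat, threshold=0.9) for cat in categories]
```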

Changing gears: Tips to Improve Your Speed

The whole process is rather time-consuming. To save time, we run all the text pre-processing steps up front, leaving only the vectorization for later.

After that, we sort our data based on the number of words each title contains, so the titles with 1 word go to the top of the list and the titles with the most words end up at the bottom.

One-word titles will form the majority of our categories, reducing the volume of the data we process.
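
In code, the sorting step is a one-liner (with made-up titles):

```python
titles = ["cola light 330ml", "cola", "nike capri shoes"]
# Fewest words first, so one-word titles seed the first categories
titles.sort(key=lambda t: len(t.split()))
print(titles)  # ['cola', 'cola light 330ml', 'nike capri shoes']
```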

Success! Now making my way out (with flair)

In our next article, we will continue working with any information we can extract from our products.
