Named Entity Recognition: Matching Products to Genders

Published in

Moosend Engineering & Data Science

6 min read · May 13, 2019

What is named entity recognition?

Named Entity Recognition is the process through which we seek and identify information units in unstructured text and classify them into predefined categories including persons, organizations, date-time expressions, locations, etc.

A little backstory

When I set out to put together the latest, fastest Product Recommendation system, despite my best intentions, I couldn’t have foreseen certain aspects.

One of these is Named Entity Recognition, which is road bump number two along the way.

Hey, we’ve already done this before, in part one about Recommender Systems and part two about Product Clustering, and now it’s time for part three.

In this article, we will discuss the improvements we can make in regard to product clustering.

Our main issue is that, with the current clustering technique, products in the same cluster end up attributed to different genders. Which means it’s not working. But it should. Oh, and it will.

When it comes to social interaction, gender is one of the variables that can divide people into categories.

The recommendation engine that we are currently implementing does something similar, as it tries to identify similar customers. Hence, gender information will be of major importance.

But before we continue, I’ll need to point out again that we have no information about the customer, other than their product interactions.

Named Entity Recognition: Working with Data from Products

The only data we can use as far as the product itself is concerned is the product title. This means that all the named entity extraction can be done only through text.

In Named Entity Recognition the most common recognizable elements are:

Organizations
Names
Brands
Geographical Locations

So, we analyze and extract the selected entities of a text using a parser.

But what are the entities we could recognize from product titles?

In our case, the named entity worth recognizing is the brand name, because we need to exclude it from the product clustering process.

This means that the remaining terms stay intact.

However, product titles are too short and most brand names don’t have the appropriate format, so named entity recognition through a parser (for syntax analysis) will not work properly.

Therefore, we will need another process for the recognition aspect.

As a Moosend employee, I have access to its database, which contains a lot of data, including one of the biggest brand lists.

And having a large brand list at your disposal means that you can simply remove the brands that you identify from the products, through a regular expression.
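To make the idea concrete, here is a minimal sketch of brand removal with a regular expression. The brand list, the function names and the sample title are all made up for illustration; the real list is far larger and lives in a database.

```python
import re

# Hypothetical brand list, only for illustration.
BRANDS = ["Nike", "Calvin Klein", "Gillette"]

def build_brand_pattern(brands):
    # Put longer names first so multi-word brands are tried before shorter ones.
    ordered = sorted(brands, key=len, reverse=True)
    alternation = "|".join(re.escape(brand) for brand in ordered)
    # The lookarounds stop a brand from matching inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)

def strip_brands(title, pattern):
    # Replace each brand with a space, then collapse the leftover whitespace.
    without_brands = pattern.sub(" ", title)
    return " ".join(without_brands.split())

BRAND_PATTERN = build_brand_pattern(BRANDS)
print(strip_brands("Nike Air Zoom Men's Running Shoes", BRAND_PATTERN))
# prints: Air Zoom Men's Running Shoes
```

The remaining terms of the title are left untouched, so they can go straight into the clustering step.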

The second named entity recognized is the gender of a product, taking into account the consumers it is addressed to. So, we need to create 4 categories of products: male, female, kids and neutral.

Disclaimer: Now, for the purposes of clarity, I will be following a binary logic of distinguishing between male and female items.

But I kindly ask for the LGBTQ community’s understanding. Lipstick will fall into the female category, and razors under the male one.

When designing a Product Recommender system for one of your clients, you will need to pre-define what falls into what category, be it in terms of gender, be it in terms of age, culture, and so on.

We proceed with the creation of some hard-coded rules to determine male, female and kids’ products.

Some of these rules can be the words “Men”, “Women”, “Kids” in text.

The products that can’t be categorized from the title alone, as well as unisex products, fall under the “neutral” category.
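A rough sketch of such hard-coded rules could look like the following. The keyword lists are assumptions that go a little beyond “Men”, “Women” and “Kids”, and the choice to send titles that match more than one category to “neutral” is mine; a real rule set would be tuned to the catalogue.

```python
import re

# Illustrative keyword rules; the exact word lists are assumptions.
RULES = {
    "male": re.compile(r"\b(?:men|mens)\b", re.IGNORECASE),
    "female": re.compile(r"\b(?:women|womens|ladies)\b", re.IGNORECASE),
    "kids": re.compile(r"\b(?:kids?|boys?|girls?)\b", re.IGNORECASE),
}

def product_gender(title):
    # Collect every category whose keywords appear in the title.
    matches = [label for label, pattern in RULES.items() if pattern.search(title)]
    # Exactly one match is a clear signal; none or several means "neutral".
    return matches[0] if len(matches) == 1 else "neutral"

for title in ["MEN'S SHAVING CREAM", "Women's Running Shoes", "SHAVING CREAM"]:
    print(title, "->", product_gender(title))
```

Word boundaries matter here: without them, “men” would also fire on “Women”.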

Named Entity Recognition: Recognizing Customer’s Interest

We can categorize customers by their interests or, in our case, by the gender category they interact with the most.

Customer categorization uses the same categories as product categorization:

Male,
Female,
Kids,
and Neutral

When we complete the process of product categorization, we proceed to calculate the percentage of product genders every user interacts with.

The next step is to set a threshold on the percentages. Customers whose percentage for a gender category is higher than the threshold are assigned to that category.

For this step, I would recommend a threshold between 0.8 and 0.9, depending on the products you recognized in the previous process.

Below, we present 4 rows of data examples from the process:

Both James and Nick are males but they are interested in different product categories.

In fact, 95% of James’ interactions take place with “male” products and the remaining 5% are with neutral products, or products whose gender we can’t determine, so we clearly categorize James as male.

On the other hand, 92% of Nick’s interactions are with female products, 7% of his interactions are with neutral ones, and 1% is with kids’ products.

Consequently, we determine that Nick is mostly interested in female products. Therefore, we categorize him as female.
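The James and Nick cases can be reproduced with a few lines of Python. This is only a sketch: the interaction counts (100 per customer) are invented to match the percentages above, the 0.85 default is just a value inside the recommended range, and falling back to “neutral” when no category clears the threshold is my assumption.

```python
from collections import Counter

GENDERED = ("male", "female", "kids")

def gender_shares(product_genders):
    """Fraction of a customer's interactions per product gender category."""
    counts = Counter(product_genders)
    total = sum(counts.values())
    return {label: counts[label] / total for label in counts} if total else {}

def customer_category(product_genders, threshold=0.85):
    shares = gender_shares(product_genders)
    # With a threshold above 0.5, at most one category can exceed it.
    for label in GENDERED:
        if shares.get(label, 0.0) > threshold:
            return label
    # Assumption: customers without a dominant category stay "neutral".
    return "neutral"

# Hypothetical interaction logs matching the percentages in the text.
james = ["male"] * 95 + ["neutral"] * 5
nick = ["female"] * 92 + ["neutral"] * 7 + ["kids"] * 1

print(customer_category(james))  # prints: male
print(customer_category(nick))   # prints: female
```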

Eternal loops between customers and products

We are about to feed everything into the algorithm and see how that goes.

One issue with the process is that, in many cases, a product’s title is not a reliable indicator of its gender category.

Not every item is conveniently named “MEN’S SHAVING CREAM”. Some are just named “SHAVING CREAM”, while others are named “DREAM CREAM” or “VENUS LEGS” (Marketing teams, what can you do.)

As a result, a great number of products end up in the neutral category.

In order to reduce the number of products in the neutral category, in the last step of the process, we create an “exchange loop” (don’t look it up, I made the name up) between products and customers.

More specifically, once we get the biggest part of our data from products and customers, we have to merge information together.

We represent every recognized customer by their interactions with each product gender, forming a vector for each one. Then we take the customers who are almost perfectly classified into a gender category (more than 98%).

Named Entity Recognition Example:

Let’s suppose that Jane purchases a lipstick, a dress, a skirt, and a bag.

We place every product in a vector (like the image below) and label the recognized products with their gender category. After that, we calculate the percentage of items that fall within the same category.

In detail:

Jane’s purchases show us that 98% of her products are labeled as female and 2% as neutral, so we recognize her as female because she is mostly interested in “female” products.

Now, her products fall into only two groups: those already classified into a gender category and those still neutral.

This leads to the assumption that neutral products will probably be categorized in the same gender category as the rest.

If we detect the same pattern with the same product multiple times (in 90% of the data that contains that pattern), we classify the product into that gender category.
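Here is one way this step might be sketched. I am reading the 90% rule as “at least 90% of the confidently classified customers who interacted with a neutral product share the same gender category”, and `min_support` is an extra knob I added to capture “multiple times”; both are assumptions, as are all the names and the toy data.

```python
from collections import Counter, defaultdict

def relabel_neutral_products(interactions, product_gender, confident_customers,
                             min_agreement=0.9, min_support=2):
    # Each confidently classified customer votes once per neutral product.
    votes = defaultdict(Counter)
    for customer, gender in confident_customers.items():
        for product in set(interactions.get(customer, [])):
            if product_gender.get(product) == "neutral":
                votes[product][gender] += 1

    updated = dict(product_gender)  # leave the input mapping untouched
    for product, counter in votes.items():
        gender, count = counter.most_common(1)[0]
        total = sum(counter.values())
        # Relabel only when enough customers agree on the same category.
        if total >= min_support and count / total >= min_agreement:
            updated[product] = gender
    return updated

# Toy data, purely illustrative.
interactions = {
    "jane": ["lipstick", "dress", "dream cream"],
    "mary": ["skirt", "dream cream"],
    "james": ["razor", "shaving cream"],
}
product_gender = {
    "lipstick": "female", "dress": "female", "skirt": "female",
    "razor": "male", "dream cream": "neutral", "shaving cream": "neutral",
}
confident = {"jane": "female", "mary": "female", "james": "male"}

result = relabel_neutral_products(interactions, product_gender, confident)
print(result["dream cream"])    # prints: female
print(result["shaving cream"])  # prints: neutral
```

“SHAVING CREAM” stays neutral here because only one confident customer has interacted with it, which is below the assumed minimum support.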

Simply put:

If someone has purchased four “female” category items, we will “deliberately” assume that the fifth one will also be “female”.

Correspondingly, when we see “replicas” of similar purchase patterns, we will also assume that these are one or the other, and therefore attribute them accordingly.

After we have categorized all the products that we can identify, we repeat the customer categorization with the new data, and we keep repeating the process until we can’t identify any more products or customers.
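The whole exchange loop can be sketched as a fixed-point iteration: classify customers, relabel neutral products, and stop once a round changes nothing. Everything below is an illustrative sketch with made-up names and toy data. The demo uses a threshold of 0.75 only because each toy customer has four or fewer interactions; with real data you would stay close to 0.98.

```python
from collections import Counter, defaultdict

def classify_customers(interactions, product_gender, threshold):
    """Label customers whose share of one gendered category reaches the threshold."""
    labels = {}
    for customer, products in interactions.items():
        counts = Counter(product_gender.get(p, "neutral") for p in products)
        counts.pop("neutral", None)  # neutral products never decide a label
        if counts:
            gender, count = counts.most_common(1)[0]
            if count / len(products) >= threshold:
                labels[customer] = gender
    return labels

def relabel_products(interactions, product_gender, customer_labels, min_agreement):
    """Give neutral products the category their labeled customers agree on."""
    votes = defaultdict(Counter)
    for customer, gender in customer_labels.items():
        for product in set(interactions[customer]):
            if product_gender.get(product, "neutral") == "neutral":
                votes[product][gender] += 1
    updated = dict(product_gender)
    for product, counter in votes.items():
        gender, count = counter.most_common(1)[0]
        if count / sum(counter.values()) >= min_agreement:
            updated[product] = gender
    return updated

def exchange_loop(interactions, product_gender, threshold=0.98,
                  min_agreement=0.9, max_rounds=20):
    # Products only ever move from neutral to a category, so this terminates;
    # max_rounds is just a safety net.
    for _ in range(max_rounds):
        customers = classify_customers(interactions, product_gender, threshold)
        updated = relabel_products(interactions, product_gender, customers, min_agreement)
        if updated == product_gender:
            break
        product_gender = updated
    return classify_customers(interactions, product_gender, threshold), product_gender

# Toy data, purely illustrative.
interactions = {
    "jane": ["lipstick", "dress", "skirt", "dream cream"],
    "anna": ["dream cream", "dress", "skirt", "venus legs"],
    "james": ["razor", "shaving cream"],
}
products = {"lipstick": "female", "dress": "female", "skirt": "female", "razor": "male"}

customers, products_after = exchange_loop(interactions, products, threshold=0.75)
print(customers)
print(products_after)
```

In this toy run, Jane is classified first, which turns “DREAM CREAM” into a female product; that in turn pushes Anna over the threshold, and her vote classifies “VENUS LEGS”. James never reaches the threshold, so “SHAVING CREAM” stays neutral.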

Conclusion

In my next article, I will take you back to square 1 of Product Recommender systems, only this time we will revisit everything that didn’t work last time.

Essentially, we will be monitoring the performance of our recommendation engine after we applied the BRAND NEW product clustering and the SHINY NEW named entity recognition.