Mapping the Universe of E-commerce Brands

Govind Chandrasekhar
Mar 21 · 6 min read

Of the thousands of attributes that we handle while curating product catalogs, the hardest and perhaps most important attribute is brand. Consumers often begin their searches with brand names of the products they’re looking for … which is why our customers (marketplaces, retailers, brands and logistics companies) are keen on having high coverage of standardized values for this field.

Unfortunately, many of the data sources from which we build our catalogs fail to provide brand as an explicit field … or unwittingly carry an incorrect value in the brand field. In these cases, the challenge falls to us to build datasets that are robust to such issues. Specifically, this entails:

  • extracting brand from unstructured fields like name and description,
  • inferring brand where altogether absent,
  • standardizing the brand string to a unique representation consistent across the catalog.

Over the last 7 years, we’ve tried to tackle this problem in many different ways — hacks, statistical methods, NLP, heuristics, algorithms, annotation and more. In this article, I’d like to chronicle some of the approaches that’ve worked for us.

Before we begin though, let’s delve deeper into the nuances that make this problem particularly challenging:

  1. Multiple Representations: The same brand can be represented in multiple forms — think General Electric vs GE or Apple vs AAPL.
  2. Multiple Brands?: Is the Huawei Google Nexus 6P a Huawei phone or a Google phone?
  3. Long-Tail: Global brand conglomerates are easy to map out, but in a world of D2C brands, there’s no clear way to build a defined universal corpus of brands.
  4. Source Data Reliability: Since there’s no defined universe of brands to work with, we have to rely on the very catalog data that we’re looking to augment as our source of knowledge. It doesn’t help that e-commerce sites often have spelling mistakes — we’ve seen values like Samsnug, Samnsug, Sammsung in production listings quite often.
  5. Proper Nouns: Who’s to say that Samsnug isn't a legitimate brand word? Popular knowledge dictates that it probably isn't, but there are no grammatical constructs that help us determine this, since we're dealing with proper nouns here.
  6. Sparseness: Some brands only sell a single product — it’s difficult to differentiate them from actual mistakes.
  7. Similarly Named Brands: Distinct valid brand names often differ by just one character. Sometimes, there isn’t even a string differentiation — did you know that there are two different brands called Remington, one a personal care brand and another a firearms brand?
remington.com vs remingtonproducts.com

We haven’t been able to clear all of these hurdles, but we have made steady progress in chipping away at the problem as a whole. Here are some of the more notable approaches we’ve tried, and the insight underlying each one of them.

1. Co-Occurrence across Fields

Brand names are often referenced multiple times in a product listing, typically in the name and manufacturer fields. For example, {"brand" : "Samsung Electronics", "name" : "Samsung Personal Security Window"} allows us to build the understanding that Samsung Electronics is equivalent to Samsung.

The approach is particularly potent for handling spelling errors: most people who create e-commerce listings make these mistakes out of haste rather than lack of knowledge. As a result, many listings have the same brand name spelt correctly in one field and incorrectly in another. We exploit co-occurrence in these cases to mine common spelling errors. What’s more, since a given brand tends to be misspelt the same way across listings, the results of these efforts even allow us to fix listings where every instance of the brand is spelt incorrectly.
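To make the idea concrete, here is a minimal Python sketch of mining spelling variants from co-occurrence. It is an illustration under stated assumptions, not our production pipeline: the function names, the 0.8 similarity threshold, the restriction to single-token brands, and the rule that the more frequent spelling wins are all assumptions made for this example.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """True when two distinct strings are close enough to be spelling variants."""
    return a != b and SequenceMatcher(None, a, b).ratio() >= threshold


def mine_spelling_variants(listings, threshold=0.8):
    """Return {variant: canonical} mined from brand/name co-occurrence.

    Hypothetical helper: handles single-token brands only.
    """
    frequency = Counter()    # how often each candidate spelling is seen
    pair_counts = Counter()  # how often two similar spellings share a listing

    for listing in listings:
        brand = listing.get("brand", "").strip().lower()
        tokens = set(listing.get("name", "").lower().split())
        if not brand:
            continue
        frequency[brand] += 1
        for token in tokens:
            if similar(token, brand, threshold):
                frequency[token] += 1
                pair_counts[frozenset((token, brand))] += 1

    corrections = {}
    for pair in pair_counts:
        a, b = sorted(pair)
        # Assumption: the spelling seen more often is the correct one.
        canonical, variant = (a, b) if frequency[a] >= frequency[b] else (b, a)
        corrections[variant] = canonical
    return corrections


listings = [
    {"brand": "Samsnug", "name": "Samsung Galaxy S9 Case"},
    {"brand": "Samsung", "name": "Samsung 4K Monitor"},
    {"brand": "Samsung", "name": "Samsnug Soundbar"},
    {"brand": "Samsung", "name": "Samsung Soundbar"},
]
print(mine_spelling_variants(listings))  # {'samsnug': 'samsung'}
```

Once a mapping like this has been mined, it can be applied to listings in which no field carries the correct spelling at all.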

2. Ngram Probabilities (Language Modelling)

3. Web Search on Bloomberg and the Like

Generic web search is useful as well — if a web search of a brand phrase provides top results from a domain name that looks very similar to the input, then the input brand is more likely to be legitimate.
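A rough Python sketch of the domain-similarity check might look like the following. The search step itself is left out: `result_urls` is assumed to be a list of top result URLs that some search API has already returned. The function names, the `top_n` cutoff, and the naive way of picking the domain label (first label after stripping `www.`) are assumptions for illustration only.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def domain_label(url):
    """Naively pull 'remington' out of 'https://www.remington.com/x'.

    Simplification: takes the first host label after 'www.', so a host
    like 'shop.example.co.uk' would yield 'shop'.
    """
    host = urlparse(url).netloc.lower().split(":")[0]
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def brand_domain_score(brand, result_urls, top_n=5):
    """Best similarity (0.0 to 1.0) between the brand and any top result domain."""
    key = re.sub(r"[^a-z0-9]", "", brand.lower())
    best = 0.0
    for url in result_urls[:top_n]:
        label = re.sub(r"[^a-z0-9]", "", domain_label(url))
        if not key or not label:
            continue
        best = max(best, SequenceMatcher(None, key, label).ratio())
    return best


# Hypothetical result list for the query "Remington".
urls = ["https://www.remington.com/", "https://en.wikipedia.org/wiki/Remington"]
print(brand_domain_score("Remington", urls))  # 1.0
```

The score is only one signal. A misspelling such as Samsnug still scores fairly high against samsung.com, so the threshold needs to be strict, or the score combined with other evidence.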

4. Image Matching

This technique works particularly well for categories of products that are displayed with packaging.

When OCR corroborates the text value of a brand, the entry is deemed valid.
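The corroboration step can be sketched in Python as below. The OCR itself is not shown: `ocr_text` is assumed to be the raw text that an OCR engine extracted from the product image. The function name, the 0.85 threshold, and the sliding-window fuzzy match are assumptions for this example, chosen so that small OCR misreads (such as `0` for `o`) are tolerated.

```python
import re
from difflib import SequenceMatcher


def ocr_corroborates(brand, ocr_text, threshold=0.85):
    """True if the brand string appears, approximately, in the OCR output."""
    brand_tokens = re.findall(r"[a-z0-9]+", brand.lower())
    ocr_tokens = re.findall(r"[a-z0-9]+", ocr_text.lower())
    n = len(brand_tokens)
    if n == 0 or len(ocr_tokens) < n:
        return False
    target = " ".join(brand_tokens)
    # Slide a window the length of the brand over the OCR tokens.
    for i in range(len(ocr_tokens) - n + 1):
        window = " ".join(ocr_tokens[i:i + n])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            return True
    return False


# "Acme Foods" is a made-up brand; the OCR text contains a one-character misread.
print(ocr_corroborates("Acme Foods", "ACME FO0DS Crunchy Oats 500g"))  # True
print(ocr_corroborates("Acme Foods", "Generic Crunchy Oats 500g"))     # False
```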

5. Named Entity Recognition with Conditional Random Fields and Bi-Directional LSTMs

That said, there still is information in the structure of product names that can be captured. For example, electronic products usually display brand names at the very beginning of the title, while perfumes and beauty products often do so at the end. These relationships can be captured to a non-trivial extent through CRFs and neural networks built on Bi-Directional LSTMs.

Our NER testing framework
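As an illustration of how positional cues in product names can be exposed to a sequence model, here is a small Python sketch of per-token feature extraction. It produces one feature dict per token, a shape that CRF toolkits such as sklearn-crfsuite accept as input. The specific features, the function names, and the category labels are assumptions for this example, not a description of our actual feature set.

```python
def token_features(tokens, index, category):
    """Build a feature dict for one token of a product title."""
    token = tokens[index]
    last = len(tokens) - 1
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        "is_upper": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_first": index == 0,
        "is_last": index == last,
        "relative_position": index / last if last else 0.0,
        "category": category,
        # Crossing category with position lets a model learn that the first
        # token matters for electronics and the last token for perfumes.
        "category_is_first": f"{category}|{index == 0}",
        "category_is_last": f"{category}|{index == last}",
        "prev_lower": tokens[index - 1].lower() if index > 0 else "<BOS>",
        "next_lower": tokens[index + 1].lower() if index < last else "<EOS>",
    }


def title_to_features(title, category):
    """Turn a whitespace-tokenized title into a list of feature dicts."""
    tokens = title.split()
    return [token_features(tokens, i, category) for i in range(len(tokens))]


features = title_to_features("Samsung Galaxy S9 64GB", "electronics")
print(features[0]["category_is_first"])  # electronics|True
```

Paired with per-token labels (for example, brand versus other), sequences like these could be used to train a CRF, or the same positional signals could be learned implicitly by a Bi-Directional LSTM.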

6. Inference from the Model Field


These are just some of the techniques we’ve used to tackle brand extraction and normalization. We still have a fair bit of room for improvement as far as solving this problem goes, and it continues to be one of the hardest problems we’ve tried to tackle at Semantics3.

Personally, my key learning in dealing with this problem has been that knowing and applying the latest and greatest algorithms isn’t always the most potent skill that a data scientist can bring to the table. In each of these cases, thinking deeply about the problem and having the right intuition helped us go further than playing around with fancy algorithms did. Most of these approaches were the results of interesting conversations followed by quick sessions of coding and experimentation … with many a failed dataset and experiment born of superficial insights in between.

If you have any thoughts on other methods we could try, do get in touch via govind [at] semantics3 [dot] com.


This article was originally published on the Semantics3 Blog

The Ecommerce Intelligencer

A look at how data is shaping the future of e-commerce, gleaned from our stockpile of Ecommerce product, pricing and customer metadata. Also see www.semantics3.com/blog

Written by Govind Chandrasekhar
Co-founder @ semantics3.com; govindc.com
