Large Scale Instances Classification in the Wild

Morgan Elfvin
PriceRunner Tech & Data
4 min read · Jun 18, 2021

At PriceRunner we are using Technology and Data to solve business-critical problems. This is one example of how we leverage Machine Learning to support price comparison across millions of products.

Introduction

At the heart of PriceRunner lies shopping and price comparison, and at our sites you can compare prices of millions of products across 3 countries.

For this to be possible, we need to match incoming offers, i.e., product metadata and prices, from merchants to our internal representations of a product. This has historically been done in 3 different ways:

  • Using external identifiers (e.g., EAN and ISBN-13)
  • Manually
  • Using rules

The first is mostly out of our control (we aren’t always supplied with these identifiers), and the other two do not scale very well. They have, however, accumulated a significant amount of data, which has paved the way for some Machine Learning, with the goal of developing a scalable system to supplement these matching sources.

So, onto that journey we embarked and what a journey it is!

The problem

The matching problem is challenging, yet interesting from many different angles. The label space is very large, with more than 3 million labels. It is also highly dynamic, with products constantly appearing and disappearing. To complicate things even further, the problem is open set, i.e., we don’t have a product for every offering. All of these challenges combined make it infeasible to treat this as an ordinary multi-class problem. We thus opted for a metric learning approach, more specifically a neural network trained with triplet loss.

The objective of triplet loss is to train the network to produce similar embeddings for instances (offers) sharing the same label (positives) while maximizing the distance to those with different labels (negatives). More formally:

$$\mathcal{L}_{\mathrm{tri}} = \sum_{\substack{(a,\,p,\,n)\\ y_a = y_p \neq y_n}} \big[\, m + D(a, p) - D(a, n) \,\big]_+$$

where $a$ is the anchor offer, $p$ a positive (an offer with the same label), $n$ a negative (an offer with a different label), $D$ the distance between their embeddings, $m$ the margin, and $[\cdot]_+ = \max(\cdot, 0)$.

Source: In Defense of the Triplet Loss for Person Re-Identification

Instead of the original triplet loss formulation, we employ the “batch hard” sampling technique to squeeze a bit more out of our batches. This means that we sample P products and K offers for each of these products. Then, for every sample in the batch, the hardest positive and hardest negative form its triplet. The loss function then looks like this:

$$\mathcal{L}_{\mathrm{BH}} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[\, m + \max_{p=1,\dots,K} D\big(x_a^i, x_p^i\big) - \min_{\substack{j \neq i\\ n=1,\dots,K}} D\big(x_a^i, x_n^j\big) \,\Big]_+$$

where $x_a^i$ is the embedding of the $a$-th offer of the $i$-th product in the batch; the max term is the hardest positive and the min term the hardest negative for each anchor.

Source: In Defense of the Triplet Loss for Person Re-Identification
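
To make the mining step concrete, here is a minimal PyTorch sketch of batch-hard triplet loss. It is an illustration of the technique rather than our production code, and the margin value is just a placeholder:

    import torch

    def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
        # embeddings: (N, D) float tensor for a batch of N = P * K offers.
        # labels: (N,) tensor of product ids.
        dists = torch.cdist(embeddings, embeddings, p=2)       # (N, N) pairwise distances
        same = labels.unsqueeze(0) == labels.unsqueeze(1)      # (N, N) bool, same product?

        # Hardest positive: farthest offer sharing the anchor's product id.
        hardest_pos = dists.masked_fill(~same, float('-inf')).max(dim=1).values

        # Hardest negative: closest offer with a different product id.
        hardest_neg = dists.masked_fill(same, float('inf')).min(dim=1).values

        # Hinge on the margin, averaged over all anchors in the batch.
        return torch.relu(margin + hardest_pos - hardest_neg).mean()

Note that the batch composition matters here: each of the P products needs K > 1 offers for positive mining to be meaningful, and the batch must contain at least two products so that every anchor has a negative.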

Indexing

Metric learning doesn’t come without its own challenges. With the model decoupled from the classification, all we have are our similarity embeddings. For the embeddings to be useful, we must compare them to something, meaning we have to calculate the distance from an offer’s embedding to every one of our products.

We first tried to do this in a brute-force manner, but the search times turned out to far exceed what was reasonable, and we had to look elsewhere for solutions. Luckily for us, there has been a lot of research into approximate nearest neighbour search.
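
For a sense of what we moved away from: exact search boils down to a full scan per offer. A minimal NumPy sketch (illustrative only, not our implementation):

    import numpy as np

    def brute_force_knn(offer_emb, product_embs, k=5):
        # offer_emb: (D,) array; product_embs: (num_products, D) array.
        # Exact search costs O(num_products * D) per offer, which is what
        # becomes unworkable at millions of products.
        dists = np.linalg.norm(product_embs - offer_emb, axis=1)
        top = np.argpartition(dists, k)[:k]     # k smallest, unordered
        return top[np.argsort(dists[top])]      # product indices, closest first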

After some scouting and looking at benchmarks, we decided to use HNSW (Hierarchical Navigable Small World graphs): it has relatively strong performance while supporting some utilities that we value, like incremental index building.
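
As one illustration of what that looks like in practice (hnswlib is one popular HNSW implementation; our actual library and parameter choices may differ, and all sizes below are placeholders), incremental building works like this:

    import hnswlib
    import numpy as np

    dim = 128                                   # illustrative embedding size
    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=3_000_000, ef_construction=200, M=16)

    # Incremental building: embeddings can be added as new products appear.
    embs = np.random.rand(10_000, dim).astype(np.float32)
    ids = np.arange(10_000)
    index.add_items(embs, ids)

    # Query time: approximate nearest products for a batch of offer embeddings.
    index.set_ef(50)                            # higher ef = better recall, slower queries
    offers = np.random.rand(5, dim).astype(np.float32)
    labels, distances = index.knn_query(offers, k=10)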

When building our index, we also add known offers to it, alongside our products, to increase the classification accuracy of the system. To prevent the index from being populated with embeddings of low-quality names, we first filter them based on Shannon entropy, where the probabilities are derived from the character frequencies in the string. This helps ensure that the index contains longer and more diverse examples.
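
A minimal sketch of such a filter, computing entropy from character frequencies as described; the threshold below is a made-up example rather than our production value:

    import math
    from collections import Counter

    def shannon_entropy(name: str) -> float:
        # Entropy in bits, with probabilities taken from the character
        # frequencies within the string itself.
        if not name:
            return 0.0
        counts = Counter(name)
        n = len(name)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def keep_for_index(name: str, threshold: float = 3.0) -> bool:
        # Short or repetitive names ("aaa", "tv") score low and are dropped;
        # longer, more diverse product names score higher and are kept.
        return shannon_entropy(name) >= threshold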

A first iteration of this system is live, handling roughly 200 million offers every day and helping keep our sites populated with prices. Its share of matches has been steadily increasing, largely thanks to small, incremental improvements to the system, and it currently accounts for roughly 13% of all matches.

We have some exciting stuff cooking for the next iteration which we believe will take it to the next level! Stay tuned for more…
