DD engine sorting through the duplicates (copyright Warner Bros)

Eliminating duplicates in an ever-growing content nightmare

Adil Houdhoud · Published in Avito · Dec 18, 2017 · 12 min read


Imagine the following: you’re a classifieds website, and you rely on content created by users to deliver a quality product to the general public with minimal intervention on your side, because in the end you’re just a medium between the buyer and the seller. Now, how would you go about delivering the best user experience while relying entirely on the goodwill of the publishers? That’s a question we ask ourselves every day of the week (the working days, at least), and it is the core of our business as stated by the big boss Rolv Erik Ryssdal:

“Empowering people in their daily life”.

One of the main inconveniences for our users is the irrelevance of ads: having to go through countless pages of the same repeated ads just to find what you’re actually looking for. Sometimes you even give up when the ideal deal is just one page away, much like when you’re googling something and find yourself reaching the dark and moist realm of what lies beyond the second page of Google search results.

So what do we do? How do we solve this issue with the limited resources we have?

And then it dawned on us

We sat down for about 10 minutes, and the following conversation took place:

  • Product manager: I keep stumbling upon the same ads again and again, maybe we should do something about it; users are complaining about the redundancy of certain ads.
  • Engineer: Indeed, I was looking the other day for a Raspberry Pi and there was a flood of ads for desktop computers.
  • Product manager: …
  • Engineer: I think we should do something about this…
  • Product manager: What if we block people from reinserting the same ads?
  • Engineer: How?
  • Product manager: I don’t want your job
  • Engineer: I’m going to compare each newly inserted ad with the rest of the user’s ads and block it if it’s a duplicate.
  • Product manager: Sounds good to me, how much time do you need?
  • Engineer: Two months
  • Product manager: One month seems logical
  • Engineer: How?
  • Product manager: Okay, one month it is
The okayest okay

And then it started: the first version of the duplicate detection project.

Vladimir I. Levenshtein

Dr. Vladimir Levenshtein is considered the father of coding theory in Russia, and his contributions are present in many aspects of our everyday lives. One of his best-known creations is the foundation of most of today’s spell-checking algorithms: the Levenshtein distance. And that was the key to solving our problem.

What a great guy!

How will a spell checker help detect duplicates?

Levenshtein distance is a way to measure the similarity between two strings. It works by counting the changes (single-character insertions, deletions and substitutions) necessary to go from a string (A) to a string (B): the greater the distance, the more different the texts are.

Example:

We have two strings, the initial text (I) and the final text (F); we’ll refer to the Levenshtein distance function as dt.

If I = “adil” and F = “adil”, then dt(I, F) = 0, because the transformation from I to F requires 0 steps.

If I = “adil” and F = “adim”, then dt(I, F) = 1, because the transformation from I to F requires 1 step (l => m).

So the greater the distance, the more different the texts are. And this is what we relied on first to detect when and how duplicate ads are inserted.
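To make the idea concrete, here is a minimal, generic sketch of the classic dynamic-programming computation of the distance (a textbook implementation for illustration, not our production code):

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-character insertions,
// deletions and substitutions needed to transform a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	// prev holds the distances computed for the previous row of the DP table.
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(levenshtein("adil", "adil")) // dt(I, F) = 0
	fmt.Println(levenshtein("adil", "adim")) // dt(I, F) = 1
}
```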

That’s a great idea, let’s ruin it

The following diagram illustrates how the first iteration of the duplicate detection project was implemented.

The short explanation

Looks pretty basic! But the devil is in the details: the big box that blatantly states that a check is performed is where the magic happens (or should have happened, at least), and it’s in there that the Levenshtein distance is put to use to solve our duplicate issue.

The long explanation

First, an ad is submitted to the ad-insert process and the necessary checks are performed (checking for SQL injections, lol!). If an ad is deemed “legit”, it is submitted to the “duplicate detector” process, in which the ad is run against the database to check whether the user is trying to republish that same ad about that damn hideous motorcycle that is way too expensive… anyway… the check is done on the database side using a stored procedure that returns a “score”, and that score is used to deem the ad “duplicate” or “unique”. Here is a heavily edited version of the stored procedure that handled it:
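(Since the original gist is not reproduced here, the block below is only an illustrative sketch of the general shape, assuming Postgres’s fuzzystrmatch extension, which provides levenshtein(), and a hypothetical ads table; table names, columns and the scoring formula are made up.)

```sql
-- Illustrative sketch only: score a new ad against the user's existing ads.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

CREATE OR REPLACE FUNCTION duplicate_score(p_user_id integer,
                                           p_subject text,
                                           p_body    text)
RETURNS integer AS $$
DECLARE
    best_score integer := 0;
    candidate  RECORD;
    score      integer;
BEGIN
    FOR candidate IN
        SELECT subject, body FROM ads WHERE user_id = p_user_id
    LOOP
        -- Turn the edit distances into a rough similarity score:
        -- the smaller the distance, the higher the score.
        -- Note: fuzzystrmatch's levenshtein() only accepts inputs of up to
        -- 255 characters, hence the truncation of the body.
        score := 100
                 - levenshtein(lower(p_subject), lower(candidate.subject))
                 - levenshtein(lower(left(p_body, 255)), lower(left(candidate.body, 255)));
        IF score > best_score THEN
            best_score := score;
        END IF;
    END LOOP;
    RETURN best_score;  -- the caller compares this against a threshold
END;
$$ LANGUAGE plpgsql;
```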

Lovely, isn’t it? It was a good solution. I will spare you the details of how it was implemented within the huge system and jump right into what went wrong.

(N° of users in a group × N° of ads per user × N° of ad parameters) = timeout

So basically, to get a similarity score, the algorithm must compare the new data with the user’s existing data. However, there’s a catch…

Back in mid-2014 there was a recurring issue regarding the uniqueness of users. Because of the rise of the telecom bubble in Morocco, getting a new phone number is cheaper than a cheeseburger, and since Avito.ma imposes some restrictions on the number of ads that can be posted in certain categories, some users resorted to creating multiple accounts with multiple phone numbers to post the maximum number of ads (users were limited by phone number or by email). So a user with 3 phone numbers can create 3 throwaway accounts and have 3 times the number of allowed ads.

The solution we came up with was to add a new concept called “user groups”. It is based on an algorithm that uses “certain” parameters to determine whether or not you are the same user behind different accounts, and places your account in the corresponding group. We let the groups grow on their own with no moderation or human intervention, and it proved to be quite handy.

Fast forward to the implementation of the first duplicate detection project. Things ran smoothly and we reveled in watching the number of duplicates drop; we even built new, daring projects on top of the duplicate trap. Little did we know that a great danger was lurking in the shadows of the database.

Avito.ma is one of the fastest growing websites in Morocco, which means that whatever code we wrote in 2014 would probably have to be optimized again in 2015, 2016, etc., and this is the trap we fell right into. The user groups logic yielded some great results, but it also led to one of the most dangerous situations we have ever encountered as a tech team: CONSTANT DATABASE TIMEOUTS.

It started in early 2016. We noticed that database queries were taking a bit more time than usual; we snooped around and realized it was the duplicate check that was taking long, and after some more snooping we figured out something really scary: the grouping algorithm had gone WILD. Instead of grouping accounts, it started grouping groups of accounts, so instead of adding one single account to a group, it added all the accounts of an existing group to a new group. We ended up with groups of up to 500 users, some totaling more than 30,000 ads per group.

Knowing that the duplicate detection system compares all ad data with the stored data, and assuming that an ad has about 6 parameters (subject, body, price, address, etc.), we could have something like (30,000 × 6) comparison operations when inserting a single ad. Multiplying that by the insane number of ads posted daily, we end up with a crazy number of operations per day just to ensure the uniqueness of ads. Something had to be done… quickly.

The incredible super mega engineering task force

The CTO called for a task force to be assembled the day after the database gave in and decided to time out all requests. We were literally removed from our desks and put on a different floor to fulfill one task: GET THINGS WORKING AGAIN. Initially we were three: one super mega devops (AKA the real Wonder Woman), one insane backend developer (actually insane) and me. We started contemplating different solutions to the multiple issues we were having… How could we propose one magic solution that would get us out of this mess? If only there were a system where we could perform rapid comparison operations and where all the necessary data could be stored and retrieved immediately… something completely detached from the database…

Photo taken the day of the assembly

Elasticsearch

According to the official website:

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.

We studied the idea of implementing Elasticsearch as a duplicate detection system for about two weeks: quick POCs, boring talks, bitter realizations and fruitless meetings… and then we decided to go all in with this suggestion.

The project itself was divided into multiple sub-projects, with each of us leading a specific aspect of the implementation:

  • Setting up the Elasticsearch cluster and all infra/devops operations (Dounia Alla)
  • Indexing data in Elasticsearch (Adil Houdhoud)
  • Using Elasticsearch as a duplicate detection engine (Hamza Aid)

So basically, my task was to ensure that all the needed data was indexed properly in ES without affecting database performance.

Indexing data

The first step of indexing the data was to determine which data should be indexed. Of course, for the sake of duplicate detection we could have indexed just the parameters needed for the check, but that would have been a waste of the potential of this project. So while the project was sold to management and other stakeholders as a duplicate detection engine, we were secretly working on indexing EVERYTHING we have. For this purpose we needed a good indexing strategy.

The different players

Here’s an overview of the deployed solution:

Database

A PostgreSQL database, nothing special. Though we had a secret weapon that facilitated the indexing process: we keep track of all actions performed on ads in separate tables, and each ad change has its own unique id, timestamp, etc., which means we could easily fully index the ads and update (reindex) changed ads (edits, deletes, deactivations, etc.) in Elasticsearch.

Notify

Postgres provides a cool feature which allows sending notifications when certain events are triggered. From the official NOTIFY documentation:

The NOTIFY command sends a notification event together with an optional “payload” string to each client application that has previously executed LISTEN channel for the specified channel name in the current database.

We set up some triggers to track all changes to the data and created a function that gets called every time an action is performed on an ad.

This is an example of an insert
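The original snippet is not reproduced here; the following is a minimal sketch of what such a trigger function can look like, built from the elements listed after the code (the ads table and the Index_tracking columns are assumptions):

```sql
-- Sketch of a trigger function that records the change and notifies
-- the indexing daemon; table and column names are illustrative.
CREATE OR REPLACE FUNCTION notify_ad_change() RETURNS trigger AS $$
BEGIN
    -- Keep a trace of the indexing action.
    INSERT INTO index_tracking (ad_id, table_name, operation, created_at)
    VALUES (NEW.id, TG_TABLE_NAME, TG_OP, now());

    -- Wake up the indexing daemon listening on this channel,
    -- passing the ad id as the notification payload.
    PERFORM pg_notify('Notification_channel', NEW.id::text);

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ads_insert_notify
AFTER INSERT ON ads
FOR EACH ROW EXECUTE PROCEDURE notify_ad_change();
```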

  • Index_tracking: a table that we use to track all indexing actions.
  • NEW.id: the id of the newly inserted ad.
  • ‘Notification_channel’: the name of the channel that the indexing daemon is listening to, used by the pg_notify function.
  • TG_TABLE_NAME: the name of the table that triggered the event.
  • TG_OP: the name of the operation performed (INSERT, UPDATE, etc.).

Indexing Daemon

This is a service dedicated to indexing (and reindexing) data. It relies mainly on the NOTIFY feature of Postgres: it listens on the ‘Notification_channel’ we set in the trigger.

When “notified”, it retrieves the ad’s data from the database using the id sent in the notification, then it sends the new data to Elasticsearch for indexing.

It works in two modes: batch and realtime. In batch mode it gathers the ids sent in the notifications for a specific period of time (usually 10 to 20 seconds) before selecting the data from the DB to be indexed. Realtime mode, on the other hand, is… realtime.
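As an illustration of the realtime path, here is a stripped-down sketch of such a listener in Go using github.com/lib/pq’s LISTEN support; the connection string is a placeholder and the actual fetch-and-index step is omitted:

```go
package main

import (
	"log"
	"time"

	"github.com/lib/pq"
)

func main() {
	conninfo := "postgres://user:pass@localhost/classifieds?sslmode=disable" // placeholder

	// Callback used by the listener to report connection problems.
	reportProblem := func(ev pq.ListenerEventType, err error) {
		if err != nil {
			log.Println("listener error:", err)
		}
	}

	// Listen on the same channel used by pg_notify() in the trigger.
	listener := pq.NewListener(conninfo, 10*time.Second, time.Minute, reportProblem)
	if err := listener.Listen("Notification_channel"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case n := <-listener.Notify:
			if n == nil {
				continue // the connection was re-established
			}
			// The payload is the ad id sent by the trigger.
			indexAd(n.Extra)
		case <-time.After(90 * time.Second):
			// Periodically make sure the connection is still alive.
			go listener.Ping()
		}
	}
}

// indexAd would fetch the ad from the database and push it to Elasticsearch;
// both steps are left out of this sketch.
func indexAd(adID string) {
	log.Println("reindexing ad", adID)
}
```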

Elasticsearch cluster

We’re using Amazon’s Elasticsearch Service. It is reliable and spares us a lot of hassle in maintaining and configuring the cluster.

Duplicate detection microservice

Nicknamed OWL, this is a REST microservice written in Go (by Hamza Aid) that handles the qualification of duplicates in two phases: first it gets possibly similar ads for a given ad from Elasticsearch using Lucene’s moreLikeThis feature, then it runs the results against the original ad’s info using a Levenshtein distance function.
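As a rough sketch of the first phase, the candidate lookup can be expressed with Elasticsearch’s more_like_this query. The index name, field names and host below are assumptions, and the exact query parameters vary between Elasticsearch versions:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

// similarAdCandidates asks Elasticsearch for ads that "look like" the newly
// inserted one (phase 1). Phase 2, not shown here, re-checks each hit against
// the original ad with a Levenshtein / Jaro-Winkler comparison to produce the
// final score.
func similarAdCandidates(subject, body string) ([]byte, error) {
	query := fmt.Sprintf(`{
	  "query": {
	    "more_like_this": {
	      "fields": ["subject", "body"],
	      "like": %q,
	      "min_term_freq": 1,
	      "min_doc_freq": 1
	    }
	  }
	}`, subject+" "+body)

	resp, err := http.Post("http://localhost:9200/ads/_search",
		"application/json", bytes.NewBufferString(query))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	hits, err := similarAdCandidates("Raspberry Pi 3", "barely used, with case and charger")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(hits))
}
```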

We also experimented with some Jaro-Winkler-based algorithms (props to Soufiane Mghanen), but at the moment Levenshtein is our favorite.

Here is a snippet of the Jaro-Winkler implementation we’re experimenting with; more at https://github.com/xrash/smetrics:
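(The original gist is not reproduced here; the block below is just a minimal usage sketch of the library’s JaroWinkler function, using the boost threshold of 0.7 and prefix size of 4 from the library’s examples, on made-up strings.)

```go
package main

import (
	"fmt"

	"github.com/xrash/smetrics"
)

func main() {
	a := "raspberry pi 3 model b, barely used"
	b := "Raspberry PI 3 model B barely used!!"

	// JaroWinkler returns a similarity in [0, 1]; 1 means identical strings.
	score := smetrics.JaroWinkler(a, b, 0.7, 4)
	fmt.Printf("Jaro-Winkler similarity: %.3f\n", score)

	// The same package also exposes a Levenshtein-style edit distance
	// (WagnerFischer) with configurable insertion/deletion/substitution costs.
	fmt.Println("edit distance:", smetrics.WagnerFischer(a, b, 1, 1, 2))
}
```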

And then??

Here we have it: some fancy algorithms along with indexed data in Elasticsearch. How is this system put to use? And how does it fit within our existing workflow?

The CTO and the product managers cheering for us

Frontend? Backend?

This duplicate detection, or DD for short (too late for that now), is implemented in multiple places within our platform(s). I’ll highlight the implementation in two strategic places:

Frontend detection

The front line of defense. We call it internally “frontend detection”, but in reality it all happens in the backend. It is triggered when an ad is inserted: right after the user clicks the “submit” button, a query is sent to the duplicate detection microservice (nicknamed OWL) to get a score. Depending on this score, we either accept the ad or reject it and show the user that they already have a similar ad lurking somewhere (we even point them to the original ad) and that they can either deactivate the old ad or renew it.
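For illustration only, the call from the ad-insert flow to OWL could look something like the sketch below; the endpoint, payload, response shape and threshold are all hypothetical, not OWL’s actual API:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Hypothetical request/response shapes for the duplicate check.
type dupCheckRequest struct {
	UserID  int    `json:"user_id"`
	Subject string `json:"subject"`
	Body    string `json:"body"`
}

type dupCheckResponse struct {
	Score        float64 `json:"score"`
	OriginalAdID int     `json:"original_ad_id"`
}

func main() {
	payload, _ := json.Marshal(dupCheckRequest{
		UserID: 42, Subject: "Raspberry Pi 3", Body: "Barely used, with case and charger",
	})

	// Hypothetical internal endpoint.
	resp, err := http.Post("http://owl.internal/check", "application/json",
		bytes.NewReader(payload))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var result dupCheckResponse
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatal(err)
	}

	// Example threshold: above it we reject the ad and point the user
	// to the original one instead of publishing a duplicate.
	if result.Score > 0.8 {
		fmt.Println("looks like a duplicate of ad", result.OriginalAdID, "- rejected")
	} else {
		fmt.Println("ad accepted")
	}
}
```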

The duplicates queue

All ads are moderated, some automatically and some manually. For this reason we have set up a duplicate detection module within the moderation system. The DD module crawls the inserted ads for duplicates and flags each one with a specific… flag. There are two flags:

  • Black flag: this ad is definitely a duplicate; it is rejected automatically.
  • Potential duplicate flag: this ad might be a duplicate; it is moderated manually, and it’s up to the moderator (a human being) to judge.
  • An ad that is not a duplicate is simply not flagged.

Leap of faith

The entire project was a leap of faith: we didn’t have much time to contemplate all the possible solutions, and we put all the faith we had into this project. And boy oh boy, did it pay off.

Database impact

We went from constant 100% CPU usage and non-stop timeouts to an average of less than 30% at peak hours. The impact was instant; the drop was noticed within minutes… and virtual champagne was popped.

Grape juice

Impact on the content

The amount of complaints we received after the deployment of the DD project was just… amazing to watch (not from the customer support perspective, of course). The users who were accustomed to spamming the website with duplicates were now furious about the “new” change. The CS team and the content product manager did a great job raising awareness about the importance of having clean content on the website and how beneficial it is for both buyers and sellers.

Here are some aspects regarding the impact of the DD project on content:

  • A 70% drop in duplicates within a one-year period
  • Improved manual moderation time: time previously spent moderating duplicates is now spent improving the quality of the content
  • A slight drop in content volume but a huge improvement in content quality

And that’s it!

Product managers rejoicing after impact study was published

All in all, the project was ACE! It positively impacted content, moderation, QA, revenue and many other aspects of our daily lives at Avito.ma. It solidified our position as the leader of classifieds websites in Morocco (by far) in both content and quality, and it also gave the tech/product team a window to work on something new and experiment with cutting-edge technology.

Adil Houdhoud
Software engineer and team leader at Avito.ma (a Schibsted Classified Media company)