How we built a reversible recommendation system using ElasticSearch
For a french version of this article, click here.
Recommendation engines are everywhere. There are a lot of complicated techniques to recommend content, but sometimes, what you need is just a recommendation system to propose specific content to your users using simple rules.
A bit of context
We are building content related features where we need to be able to explain everything that is displayed to a specific user. For exemple, if you go to your home page, and an article called My thoughts on the last season of Game of Thrones is displayed, we need to be able to explain why.
When your recommendation algorithm is basically ordering content based on publication date, this is pretty simple to do. But when it uses user information and stuff like this, that’s when it becomes harder to explain results.
How it works
The algorithm is pretty simple. We define a list of criteria that a content could match, and then we order them by priority. Each criterion must be a boolean.
Let’s say we are recommending articles. In order of priority:
- the user has not read the article
- at least an article tag matches the user tags
- the article is flagged important
Note: This is just an example, and you can define a way bigger list, as long as you order them. The order is purely arbitrary for example purposes.
Now we want to assign a score to each criterion. The scores we will use are powers of 2. We will start assigning from bottom to top.
- the user has not read the article: 2³ = 8
- at least an article tag matches the user tags: 2² = 4
- the article is flagged important: 2¹ = 2
Let’s consider a user called John.
John has not read My thoughts on the last season of Game of Thrones, but the article tags do not match any of John’s tags. Finally, the article is flagged important by the content team.
Our algorithm will sum the score of each criterion that matches to give a final score to the article for the calculation of John’s home page: 8+ 2= 10.
Repeating this process for each article, we can rank articles based on their scores. The highest articles will be displayed on John’s home page.
If you are wondering why we use powers of 2, we will come to that later in the article.
Let’s say I have the following documents in my index:
ElasticSearch gives us a wonderful tool called function score query. It allows us to modify the score of the documents retrieved by a query, based on filter queries.
For example, I can do a first query that will retrieve all the articles, and then use a filter query on those results to change the score of the articles that John has not read. Let’s say that John has read the article with the ID articleId2.
Note: If your query filters documents instead of a match_all, you can use constant_score so that your query score does not alter the final score.
Now we can just add the other functions for the tags and the important flag and we are all set !
Here is the result returned by ElasticSearch:
As you can see, ElasticSearch returns the score of each documents in its query, which is perfect for what is next: Reversing ElasticSearch score. So let’s answer this question:
Why the hell are we using powers of 2 ?
We did not choose to use powers of 2 just because they are cool (but they are). We chose them because they are ideal to store boolean information. 0 stands for false, 1 stands for true. Storing information with powers of 2 is called a bit field (or a bit flag).
Additionally, this technique will produce a one-to-one relation between the input (boolean information) and the output (flag value). It means that we are able to re-create the input data from the output, which is also known as reversible computing.
Note: We are skipping the first byte, because the default score when ElasticSearch matches nothing is 1, so the first power of two: 2⁰ = 1 is already taken.
This is how it works for our article (the red one is skipped):
So … if I convert my score (10) to its binary value (1010), then I can see which boolean information were set to true ?
Yes. But there is a better trick. You can use bitwise operator to extract the information. More specifically the bitwise-AND. What it does is basically what its name means. It applies a AND operation between every bits of two given number. For example 7 & 6 = 6:
Now if we only use one specific bit with the bitwise-AND, we can determine if the flag contains this information because all the other bits will evaluate to 0.
Here we only extract one single information. Because 10 & 2 ≠ 0, the article is tagged important.
Because 10 & 4 = 0, the user tags does not match the article tags.
Using this technique we can easily write a script that would decode a given score, which is exactly what we wanted !
- By using powers of 2 to score documents, we can represent a one-to-one relationship between boolean information and a score.
- We can use this score to rank documents in order to give John the articles that he wants to read.
- We can reverse the score into the boolean information in order to explain John’s home result if we ever need to.
- To implement this technique, we use ElasticSearch function score queries.
- John has interesting recommendation. John is happy.