Magical Spells that guide you through the Roposo graph
(How DB and Cache collaborate for collaborative filtering)
By late 2015, Roposo was teeming with great content on the many faces of fashion in India. We now wanted to serve only the most relevant posts to our users right from the first time they opened Roposo and for that we needed to know their interests.
We were looking for the sweet spot in the wide spectrum of interests of a user. What better than to identify the people whose style she might like. In late 2015, we decided to revamp user on-boarding to get the seeds to derive this very crucial information.
So, our problem could now be defined like this — when a user joins Roposo, find which users’ style she might like the most.
Our design and product team decided to help us by providing data in the form of initial likes of the user. (Should I write about the bias that this leads to?)
The system that we had to now build, reminds me of the dilemmas of the Hogwarts Sorting Hat quadrupled.
Our team decided to build the Roposo Sorting Hat using collaborative filtering.
Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person Parvati has the same opinion as a person Padma on an issue, Parvati is more likely to have Padma’s opinion on a different issue than that of a randomly chosen person.
The above mentioned algorithm works for a bidirectional graph, for example, Facebook and LinkedIn, where two people are connected when both are interested in each other. For unidirectional graphs like Roposo and Twitter where a person Hermione showing interest in a person Voldemort does not imply that Voldemort would be interested in Hermione. To find the interests in such graphs, we have to go a level deeper.
Let’s look at that with an example. Say Hermione joined Roposo(One can dream!).
1. We get all the people she immediately followed while on-boarding, and the people whose posts she liked. Let this set be A. (It contains Harry, obviously!)
2. Get all the users that follow the users in set A. Let this set be B. (It contains Ginny, well…!)
3. Get the users (other than A) that the users in set B follow. Let’s call this set C. (It contains Ron, because family!)
C is the set of users that we think Hermione would most probably like. Since the set of C is likely to be huge, we sort the c’s by the number of b’s who follow them.
I hope it is clear by now that collaborative filtering is simply traversing a graph — difficult to do in a relational database. Fortunately, Roposo’s main database is a huge graph (Neo4j)!
We were so excited about finally rocking the Neo4j heavy machinery for what it is meant to do the best — complicated queries easily represented in a graph.
Let’s look at all the experiments and trials our team did to be able to do this complex process on a huge data in <200ms.
We started with a complete 3-step query in Neo4j hoping that it will be able to get the result in a matter of seconds if not milliseconds, considering that the query itself is a form of graph traversal.
With less than 1/3rd the data that we have today, it took Neo4j 30 seconds to return the results.
We wanted the results of collaborative filtering to help create the first feed of a newly on-boarded user. That meant the results had to be calculated immediately after the first few likes and follows of the user. We cannot expect our user to wait half a minute on a screen to get the results.
Clearly we needed to do better and it was time for our engineers to cast some optimisation spells.
Our best guess was that the 3-step cypher query would end up loading a lot of data in Neo4j and hence the slow response. Doing this at scale also did not seem feasible. So we decided to implement this ourselves.
We queried for each step separately and did all the calculation in-memory. We found that this brought down the time to ~6 seconds.
But wait! There is more.
The reason that previous approach worked was that the followers of most of the users were found in cache and we didn’t have to access Neo4j.(We cache the followers of all our recent and recently accessed users.) On an average that came to about 90% data in cache (due to other processes warming up the cache for us). So we decided to see if we could do without the impeding 10%.
We found that not accessing the DB at all and solely relying on the cache didn’t have a significant difference on the result. And we ended up speeding up the process to ~ 2 seconds.
Querying Neo4j database for the result set C (people followed by followers of the people Hermione is following) was the culprit time-churner now.
Remember that the size of the final set C is likely to be huge. Well, 99.999% of the time it is HUGE.
We sort this set by the number of follows from the users in set B and select the top 500–1000. From this, a final list of 50 is selected on the basis of their recent activity and engagement. Selecting the final result can be compared to taking the tip of the iceberg after creating the iceberg ice-cube by ice-cube from the ice-cubes in your fridge.
We cache the set of all our recently active users in Redis. The final 50 are most likely going to be a part of this set because of our final selection on the basis of recent activity. The intersection of the recently active users and the HUGE set C (let’s call this set C’) could provide us with a manageable set to do our computation on.
This, however, would not remove the step of finding all the users that are followed by the users in set B.
Like I mentioned earlier, we cache the followers of all our recent users. That means we already had in our cache the set of all the people who follow recently active users. Let’s call this set B’. If we can find the intersection of this set and the set B which we earlier found, we will get a new set B”. The followers of B” are a subset of the intersection of the recently active users and the set of the followers of B.(I hope the diagram below makes it clear)
We now had another path to reach a subset of C — a path that did not need a Neo4j query. Here’s how we implemented this :
- Find the followers of the recently active users — our set B’. Then for each of these followers, we stored the users they were following from the recently active users set.
- We find the intersection of our set B’ with the set B — our set B”.
- We can now use the reverse mapping we stored in step 1 to easily find the users who the set B” followed — the coveted C”.
And with this the entire process now finally takes ~200 ms.
Yay Team Roposo Backend!
Obviously all of this data has to be maintained in cache or in-memory. But we found that the benefits outweighs the cost (and effort) be a huge margin.
The astute among you might be wondering how we managed to get to 200 ms while fetching that huge amount of data from Redis. That’s food for another post. So stay tuned!
(Written in collaboration with Rahul Singhal — one of the masterminds behind this spell.)