Design a news feed system


http://i.stack.imgur.com/KPvit.gif

First thing, the candidate should do is ask questions to the interviewer:

  1. Do you mean like facebook, quora news feed?
  2. So, it will fetch all feeds or lets say last 100 feeds?
  3. Will the feed be sorted by created date or some important feeds will be given preferences?
  4. And many more questions..

The interviewer will not expect you to design a facebook like feed which must would taken months in just 45 minutes. So the idea is ‘Keep it simple’

So lets say, we want to design a news feed system which give some weight-age to each feed.

So here, I will discuss Facebook Edge Rank Algo :

An Edge is basically everything that “happens” in Facebook. Examples of Edges would be status updates, comments, likes, and shares. There are many more Edges than the examples above—any action that happens within Facebook is an Edge.

EdgeRank ranks Edges in the News Feed. EdgeRank looks at all of the Edges that are connected to the User, then ranks each Edge based on importance to the User. Objects with the highest EdgeRank will typically go to the top of the News Feed.

http://www.whatisedgerank.com/images/er_algorithm_graphic.png

Affinity is a one-way relationship between a User and an Edge. It could be understood as how close of a “relationship” a Brand and a Fan may have. Affinity is built by repeat interactions with a Brand’s Edges. Actions such as Commenting, Liking, Sharing, Clicking, and even Messaging can influence a User’s Affinity. Affinity can be based on number of the interactions with your friend, weight of each interaction and time since your last interaction.

Weight is a value system created by Facebook to increase/decrease the value of certain actions within Facebook. Commenting is more involved and therefore deemed more valuable than a Like. In the weighting system, Comments would have a higher value than a Like. In this system all Edges are assigned a value chosen by Facebook. As a general rule, it’s best to assume Edges that take the most time to accomplish tend to weigh more.

Time Decay refers to how long the Edge has been alive; the older it is the less valuable it is. Time Decay is the easiest of the variables to understand. Mathematically it is understood as 1/(Time Since Action). As an Edge ages, it loses value. This helps keep the News Feed fresh with interesting new content, as opposed to lingering old content.

So now we know the importance of each feed.

Next is storing this activity.

Ask your interviewer about the use cases of the application.

Our schema would be like this:

id
user_id // user who created the activity
data //may be json object with metadata
edge_rank //calculated using above algo
activity_type // whether its a photo, status, video
source_id //the record that the activity is related to.
created_date //timestamp

We can index on (user_id, created_date) and then perform basic queries.

Publishing the feed

  1. “Push” Model/Fan-out-on-write

As the activity is created publish it to all the recipients. One way is maintain a pub sub model. In which the recepients will be subscribed to the mesaaging queue. Once the activity is generated , its pushed into the queue and the subscribers gets a notification.

2. “Pull” Model/or Fan-out-on-load

This method involves keeping all recent activity data in memory and pulling in (or fanning out) that data at the time a user loads their home page.

Constraints

  1. The push model will start breaking if we have to push the messages to a large audience. Lets say Sachin Tendulkar will have a lot of fans. If this fails there will be a backlog of feeds. An alternative strategy is using different priorities for the fan-out tasks. You simply mark fan-outs to active users as high priority and fan-outs to inactive users as low priority.
  2. The downside of pull model is that the failure scenario is more catastrophic — instead of just delaying updates, you may potentially fail to generate a user’s feed. One workaround is to have a fallback method , e.g get the updates of 10 friends only.

You must be thinking why we chose denormalized activity rather than normalized. So it will also depend on the backend that you are using. The feed with the activities by people you follow either contains the ids of the activities (normalized) or the full activity (denormalized).

Storing id alone will reduce memory usage but add an additional overhead of getting the data based on the id. So if you are building a system in which feed will go to many users , the data will be copied many times and hence memory can be insufficient.

With Redis you need to be careful about memory usage. Cassandra on the other hand has plenty of storage space, but is quite hard to use if you normalize your data.

Interviewer may ask the usage of Redis and Cassandra. Redis is read-optimized and stores all data in memory. Cassandra is write-optimized data store.

So with redis the following approach can be used:

1.Create your MySQL activity record
2. For each friend of the user who created the activity, push the ID onto their activity list in Redis.

A hybrid approach of pull and push model can also be used.

Please feel free to give comments, feedback, or alternative approaches on this answer.

Recommended reads:
http://www.whatisedgerank.com/
http://www.quora.com/What-are-best-practices-for-building-something-like-a-News-Feed
http://stackoverflow.com/questions/1443960/how-to-implement-the-activity-stream-in-a-social-network
http://www.quora.com/Activity-Streams/What-are-the-scaling-issues-to-keep-in-mind-while-developing-a-social-network-feed