News Feed is a feature of the social network Facebook. The web feed is the primary system through which users are exposed to content posted on the network. News Feed highlights information that includes profile changes, upcoming events, and birthdays, among other updates. (Source Wiki)
- Generated from People Page and Groups a user follows
- Users should be able to see live posts
- Users should be able to scroll down
- User should be able to like and comment
- Need Feed should display text, video , links and images
Non Functional Requirement
- High Available
- Low Latency
- Eventual Consistency
- Real time news feed in 2sec and 5 sec to fan out new posts
- 200 Mil Daily users each fatches minimum 5 times a day
- This leads to around 1 B request per day to fecth news feed
- 11, 500 Req per second
- 500 Post in each users Memory at a time
- 1 Post are Videos
- 19 Post Images
- 250 Post Links
- 230 Post Text
- Each text post average 1KB
- Each Image 100KB
- Each Video 10 KB (Stored as Link and then downloaded when you reach the video so it wont be stored in memory)
- Total Storage 1.4 MB Per User ~1MB
- 200 Mill Users total data = 200TB Data
- 1 Server Can store upto 100GB then we will need 2000 Machines
High Level Design
- Goal is to maximize cache hit ratio
- Facebook Cache stores Photos and Videos on NAND Flash based SSD’s
- SSDs have much higher capacity then DRAM and much higher read rate then Hard Drives
- SSD suffer from write implications — A single logical write results in multiple physical writes to the device. Higher write amplification is undesirable because it decreases the throughput and the lifespan of the SSD.
- Relative priority queue interface of RIPQ to implement a caching algorithm:
- insert(x,p) on cache miss,
- increase(x,p’) on cache hit,
- delete-min() is called implicitly for cache eviction.
We can have SOAP or REST APIs for getting the field.
Get User Feed
Call : getUserFeed(api_dev_key, user_id, next_id, count, max_id)
Return : 200 response
JSON containing the news feed ordered by id
3 Main Actors :
User Can Follow User or Page
Both User and Page can post feed link, image , video or text
Each FeedItem will have ID of user who created it
Each FeedItem can optionally have PageID
GraphDB are generally suited for managing friend relationship and recommendations.
- Cache is extremely important, user can cache multiple things
- News Feed ID
- Content — Popular Content in hot cache
- Social Graph — Store user relationship data
- Action — What action taken by user
- Counters for like , dislike etc
Feed Generation Steps
- Retrieve IDs of all users and entities that user follows.
- Retrieve latest, most popular and relevant posts for those ID.
- Rank these posts based on relevance.
- Store this feed in the cache and return top posts
- When User scrolls down new page gets loaded from server and old gets pushed out of Cache
We need mechanism to load new posts when user is online, we can do periodically say 5 min new loads
Detailed Component Design
select FeedItemID from FeedItem where userID in (select followerid in Follower where userId in current_user_id and type = 'Friend') Union select FeedItemID from FeedItem where userID in (select followerid in Follower where userId in current_user_id and type = 'Page')
- Celebrity problem — Slow for people with lots of connection.
- Generating entire time on load can be painfully slow
- Live updates result in feed update for all followers making it further slow
- For Live posts, server notifying or pushing can lead to heavy loads, best it pre generate the timeline
Solution — Offline generation of news feed. Dedicated servers that generates userfeeds and store them in memory. This way user feed is not compiled on load.
Feed generated based on Last Generated Feed.
Fanout — Pushing Feed to all users
Push — Fanout on Write
Pull — Fanout on Load
- Keep all the recent feed data in memory and can pull it from the server whenever they need it
- Manually or on Request basis
- Stale data till user issues pull
- Most pull request result in empty response as there are not many updates resulting in waste of resources
- Once user publishes post it can be published immediately to all followers-Quick
- Less Reads and also no need of pulling for each follower
- Faster due to feed precomputed
- Long Poll for receiving updates
- HotKey Problem if user has lot of followers has to push to many
- For users rarely logs precompute is waste of time
- Pull for users with less followers
- Post for Celebrity
- Or Limit fanout to online friends
Other things to consider
Hotkey problem can be resolved with consistent hashing.
Vertical Scaling vs Horizontal Scaling
SQL vs NoSql
Master Slave Replication