TAO — Facebook’s Distributed Data Store for the Social Graph
I will be covering the architecture and key design principles outlined in the paper that came out of Facebook on its graph data store, TAO. This is an attempt to summarize the architecture of a highly scalable graph store that supports objects and their associations, for a read-heavy workload of billions of requests per second. In Facebook’s case, reads comprise more than 99% of the requests and writes are less than a percent.
Facebook has billions of users, and most of these users consume content far more often than they create it, so the workload is naturally read heavy. Facebook initially implemented a distributed lookaside cache using memcached, which this paper references a lot. In that setup, the lookaside cache supports all the reads while writes go to the database. A good cache-hit rate ensures good performance and keeps the database from being overloaded. The following figure shows how a memcached-based lookaside cache is used at Facebook for optimizing reads.
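To make the lookaside pattern concrete, here is a minimal sketch of the demand-fill read path and invalidate-on-write behavior described above. The `cache` and `db` dicts are stand-ins for a memcached client and MySQL, not real connections.

```python
# Stand-in stores for memcached and the database (illustrative only).
cache = {}
db = {"user:1": "Alice"}

def read(key):
    # Try the cache first; on a miss, demand-fill from the database.
    if key in cache:
        return cache[key]
    value = db.get(key)
    cache[key] = value
    return value

def write(key, value):
    # Writes go to the database; the cached copy is invalidated rather than
    # updated, so the next read re-fills it with the authoritative value.
    db[key] = value
    cache.pop(key, None)
```

Invalidation (rather than in-place update) is the usual lookaside choice, since it avoids racing concurrent writers in the cache.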
While this is immensely useful, most information in Facebook is best represented using a social graph, and the content that gets rendered on a page depends heavily on each user’s privacy settings and is personalized for every user. This means that the data needs to be stored as-is and then filtered when it is viewed/rendered.
Consider the social graph generated by a typical simple activity such as: “someone visited the golden gate bridge with someone else and then a few folks commented on it”.
Representing this information in a key-value store like the lookaside cache becomes very tricky and cumbersome. Some of the key motivations for having a native graph-based store are:
- One possible implementation is to use a formatted list of edges as a single value. But that means every access requires loading the entire edge-list, and the same goes for modifying a single edge. One could introduce native list types that can be associated with a key, but that only solves the problem of efficient edge-list lookup. In a social graph, many objects are interlinked, and coordinating such updates via edge-lists is tricky.
- In the memcached implementation at Facebook, memcached issues leases that tell clients to wait for some time, which prevents thundering herds (reads and writes on the same popular objects causing cache misses that all fall through to the database). This moves control logic to the clients, and since clients don’t communicate with each other, it adds complexity there. In the model of objects and associations, everything is controlled by the TAO system, which can implement these mechanisms efficiently, leaving clients free to iterate quickly.
- Using graph semantics, it becomes more efficient to implement a read-after-write consistency model.
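The first bullet’s awkwardness is easy to see in code. In this toy illustration (the serialized-list format is an assumption, not anything from the paper), adding one friend forces a full read-modify-write of the entire edge-list value:

```python
import json

# Toy key-value store where a user's whole friend list is one serialized value.
store = {"friends:alice": json.dumps(["bob", "cathy", "david"])}

def add_friend(user, friend):
    # Even a single-edge update must load, deserialize, modify, and
    # rewrite the entire list; concurrent updates would also conflict here.
    edges = json.loads(store[f"friends:{user}"])
    edges.append(friend)
    store[f"friends:{user}"] = json.dumps(edges)
```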
So TAO instead provides objects and associations as the basic units of access in the system. It is optimized for heavy reads and is consistent most of the time, but in failure cases it falls back to eventual consistency.
The data model consists of two main entities:
Object: It maps an “id” to “ObjectType, (Key, Value) pairs”
In the example above, Alice is an Object of type User. The comment added by Cathy is an Object of type Comment with the text “wish we were there”. Objects are a better fit for things that can repeat, like comments.
Association: It maps “Object1, AssociationType, Object2” to “Time, (Key, Value) pairs”. Associations represent relationships that happen at most once: two friends are connected at most once using an association. The usefulness of the time field will become clearer in the following sections on how queries work.
In the example above, Alice and Cathy are associated with each other via an association of type friend. The checkin object and the Golden Gate location object are also connected to each other, with a different association type in each direction: the Golden Gate location object is connected to the checkin object via a checkin association, while the checkin object connects to the Golden Gate location object via a location association.
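The data model above can be sketched with two small types. The field names follow the description in this post, not TAO’s actual code, and the example values are taken from the checkin scenario:

```python
from dataclasses import dataclass, field

@dataclass
class TaoObject:
    id: int
    otype: str                                 # e.g. "USER", "COMMENT", "CHECKIN"
    data: dict = field(default_factory=dict)   # (key -> value) pairs

@dataclass
class Association:
    id1: int        # originating object
    atype: str      # e.g. "FRIEND", "CHECKIN", "LOCATION"
    id2: int        # destination object
    time: int       # enables recency-ordered queries
    data: dict = field(default_factory=dict)

alice = TaoObject(id=1, otype="USER", data={"name": "Alice"})
comment = TaoObject(id=2, otype="COMMENT", data={"text": "wish we were there"})
friendship = Association(id1=1, atype="FRIEND", id2=3, time=1354459200)
```

Note the asymmetry: an association is keyed by its originating object, so the reverse direction (Cathy → Alice) is a separate association record.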
APIs on objects and associations
Object APIs are straightforward: they allow creation, modification, deletion, and retrieval of objects by their ids.
Association creation, modification, and deletion APIs mutate the link between the two object ids for a given association type.
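A minimal in-memory sketch of the association write APIs might look like the following. The function names mirror the paper’s `assoc_add`/`assoc_delete`, but the bodies (and the dict-based store) are purely illustrative:

```python
# In-memory association store: (id1, atype, id2) -> {"time": ..., "data": ...}
assocs = {}

def assoc_add(id1, atype, id2, time, data=None):
    # Creates or overwrites the single (id1, atype, id2) link, matching the
    # "at most once" semantics of associations.
    assocs[(id1, atype, id2)] = {"time": time, "data": data or {}}

def assoc_delete(id1, atype, id2):
    # Removing a non-existent link is a no-op.
    assocs.pop((id1, atype, id2), None)
```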
More interesting are the association query APIs. This is where the power of graph semantics comes into play. Consider queries such as:
“Give me the most recent 10 comments about a checkin by Alice”
This can be modeled as assoc_range(CHECKIN_ID, COMMENT, 0, 10). This is also where the time field attached to associations comes in handy: it lets results of queries like this be sorted by recency.
“How many likes did the comment from Cathy have?”
assoc_count(COMMENT_ID, LIKED_BY). This query returns the number of “likes” associated with the comment.
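Illustrative in-memory versions of the two queries above could look like this (real TAO serves these out of its cache tier; the tuple-list store and ids are made up for the example):

```python
# (id1, atype, id2, time) tuples standing in for stored associations.
assocs = [
    (632, "COMMENT", 771, 100),
    (632, "COMMENT", 772, 200),
    (632, "COMMENT", 773, 300),
]

def assoc_range(id1, atype, pos, limit):
    # Most-recent-first slice of matching associations, ordered by time.
    matches = sorted((a for a in assocs if a[0] == id1 and a[1] == atype),
                     key=lambda a: a[3], reverse=True)
    return matches[pos:pos + limit]

def assoc_count(id1, atype):
    # Number of associations of this type originating at id1.
    return sum(1 for a in assocs if a[0] == id1 and a[1] == atype)
```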
At a high level, TAO uses MySQL as the persistent store for objects and associations. This way they get all the features of database replication, backups, migrations, etc., where other systems like LevelDB didn’t fit their needs in this regard.
The overall contents of the system are divided into shards. Each object_id contains a shard_id, reflecting the logical location of that object, which in turn translates to the host for that object. Associations are stored on the same shard as their originating object (remember that an association is defined as Object1, AssociationType, Object2). This ensures better locality and allows objects and their associations to be retrieved from the same host. There are far more shards in the system than hosts running the MySQL servers, so many shards map onto a single host.
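The sharding scheme can be sketched as below. The modulo-based encoding and the constants are assumptions for illustration; the paper only says the id embeds a shard_id and that shards greatly outnumber hosts:

```python
NUM_SHARDS = 1024   # logical shards (assumed count)
NUM_HOSTS = 16      # physical MySQL hosts (assumed count)

def shard_of(object_id):
    # Hypothetical encoding: derive the shard_id from the object id.
    return object_id % NUM_SHARDS

def host_of(shard_id):
    # Many shards map onto each host.
    return shard_id % NUM_HOSTS

def shard_of_assoc(id1, atype, id2):
    # An association lives on the shard of its originating object (id1),
    # so an object and its outgoing edges share a host.
    return shard_of(id1)
```

The indirection through shards is what makes rebalancing cheap: moving a shard between hosts changes the shard-to-host map, not the ids themselves.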
All the data belonging to an object is serialized and stored against its id, which makes the object table design in MySQL pretty straightforward. Associations are stored similarly, with the id as the key and the data serialized into one column. Given the queries mentioned above, additional indices are built on the association tables for the originating id (Object1), time-based sorting, and the type of association.
As in the memcached paper, it is still very important to offload database workload using a caching layer. A client requesting information connects to a cache first. This cache belongs to a tier consisting of multiple such caches and the database, which are collectively responsible for serving objects and associations.
On a read miss, caches can contact nearby caches or go to the database. On a write, caches go to the database for a synchronous update. This helps with read-after-write consistency in most cases; more details on this in the following sections.
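Contrast this with the lookaside sketch earlier: here the cache is write-through, updating its own copy after the synchronous database write, which is what gives read-after-write behavior for that cache’s clients. A minimal sketch (stand-in dicts, no real RPCs):

```python
# Stand-ins for a TAO cache server's store and the backing database.
cache = {}
db = {"obj:9": {"name": "Alice"}}

def read(key):
    if key in cache:            # read hit served from the cache
        return cache[key]
    value = db.get(key)         # read miss fills from the database
    cache[key] = value
    return value

def write(key, value):
    db[key] = value             # synchronous write to the database first...
    cache[key] = value          # ...then update the cached copy, so a
                                # subsequent read through this cache sees it
```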
Scaling the caching servers: Regions, Tiers, Leaders and Followers
An obvious thought is that one can keep adding more caching servers to a tier. But this makes the tier very fat and thus prone to hot spots, and the cost of communication can grow quadratically as the tier grows fatter.
Hence the idea is to have a two-level hierarchy. A region consists of multiple tiers, but only one tier is the leader tier and the rest are follower tiers. Read hits are served by the follower tier where the client’s request landed; read misses and writes are always forwarded to the leader tier in the region. Within a tier, hotspots can be alleviated by consistent hashing, which makes adding tiers easier without a lot of cache rebalancing. In addition, followers can offload the read workload for popular objects all the way to the clients, which can cache them for longer.
With this hierarchy, one big advantage is that a single leader tier coordinates all access to the database. The leader tier is therefore naturally consistent and always up to date. In addition, it can protect the database from a thundering herd by rate-limiting pending queries, and it can avoid overlapping range scans, which are detrimental to performance.
In this architecture, multiple follower tiers operate independently of each other. Follower tier 1 can update a value, and tier 2 has no way of knowing about the new value. Hence the follower tiers need to be made aware of changes that originated via other follower tiers. This is achieved by cache maintenance messages that the leader sends to the followers asynchronously, which means a follower is only eventually consistent with respect to writes that came through other tiers.
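A toy model of this cache maintenance flow, with the asynchronous messages delivered inline for simplicity (all class names are made up for the sketch):

```python
class Follower:
    def __init__(self):
        self.cache = {}

class Leader:
    def __init__(self, followers):
        self.db = {}
        self.followers = followers

    def write(self, key, value, origin):
        self.db[key] = value           # synchronous write to the database
        origin.cache[key] = value      # the originating follower sees its own write
        for f in self.followers:       # async maintenance messages in real TAO;
            if f is not origin:        # delivered inline in this toy model
                f.cache.pop(key, None) # invalidate stale copies elsewhere
```

The window between the database write and the delivery of these messages is exactly the eventual-consistency window the paragraph above describes.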
Scaling and Geography
While the last section addresses scaling within a given data center, Facebook has billions of users widely spread over multiple continents and geographies. If a request for a popular shard comes from Asia, but the shard is served by a follower hosted in a data center in the US, then the read latency on such objects will be significant. To address this, a shard can be hosted by a slave region in Asia that has a replica DB, followers, and a leader.
At a high level, the slave region serves the following purposes:
- Slave followers continue to serve read hits from their own caches.
- Slave followers go to the slave leader for read misses, which in turn goes to the replica database for the value.
- Writes go to the slave leader, then to the master leader, and then to the master DB synchronously. The master DB is replicated to the slave DB.
- The replication link triggers updates at the slave leader, which in turn sends invalidation messages to the slave followers.
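The write path in the bullets above can be sketched end to end. Everything here is a stand-in object, and the replication step runs inline even though it is asynchronous in reality:

```python
class Region:
    def __init__(self):
        self.db = {}               # master DB or replica DB
        self.follower_cache = {}   # caches in this region's follower tiers

def write_from_slave(slave, master, key, value):
    # 1. Slave leader forwards the write to the master leader, which
    #    commits it synchronously to the master DB.
    master.db[key] = value
    # 2. The master DB replicates to the slave DB (async in reality).
    slave.db[key] = value
    # 3. The replication stream triggers the slave leader, which sends
    #    invalidations to its followers so they re-fetch the new value.
    slave.follower_cache.pop(key, None)
```

Until steps 2 and 3 run, slave-region readers may see the old value, which is where the eventual consistency discussed earlier shows up across regions.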
Since consistency was discussed in detail earlier, let’s look at what happens when any of these major components fail.
Failures are handled by aggressive timeouts and routing around unresponsive components. Hosts that stop responding are marked as down and undergo further diagnostics.
If the master database goes down, one of the slaves is promoted to the master role. A slave database failure can be addressed by going to the leader in the master region.
When a leader server fails, other servers in the leader tier are used to route around it: followers send requests to a random leader in that tier, and read misses can be served by going to the local database directly.
When a follower fails, other follower tiers can help with serving its shards. Each client is configured with a primary and a backup follower tier.
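The client-side routing in the last two paragraphs can be sketched as a pair of small helpers. The tier names and the random-choice policy are assumptions for illustration:

```python
import random

def pick_follower(primary_up, primary="tier-A", backup="tier-B"):
    # Each client knows a primary and a backup follower tier and fails
    # over to the backup when the primary is unresponsive.
    return primary if primary_up else backup

def pick_leader(leader_hosts, down):
    # Route around failed leader hosts by picking randomly among the
    # surviving members of the leader tier.
    alive = [h for h in leader_hosts if h not in down]
    return random.choice(alive)
```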
I have come across a lot of literature on key-value stores of different types such as Cassandra, BigTable, C-Store, etc., but hadn’t really followed in detail the challenges that motivate graph databases. This paper summarizes them very well. The geography-based scaling also seems like a good approach for a truly distributed database. In general, I have heard a lot more buzz in the industry about Kafka for pub-sub and about Cassandra than about graph databases like Neo4j or TAO. Given how personalized the web is getting, this seems counterintuitive to me.