Abhi Khune | Pinterest engineer, Infrastructure
Our follower graph has millions of nodes and billions of edges, making it an interesting challenge to maintain and scale data as we build out the interest graph. The model is similar to those of Twitter or Facebook, but with some key differences based around interests that we account for in the product development and design phases.
The final version of the Pinterest follower service was developed, migrated and deployed in about 8 weeks with one full time engineer and 2–3 part time engineers.
Here I’ll explain how the service-oriented architecture has helped us develop and maintain the service as a unit of its own.
The Pinterest following model and interest graph
Facebook’s primary follower relationship is mutual and between two users, while Twitter’s is typically one-to-many. On Pinterest, following can be one-way, but goes beyond the individual and extends to interests. As people follow more boards and people, their home feed becomes more tailored to their interests, and the interest graph builds out further.
For example, if Andrea follows Bob, she’ll follow all of his boards, and if he creates a new board, she’ll automatically follow that board. If Andrea follows Bob’s Recipes board, she’ll see all of his pins from that board in her home feed. Andrea will also be listed as a follower of that board. We term the board followers as implicit followers (while the previous type of user-to-user follower is an explicit follower).
Action: follow user
Action: follow board
Action: unfollow board
Implicit relationships (in the reverse direction)
The follower model may seem complicated, but it allows for an easier experience for the pinner.
Low latency for common queries such as “Does user A follow user B?”, and acceptable latency for paging through the entire list of potentially millions of followers is needed. The UI also shows the accurate counts and paginated lists of a user’s followers and followings and board’s followers. We want acceptable latencies because the scenarios that page through all the followers are typically part of offline jobs like fanning out the pins.
Almost all pages either show a count for followers/following, or perform checks to see if the logged-in user follows the board or user being viewed. This requires the follower service to handle high throughputs and scale with the site. Ideally the service should leave headroom for additional throughput to support internal experiments, as well as adhere to the common distributed systems patterns: be horizontally scalable, be highly available, no data loss, and have no single points of failure (SPOF).
Our biggest challenges while building the service
- We didn’t find any off-the-shelf open source projects that met our requirements (such as efficient graph operations at this scale).
- The traditional model of storing the sharded graph on MySQL and caching it with memcached was reaching its limits.
- Caching the graph data is hard because the cache is useful only if the entire subgraph of a user (vertex) is in cache, however this can quickly result in an attempt to cache the entire graph!
- Caching also implies that queries like “Does user A follow user B?” are either in the cache or not. But more often than not the answer is ‘No’ (i.e. user A is not following user B), which results in a cache miss requiring expensive lookups to the persistent store.
The inner workings of the follower graph
The corpus size of the entire Pinterest follower graph is relatively small, so loading the entire graph in memory is feasible.
Redis stores the graph, which is sharded by user IDs, and the Redis AOF feature updates to disk every second to avoid significant data loss. Since Redis is single threaded, we run multiple instances of Redis to fully utilize the CPU cores. Each Redis instance serves one virtual shard, which allows easy splitting of shards when the instances on one machine reach capacity.
Digging into the data model
Before understanding the data model we chose, let’s sum up the common operations the follower service needs. We need to efficiently respond to point queries such as “Does user A follow user B”, support filtering queries used on a search page such as “Which of these 25 users are followed by me?”, and get all users followed by users or boards to fan out the incoming pins. In order to respond to such queries, we maintain these relationships per user:
- list of users who are followed explicitly by the given user (recall that explicitly means that all the current and future boards of the user are followed)
- list of users who are followed implicitly by the given user, i.e. one or more boards are followed by the user
- list of followers for a given user (explicit followers)
- list of followers for a given user’s one or more boards (implicit followers)
- list of boards followed explicitly by a given user
- list of boards unfollowed explicitly by a given user (many users follow another user but then unfollow a few boards that don’t match their interests)
We also maintain these relationships per board:
- board’s explicit followers
- board’s explicit unfollowers
The corresponding Redis data structures used to materialize these relationships are (per user):
- Redis SortedSet, with timestamp as the score, is used to store the users followed explicitly. We use a SortedSet for two reasons: first, the product requirements state that the users should be listed in reverse chronological order and second, having a sorted set allows us to paginate through the list of ids
- Redis SortedSet, with timestamp as the score, is used to store the users followed implicitly
- Redis SortedSet, with timestamp as the score, is used to store the user’s explicit followers
- Redis SortedSet, with timestamp as the score, is used to store the user’s implicit followers
- Redis Set is used to store boards followed explicitly
- Redis Set is used to store boards unfollowed explicitly
And these are relationships materialized per board:
- Redis Hash is used to store a board’s explicit followers
- Redis Set is used to store a board’s explicit unfollowers
The entire user id space is split into 8192 virtual shards. We place one virtual shard per Redis DB, and run multiple Redis instances (ranging from 8 to 32) on each machine depending on the memory and CPU consumption of the shards on those instances. Similarly, we run multiple Redis DBs per Redis instance.
When a Redis machine reaches either the memory or CPU thresholds, we split it either horizontally or vertically. Vertical sharding a Redis machine is simply cutting the number of running Redis instances on the machine by half. We bring up a new master as a slave of the existing master and once the slaving is complete, we make it the new master for half of the Redis instances leaving the old master as the master for the other half.
We use Zookeeper to store the shard configurations. Since Redis is single-threaded server, it’s important to be able to split the instances horizontally to fully utilize all the machine cores.
Avoiding the Panic Button: Backups and failure scenarios
We run our cluster in a Redis master-slave configuration, and the slaves act as hot backups. Upon a master failure, we failover the slave as the new master and either bring up a new slave or reuse the old master as the new slave. We rely on ZooKeeper to make this as quick as possible.
Each master Redis instance (and slave instance) is configured to write to AOF on Amazon EBS. This ensures that if the Redis instances terminate unexpectedly then the loss of data is limited to 1 second of updates. The slave Redis instances also perform BGsave hourly which is then loaded to a more permanent store (Amazon S3). This copy is also used by Map Reduce jobs for analytics.
As a production system, we need many failure modes to guard ourselves. As mentioned, if the master host is down, we will manually failover to slave. If a single master Redis instance reboots, monit restart restores from AOF, implying a 1 second window of data loss on the shards on that instance. If the slave host goes down, we bring up a replacement. If a single slave Redis instance goes down, we rely on monit to restart using the AOF data. Because we may encounter AOF or BGsave file corruption, we BGSave and copy hourly backups to S3. Note that large file sizes can cause BGsave induced delays but in our cluster this is mitigated by smaller Redis data due to the sharding scheme.
Lessons from late-night hacking
- Broad and deep coverage on unit tests saves time in the long run. It’s also ideal to plan a longer bake time for such service launches.
- We could have prevented bugs by having a language or framework that natively supported asynchronous call framework.
- We used Redis LUA feature in a small niche feature to maintain write consistency. We expected to encounter small sized sets but discovered that this may cause high CPU usage (strcpy) in a small fraction of users who had abnormally large number of boards unfollowed.
The master-slave based solution implies that in the event of a master failure, a few shards are unavailable for writes. One future improvement might be to make the master/slave failover automatic to further reduce the window when the master is unavailable for its writes. Ideally, we also want to invest in a repair capability in case of catastrophic failures in order to restore from the S3 backups. Another potential area of improvement is connection load balancing in our thrift client.
In the end, when we migrated away from the existing sharded MySQL cluster, we saved about 30% IOps.
Moving fast and building infrastructure and products that scale is one of the best parts of the job. We hope the lessons we’ve learned can help others out there!
Abhi Khune is an engineer at Pinterest and works on Infrastructure.