Improve user experience: solving core data inconsistencies at Pinterest
Zhihuang Chen | Software Engineer, Core-Services
Challenges naturally occur with Pinterest’s rapid growth. As a Pinner, you might have noticed instances where your data doesn’t look “correct,” and you may have had a negative experience because of it. For example, the “Pin count” in your profile might show the wrong number of Pins, as shown in the left picture.
We call these referential integrity scenarios “data inconsistencies,” because data stored in one backend doesn’t match data stored in another. These inconsistencies directly hurt the Pinner experience, since we end up showing outdated content. They also add significantly to our team’s maintenance cost: when Pinners find their data is wrong, they may report it to our operations team, and when the operations team can’t resolve the problem through regular operations, they file bug tickets for us. To improve the user experience and reduce our team’s maintenance work, we set out to build a tool that automatically detects and fixes these inconsistencies.
This project focuses on tackling the inconsistencies among our three core models: Pinners, boards, and Pins.
We use MySQL as our primary datastore for content created by Pinners. To store billions of Pins, boards, and other data for hundreds of millions of Pinners, many MySQL database instances form a cluster, which is split into logical shards so the data can be managed and served more efficiently; all data is distributed across these shards.
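To make the sharding concrete, here is a minimal sketch of routing an object id to a logical shard. The hashing scheme, `shard_for_id`, and `NUM_SHARDS` are illustrative assumptions, not Pinterest’s actual mapping:

```python
# Hypothetical sketch: route an object id to one of N logical shards.
# Pinterest's real id-to-shard mapping is more involved; these names
# and the modulo scheme are illustrative only.
NUM_SHARDS = 4096

def shard_for_id(object_id: int) -> int:
    """Return the logical shard that owns this object's rows."""
    return object_id % NUM_SHARDS

# Two related objects can land on different shards, so a single
# logical write may span multiple MySQL instances.
pin_id, board_id = 123456789, 987654321
```

Because related rows (a Pin and its board, say) can live on different shards, a write that touches both cannot always be wrapped in a single MySQL transaction.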
Below is a simplified relationship among these three models we stored in MySQL:
Unfortunately, because data is processed asynchronously across multiple flows, inconsistency is unavoidably introduced into our datastore.
Our databases are sharded, and some writes are asynchronous, which means we can’t always perform an update in one transaction. For example, when you create a Pin, we insert into the Pins table, the Pinner-Pin relationship table, and the board-Pin relationship table. Ideally we would do these three updates in one transaction, but sometimes the three tables don’t live on the same shard, or the three updates happen asynchronously. Inconsistency arises when one write succeeds but the others fail.
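The failure mode above can be sketched with in-memory stand-ins for the three tables (all names and shapes here are hypothetical, not our actual schema):

```python
# Sketch of the failure mode described above. Each dict stands in for a
# table that may live on its own shard, so the three inserts cannot share
# one MySQL transaction; a failure between them leaves data inconsistent.
pins, pinner_pins, board_pins = {}, {}, {}

def create_pin(pin_id, pinner_id, board_id, fail_after=None):
    pins[pin_id] = {"pinner": pinner_id, "board": board_id}
    if fail_after == 1:
        raise RuntimeError("failure after first write")
    pinner_pins.setdefault(pinner_id, set()).add(pin_id)
    if fail_after == 2:
        raise RuntimeError("failure after second write")
    board_pins.setdefault(board_id, set()).add(pin_id)

# A failure midway leaves the Pin row present but the board-Pin row missing.
try:
    create_pin(1, pinner_id=10, board_id=20, fail_after=2)
except RuntimeError:
    pass
```

After the simulated failure, the Pin exists and is attached to its Pinner, but the board-Pin relationship row was never written: exactly the kind of inconsistency the tool must detect.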
We solve this problem with a tool that auto-detects and auto-resolves data inconsistencies. At its core, the tool consists of two kinds of validation jobs:
- Existence validation jobs, shown in the pink box, are used to check the existence of data in the database. The existence checking jobs will be triggered on every write to three core objects (Pinners, boards, and Pins).
- Stat validation jobs, shown in the orange box, are used to check stats accuracy. One of the stats checking jobs checks user stats, which store the number of public Pins a Pinner has; the other checks board stats, which store the number of Pins a board has.
Existence checking jobs are lighter than stats checking jobs because they only need to query whether the data is in the table. Stats checking jobs involve more computation, since validating a specific number requires pulling the underlying data; for example, checking board stats means fetching all Pins of the board to calculate the count. All of these jobs run asynchronously on Pinlater, Pinterest’s in-house job scheduling and execution tool. Compared with our other job scheduling tools, Pinlater is the most lightweight, and it provides high throughput and an adjustable dequeue rate. Its enqueue-to-dequeue latency is also low, near real time.
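A stats validation check can be sketched as recomputing the count from the relationship rows and comparing it with the stored counter (the data shapes below are hypothetical stand-ins for the real tables):

```python
# Sketch of a board stats validation check. board_pins stands in for the
# board-Pin relationship table; board_stats for the stored counters.
board_pins = {20: {1, 2, 3}}   # board_id -> set of pin_ids
board_stats = {20: 5}          # board_id -> stored Pin count (stale here)

def board_stats_consistent(board_id: int) -> bool:
    """Recompute the board's Pin count and compare with the stored stat."""
    actual = len(board_pins.get(board_id, set()))
    return board_stats.get(board_id, 0) == actual
```

The recomputation is what makes stats jobs heavier than existence jobs: it has to pull every Pin of the board rather than probe for a single row.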
As shown in the diagram above, the whole flow is:
- The tool is triggered when our service detects a write operation on a core object, and it enqueues a proxy job with parameters such as the unique object id, the operation type, and additional parameters.
- Then, this proxy job will enqueue one of the existence checking jobs.
- Some operations also affect stats, so we may enqueue stats jobs too. For example, if a Pin is created, board stats should increase by one; if a Pin is deleted, board stats should decrease by one.
- Once these async jobs are dequeued and executed, the job logic checks databases and caches to make sure the data is consistent.
- If it detects an inconsistency, it fixes it and re-enqueues the job to check again and confirm the data is consistent.
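The check-fix-recheck loop at the end of the flow can be sketched end to end (the queue and fixer below are hypothetical stand-ins for Pinlater jobs and the real repair logic):

```python
from collections import deque

# Hypothetical stand-ins: board_pins for the relationship table,
# board_stats for stored counters, queue for the Pinlater job queue.
board_pins = {20: {1, 2}}
board_stats = {20: 9}          # deliberately inconsistent
queue = deque()

def enqueue_stats_check(board_id):
    queue.append(board_id)

def run_stats_check(board_id):
    actual = len(board_pins.get(board_id, set()))
    if board_stats.get(board_id) != actual:
        board_stats[board_id] = actual   # fix the inconsistency
        enqueue_stats_check(board_id)    # re-check to confirm the fix

enqueue_stats_check(20)
while queue:
    run_stats_check(queue.popleft())
```

The second pass finds the stat already correct and stops, so the loop terminates once the data is consistent.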
Deferred and Limited Job Execution
One thing worth noting is that job execution is deferred. Why? First, some jobs are enqueued before the update happens, because we want to guarantee they are enqueued even if the update fails due to database issues or network failures; these jobs therefore need to wait for the update to be executed. Second, after the database is updated, there may be asynchronous follow-up work, such as cleaning caches or, for a board deletion, deleting its Pins. Different jobs may also need different delays.
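Per-job delays can be expressed as a simple configuration map. The job names and delay values below are illustrative assumptions, not our production settings:

```python
# Hypothetical per-job execution delays (seconds): each check is deferred
# long enough for the original write and its async follow-up work
# (cache cleanup, cascading deletes) to finish first.
JOB_DELAY_SECONDS = {
    "pin_existence_check": 30,
    "board_existence_check": 60,
    "board_stats_check": 300,   # board deletion may cascade to many Pins
}

def delay_for(job_name: str) -> int:
    """Return the enqueue delay for a job, with a default fallback."""
    return JOB_DELAY_SECONDS.get(job_name, 60)
```

The scheduler would pass this delay when enqueuing, so each job is dequeued only after its upstream work has had time to complete.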
Another important piece is limited stats job execution. As mentioned above, board stats change whenever a Pin is created or deleted, so every Pin creation or deletion can trigger an enqueue of the board stats checking job. If you delete a board containing 100 Pins, the stats job for that board_id would be enqueued 100 times. This wastes computing resources and puts extra load on our services, since each job queries data online. To avoid this, we use memcache to store the IDs we have already enqueued: before enqueuing, we check whether the ID is in the cache, and if it is, a job is already queued and we skip the enqueue. When the stats checking job is dequeued and executed, it deletes the cache entry so the ID can be enqueued again. Even if the delete fails, the cache entry has a TTL and is removed once it expires.
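The deduplication scheme can be sketched with an in-memory stand-in for memcache (the real system uses memcached with a TTL; the names and TTL value below are hypothetical):

```python
import time

# In-memory stand-in for memcache-based enqueue deduplication.
DEDUP_TTL = 600                # seconds; illustrative value
_cache = {}                    # board_id -> expiry timestamp
enqueued_jobs = []             # stand-in for the Pinlater queue

def try_enqueue_board_stats_check(board_id):
    """Enqueue a stats check only if no check is already pending."""
    now = time.time()
    expiry = _cache.get(board_id)
    if expiry is not None and expiry > now:
        return False           # a check is already queued; skip
    _cache[board_id] = now + DEDUP_TTL
    enqueued_jobs.append(board_id)
    return True

def on_stats_check_executed(board_id):
    # Allow future enqueues once the check has run. Even if this delete
    # is missed, the TTL eventually clears the entry.
    _cache.pop(board_id, None)
```

With this guard, deleting a board with 100 Pins enqueues the stats check once rather than 100 times; the entry is cleared when the job runs, re-arming the check for future writes.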
We haven’t received a customer support ticket about these issues in six months, because inconsistencies are fixed automatically. All new inconsistencies introduced into our system are now fixed within 24 hours.
Thanks to Kapil Bajaj, Qi Li, and the rest of the Core Services team at Pinterest!