Partitioning Pinlater for Fun(?) and Profit(!)

Pinterest Engineering
5 min read · Aug 27, 2020

Ernie Souhrada | Database Engineer & Bit Wrangler

Contained within the November 2015 announcement of the open-sourcing of Pinlater, our asynchronous job execution system, was a description of some of the improvements that had been made to the system since the original 2014 blog post describing its creation. Notable among these was the switch from Redis to MySQL/InnoDB as the backend data store due to its enhanced durability and ability to performantly handle a data set that is much larger than system memory. Usage of this new and improved Pinlater has grown by leaps and bounds over the past few years, exceeding the original aggregate processing rate of 100,000 jobs per second many times over with better reliability and at a much lower cost than what would have been incurred with a Redis-based solution. However, as you might expect, greater scale often brings new challenges.

Ask any experienced MySQL engineer whether or not they think InnoDB makes a good backing store for a queueing system that’s going to process even a moderately-high query volume, and you’re likely to receive reactions that run the gamut from abject disgust to uncontrollable laughter. Yes, it’s relatively easy to implement, and between auto-incrementing primary keys and row-level locking for concurrency control, many of the basic elements one might need in a queueing system are present by default, but there’s no such thing as a free puppy. You might think it’s cute and cuddly at first, but eventually it will get hungry, start barking and keep you up at night, or destroy your shoes and pee on your carpet if you don’t take it out for regular exercise. Mitigating the negative aspects of using MySQL as a queuing system, just like puppy maintenance, is a nontrivial undertaking.

Step 1: Find the Missing Disk Space

A complete accounting of all of the reasons why MySQL is not recommended for use as a queueing store is beyond the scope of this blog post; I’m going to focus instead on how we recently addressed one specific issue at the root of significant operational overhead and resource utilization inefficiency within Pinlater: wasted disk.

A notable property of InnoDB data files is that they never shrink on their own. Start with an empty table, write a million rows into it, and then look in the MySQL data directory to see how much space that file takes up on disk. Then, one by one, start deleting rows until there’s only one left. Even when you’ve finally deleted that last row, the size of the file will be the same as it was when the table was full. Do this enough times and eventually your server will run out of disk space, even though it’s barely holding any data. For a queueing system which is constantly writing new jobs to tens or even hundreds of different tables on any given MySQL server and then deleting old jobs as they are completed or are no longer needed, it’s not hard to see how bloated data files could pose a problem.

Our early solutions to this problem were simple and straightforward, but also suboptimal and unnecessarily labor-intensive: either rebuild the tables that had significant amounts of free space (potentially taking a hit to the Pinlater success rate, since writes to a table are blocked while it is rebuilt), or stop enqueueing new jobs on the clusters that were alerting and provision new clusters to replace them.
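For reference, an in-place rebuild is typically one of the following two statements (shown generically here; on InnoDB, OPTIMIZE TABLE is implemented as a full table rebuild):

```sql
-- Rebuild the table to return stranded free space to the filesystem.
OPTIMIZE TABLE jobs;

-- The equivalent rebuild, spelled out explicitly:
ALTER TABLE jobs ENGINE = InnoDB;
```

Either way, the entire table is copied, which is exactly why rebuilding hundreds of busy queue tables by hand doesn't scale.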

Step 2: ????

Fast-forward to 2018. As part of a larger Pinlater efficiency project, I started taking a look at just how much disk space was actually being wasted as a result of the aforementioned InnoDB behavior, and the numbers were startling. In aggregate, all of the Pinlater tables which had at least 5GB of reclaimable space were occupying 100TB on disk, yet they contained a mere 3.5TB of actual data. Certainly, this could have been addressed by automating some of the reclamation tasks I described above, but I proposed we try something different: MySQL Partitioning.
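Reclaimable-space numbers like these come from InnoDB's own bookkeeping in information_schema; a query along these lines (the 5GB threshold mirrors the one above, and note that DATA_FREE is only an estimate) surfaces the worst offenders:

```sql
-- Estimate reclaimable space per table. DATA_FREE is InnoDB's
-- estimate of free bytes inside each tablespace file.
SELECT table_schema, table_name,
       ROUND(data_length / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(data_free   / 1024 / 1024 / 1024, 1) AS free_gb
FROM information_schema.tables
WHERE engine = 'InnoDB'
  AND data_free >= 5 * 1024 * 1024 * 1024
ORDER BY data_free DESC;
```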

To illustrate, here is an abbreviated schema definition of a Pinlater table prior to partitioning.

CREATE TABLE jobs (
  local_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  state TINYINT NOT NULL,

  updated_at TIMESTAMP NOT NULL,
  created_at TIMESTAMP NOT NULL,
  run_after TIMESTAMP NOT NULL,
  INDEX (state, run_after)
)

The basic job flow within Pinlater is that each job begins in state 0 (NEW), and then once claimed by a worker, transitions to state 1 (IPR = in-progress), and ultimately finishes in either state 2 (OK) or state 3 (FAIL). Periodically, jobs which have succeeded or failed are purged from the table. In the original design, this was done with a simple DELETE query that often led to undesirable lock contention or CPU spikes and, as discussed previously, did nothing to recycle the unused disk space.
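The original purge was a plain ranged DELETE, something along these lines (the exact query and retention window here are illustrative, not the production code):

```sql
-- Purge finished jobs in batches. Ranged DELETEs like this take many
-- row locks, spike CPU, and leave the freed space stranded inside
-- the InnoDB data file.
DELETE FROM jobs
WHERE state IN (2, 3)                        -- OK or FAIL
  AND updated_at < NOW() - INTERVAL 1 HOUR   -- illustrative TTL
LIMIT 10000;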

Now here’s an abbreviated schema from my proposed solution.

CREATE TABLE jobs (
  local_id BIGINT NOT NULL AUTO_INCREMENT,
  state TINYINT NOT NULL,

  updated_at DATETIME NOT NULL,
  created_at TIMESTAMP NOT NULL,
  run_after TIMESTAMP NOT NULL,
  PRIMARY KEY (local_id, state, updated_at),
  INDEX (run_after)
) PARTITION BY RANGE COLUMNS (state, updated_at) (
  PARTITION pNEW_2020051200_120 VALUES LESS THAN (0, '2020-05-12 00:00:00'),
  PARTITION pNEW_2020051700_120 VALUES LESS THAN (0, '2020-05-17 00:00:00'),
  PARTITION pNEW_FUTURE VALUES LESS THAN (0, MAXVALUE),
  PARTITION pIPR_2020050700_24 VALUES LESS THAN (1, '2020-05-07 00:00:00'),
  PARTITION pIPR_2020050800_24 VALUES LESS THAN (1, '2020-05-08 00:00:00'),
  PARTITION pIPR_2020050900_24 VALUES LESS THAN (1, '2020-05-09 00:00:00'),
  PARTITION pIPR_FUTURE VALUES LESS THAN (1, MAXVALUE),
  PARTITION pOK_2020050700_1 VALUES LESS THAN (2, '2020-05-07 00:00:00'),
  PARTITION pOK_2020050701_1 VALUES LESS THAN (2, '2020-05-07 01:00:00'),
  PARTITION pOK_2020050702_1 VALUES LESS THAN (2, '2020-05-07 02:00:00'),
  PARTITION pOK_FUTURE VALUES LESS THAN (2, MAXVALUE),
  PARTITION pFAIL_2020050700_2 VALUES LESS THAN (3, '2020-05-07 00:00:00'),
  PARTITION pFAIL_2020050702_2 VALUES LESS THAN (3, '2020-05-07 02:00:00'),
  PARTITION pFAIL_2020050704_2 VALUES LESS THAN (3, '2020-05-07 04:00:00'),
  PARTITION pFAIL_FUTURE VALUES LESS THAN (3, MAXVALUE),
  PARTITION p_CATCHALL VALUES LESS THAN (MAXVALUE, MAXVALUE)
)

With the modified design, the same state changes take place over the life of a job, but instead of Pinlater merely updating a value in the row, the row itself moves from one partition to the next as each state transition occurs. When it's time to purge old or expired jobs, we can simply run a series of ALTER TABLE foo DROP PARTITION bar statements. As part of the same maintenance cron, we also repartition each FUTURE partition to create new, empty partitions to replace the non-empty ones that were removed. No more contentious DELETEs, and since dropping a partition actually removes its data file from the filesystem, precious disk space that would otherwise be wasted is reclaimed for use by other job queues. As an additional bonus, since even new jobs are given a finite TTL, if someone creates a queue and then forgets about it, the maintenance script automatically cleans it up.
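A sketch of that maintenance pass, assuming the partition names encode the boundary timestamp plus the TTL in hours (the function and its arguments are hypothetical, not Pinlater's actual code); it emits the DROP PARTITION and REORGANIZE PARTITION statements described above:

```python
from datetime import datetime, timedelta

def purge_and_extend(table, state_name, state_value, now, ttl_hours, expired):
    """Generate the ALTER statements one maintenance pass would run:
    drop partitions whose upper bound has passed, then split the
    FUTURE partition to create a fresh empty one at the TTL horizon."""
    # Drop every partition whose jobs have aged out.
    stmts = [f"ALTER TABLE {table} DROP PARTITION {p}" for p in expired]

    # Carve a new empty partition out of the FUTURE catch-all so
    # incoming rows keep landing in a droppable, time-bounded bucket.
    horizon = now + timedelta(hours=ttl_hours)
    bound = horizon.strftime("%Y-%m-%d %H:00:00")
    name = f"p{state_name}_{horizon.strftime('%Y%m%d%H')}_{ttl_hours}"
    stmts.append(
        f"ALTER TABLE {table} REORGANIZE PARTITION p{state_name}_FUTURE INTO ("
        f"PARTITION {name} VALUES LESS THAN ({state_value}, '{bound}'), "
        f"PARTITION p{state_name}_FUTURE VALUES LESS THAN ({state_value}, MAXVALUE))"
    )
    return stmts
```

REORGANIZE PARTITION on an empty FUTURE partition is cheap, since there are no rows to copy; only partitions past their TTL are ever dropped, so no live jobs are lost.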

Step 3: Profit!

Did it work? Better than I could have expected. In the first half of 2019 we rolled out the new schema, the updated SQL, and the new maintenance job. (The SQL changes were needed to take advantage of MySQL's partition pruning, where the optimizer quickly narrows the search down to only the partition(s) necessary to satisfy the query; to get pruning, every component of the partitioning key must appear in the query.) At the start of this project, 235 MySQL clusters were used for Pinlater. At the end? Less than half that number: 113. Not only that, but backup speed improved, and our disk utilization metrics for these servers were finally useful, since we were no longer counting long-deleted rows. Not bad!
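Concretely, a dequeue-style query now needs the leading partitioning column in its WHERE clause, and EXPLAIN will confirm that only the matching partitions are scanned (the query shape here is illustrative):

```sql
-- Because state leads the partitioning key, this query is pruned
-- to the pNEW_* partitions instead of scanning all of them.
EXPLAIN SELECT local_id
FROM jobs
WHERE state = 0               -- NEW
  AND run_after <= NOW()
LIMIT 100;
```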

Acknowledgements

Thanks to Isabel Tallam and Purajit Malalur from the Core Services team for their work on the requisite code changes to Pinlater and the subsequent testing and deployment.
